ACED-IDP / gen3_util

Collection of command line tools to interact with a Gen3 instance
MIT License

refactor, use git #77

Closed bwalsh closed 2 months ago

bwalsh commented 4 months ago

This PR:

Known issues:

bwalsh commented 4 months ago

@lbeckman314 Thanks for the well documented review.

Can we review the following tomorrow?

➜ g3t status ⚠️ Error when initial file is added
# max() iterable argument is empty

➜ g3t add date.txt

➜ g3t status ⚠️ Could the META directory automatically be created with the `g3t init` command?
# [Errno 2] No such file or directory: 'META/DocumentReference.ndjson'

The META directory is/should be created with the g3t init command.
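
The `max() iterable argument is empty` message is what Python raises when `max()` is given an empty sequence, which would fit a status check that looks for the most recently changed file before any file has been added. A minimal sketch of a guard, assuming a hypothetical helper named `latest_mtime` (it is not taken from the g3t code base, whose status logic may differ):

```python
from pathlib import Path
from typing import Optional


def latest_mtime(directory: Path, pattern: str = "*") -> Optional[float]:
    """Return the newest modification time under `directory`, or None.

    Hypothetical helper for illustration only.
    """
    if not directory.is_dir():
        # A missing directory (for example META before any metadata exists)
        # is treated like an empty one instead of raising an error.
        return None
    mtimes = (p.stat().st_mtime for p in directory.rglob(pattern) if p.is_file())
    # `default=None` avoids "max() iterable argument is empty".
    return max(mtimes, default=None)
```

A caller can then treat `None` as "nothing to compare yet" and skip the out-of-date check.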

bwalsh commented 4 months ago

@lbeckman314

Also re. Use g3t entirely. "What would a human do?" — most users will expect to use one command and not have to remember to switch between g3t/git. g3t doesn't have to support 100% of all git commands, but rather simply the most common commands expected for interacting with Gen3:

All of that should be in place now no?

bwalsh commented 4 months ago

Re. the 'job still running' message: that is output from the gen3 client lib. Will have to look into how to suppress it.

Re. "Can this output be truncated/removed (or hidden behind a '--verbose' flag)?": for sure.
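
One possible way to keep the `job still running` lines out of the default output is to raise the level of the client library's logger unless a verbose flag is set. A minimal sketch with the standard `logging` module; the logger name `gen3` and the `verbose` parameter are assumptions for illustration, not confirmed names from g3t or the gen3 client:

```python
import logging


def configure_client_logging(verbose: bool, logger_name: str = "gen3") -> logging.Logger:
    """Show INFO records from the client library only in verbose mode.

    `logger_name` is an assumed name; check which logger actually emits the
    messages before relying on it.
    """
    client_logger = logging.getLogger(logger_name)
    client_logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return client_logger
```

Child loggers that do not set their own level inherit this threshold, so one call near CLI start-up would likely be enough if the messages come from that logger hierarchy.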


# TOTAL                   | 1 ⚠️ 'Publishing' message is mixed up with 'job still running' message. Can the latter be hidden behind a '--verbose' flag? 
# Publishing /[2024-05-14 12:22:46,485][   INFO] job still running, waiting for 3 seconds...
# Publishing \[2024-05-14 12:22:49,523][   INFO] {'uid': '0722dd6c-5eab-4e47-b0cf-e8145ee8aa78', 'name': 'fhir-import-export-hduoz', 'status': 'Running'}
# ...
# [2024-05-14 12:24:23,344][   INFO] job still running, waiting for 51.2578125 seconds...
# Publishing /[2024-05-14 12:25:14,640][   INFO] {'uid': '0722dd6c-5eab-4e47-b0cf-e8145ee8aa78', 'name': 'fhir-import-export-hduoz', 'status': 'Completed'}
# [2024-05-14 12:25:14,641][   INFO] Job is finished!
# Published project ⚠️ Can this output be truncated/removed (or hidden behind a '--verbose' flag)?
# {'output': {'output': '{"user":"beckmanl@ohsu.edu","files":[....
lbeckman314 commented 4 months ago

@lbeckman314

Also re. Use g3t entirely. "What would a human do?" — most users will expect to use one command and not have to remember to switch between g3t/git. g3t doesn't have to support 100% of all git commands, but rather simply the most common commands expected for interacting with Gen3:

All of that should be in place now no?

Ah you're entirely right my mistake!

lbeckman314 commented 4 months ago

@lbeckman314 Thanks for the well documented review.

Can we review the following tomorrow?

Absolutely!

bwalsh commented 4 months ago

Renamed the package. Discovered an obscure Python package already called g3t: https://pypi.org/project/g3t/. This does not affect the CLI, which is still named g3t.

quinnwai commented 4 months ago

Testing g3t push

Wanted to share an SSL certificate error I got and resolved when running g3t push, in case @bwalsh you wanted to add it to the docs.

Error

Cannot connect to host aced-training.compbio.ohsu.edu:443 ssl:default [Connect call failed ('192.168.205.2', 443)]

Solution

Check how Python was installed with `which python3`. If Python is in /Library, then you likely downloaded it from the website and will need to update the certificates manually, as the installer doesn't do it for you:

bash "/Applications/Python 3.12/Install Certificates.command"

Alternatively, you can switch to miniconda or brew, which should have the SSL certs pre-installed.
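
To see which certificate locations a given interpreter uses before changing anything, the standard `ssl` module can report them. A small diagnostic sketch; the function name `describe_default_ca_paths` is made up for this example, and a missing bundle is only one possible cause of the error above:

```python
import ssl


def describe_default_ca_paths() -> dict:
    """Report where this Python looks for CA certificates and whether they exist."""
    paths = ssl.get_default_verify_paths()
    return {
        # Locations OpenSSL is configured to use for this interpreter.
        "openssl_cafile": paths.openssl_cafile,
        "openssl_capath": paths.openssl_capath,
        # `cafile` / `capath` are None when the configured location is missing.
        "cafile_found": paths.cafile is not None,
        "capath_found": paths.capath is not None,
    }
```

Calling `describe_default_ca_paths()` with the same `python3` that runs g3t shows whether either location was found; if both are missing, running the certificate install step is a reasonable next thing to try.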

lbeckman314 commented 3 months ago

Overview

Testing importing files from a foreign bucket into Gen3 Commons:

Note: the latest version of g3t is 0.0.4rc7; the tests below used 0.0.4rc5. Can rerun the tests with the latest version if that would be recommended!

Installing ✅

➜ gh pr checkout 77
# Switched to branch 'feature/git'
# Your branch is up to date with 'upstream/feature/git'.
# Already up to date.

➜ pip install -e .

➜ g3t --version
g3t, version 0.0.4rc5

➜ mkdir cbds-gdc_lung
➜ cd cbds-gdc_lung

➜ g3t ping
# msg: 'Configuration OK: Connected using profile:cbds-development'
# endpoint: https://development.compbio.ohsu.edu
# username: beckmanl@ohsu.edu

Creating Project ✅

➜ g3t init cbds-gdc_lung
# Initialized empty Git repository in /Users/beckmanl/code/gen3_util/cbds-gdc_lung/.git/
# [main (root-commit) 1cf4946] initialized
#  6 files changed, 43 insertions(+)
#  create mode 100644 .g3t/README.md
#  create mode 100644 .g3t/config.yaml
#  create mode 100644 .g3t/work/.gitignore
#  create mode 100644 .gitignore
#  create mode 100644 MANIFEST/README.md
#  create mode 100644 META/README.md
# To approve the project, a privileged user must run `g3t projects create` and `g3t collaborator approve --all`

➜ g3t projects create
# msg: OK created /programs/cbds/projects/gdc_lung

➜ g3t collaborator approve --all
# approved:
# - policy_id: programs.cbds.projects.gdc_lung_writer
#   request_id: 609db1cb-7809-4974-9eed-29453a03ed6a
#   status: SIGNED
#   username: beckmanl@ohsu.edu
# - policy_id: programs.cbds.projects.gdc_lung_reader
#   request_id: 09c055da-c911-4048-9fee-22959eb9a28b
#   status: SIGNED
#   username: beckmanl@ohsu.edu
# msg: OK

Adding File ✅

➜ head -n 1 cedar-import-gdc-lung.sh
# g3t add s3://gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai --etag 3a107c3f78107a8104225dc170fa351f --size 9550640 --modified 2024-02-23T12:38:01

➜ g3t add s3://gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai --etag 3a107c3f78107a8104225dc170fa351f --size 9550640 --modified 2024-02-23T12:38:01

➜ g3t status
# DocumentReference.ndjson is out of date. The most recently changed file is MANIFEST/gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai.dvc.  Please run `g3t meta init`
# No data file changes.
# On branch main
# Changes to be committed:
#   (use "git restore --staged <file>..." to unstage)
#         new file:   MANIFEST/gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai.dvc
# 
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#         cedar-import-gdc-lung.sh

➜ g3t meta init
# Generating \cbds-gdc_lung/ResearchStudy/https://aced-idp.org/cbds-gdc_lung|cbds-gdc_lung
# Updated 2 metadata files.
# resources={'summary': {'DocumentReference': 1, 'ResearchStudy': 1}} exceptions=[]

➜ g3t ls
# bucket:
# - did: b1692835-3c3a-535d-bee1-d60002a549cb
#   file_name: _cbds-gdc_lung-20240523-164736_meta.zip
#   indexd_created_date: '2024-05-23T23:47:36.715261'
#   meta: {}
#   urls:
#   - s3://cbds/b1692835-3c3a-535d-bee1-d60002a549cb/_cbds-gdc_lung-20240523-164736_meta.zip
# - did: c3feb34c-600f-5130-96f8-beff199140f5
#   file_name: .g3t/work/cbds-gdc_lung.git.zip
#   indexd_created_date: '2024-05-23T23:47:36.087147'
#   meta: {}
#   urls:
#   - s3://cbds/c3feb34c-600f-5130-96f8-beff199140f5/.g3t/work/cbds-gdc_lung.git.zip
# - did: f7c4213a-4c6d-5087-b1df-6d791645f89f
#   file_name: cbds-gdc_lung_20240523-234747_SNAPSHOT.zip
#   indexd_created_date: '2024-05-23T23:47:47.404704'
#   meta: {}
#   urls:
#   - s3://cbds/f7c4213a-4c6d-5087-b1df-6d791645f89f/cbds-gdc_lung_20240523-234747_SNAPSHOT.zip
# committed: []
# uncommitted:
# - meta:
#     hash: etag
#     no_bucket: false
#   outs:
#   - etag: 3a107c3f78107a8104225dc170fa351f
#     hash: etag
#     is_symlink: false
#     mime: application/octet-stream
#     modified: '2024-02-23T12:38:01+00:00'
#     path: gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai
#     size: 9550640
#     source_url: s3://gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai

➜ g3t commit -m "Add initial GDC Lung file"
# [main f05040e] Add initial GDC Lung file
#  3 files changed, 14 insertions(+)
#  create mode 100644 MANIFEST/gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai.dvc
#  create mode 100644 META/DocumentReference.ndjson
#  create mode 100644 META/ResearchStudy.ndjson

Publishing File ⚠️

➜ g3t push

➜ cat logs/publish.log
# Scanned new: 1, updated: 0 files
# Indexed 1 files.
# Uploading 1 files via gen3
# gen3-client upload-multiple --manifest .g3t/work/manifest-20240523165609.json --profile cbds-development --upload-path /Users/beckmanl/code/gen3_util/cbds-gdc_lung --bucket cbds --numparallel 9
# 2024/05/23 16:56:09 A new version of gen3-client is available! The latest version is 2024.6.0. You are using version 2023.11
# 2024/05/23 16:56:09 Please download the latest gen3-client release from https://github.com/uc-cdis/cdis-data-client/releases/latest
# Notice: this is the upload method which requires the user to provide GUIDs. In this method files will be uploaded to specified GUIDs.
# If your intention is to upload files without pre-existing GUIDs, consider to use "./gen3-client upload" instead.
# 
# 2024/05/23 16:56:09 The file you specified "/Users/beckmanl/code/gen3_util/cbds-gdc_lung/gdc-lung/004f4ac8-5b03-4ad1-87ce-191678a71af7/004f4ac8-5b03-4ad1-87ce-191678a71af7/TCGA-69-8253-10A-01D-A92Q-36.WholeGenome.RP-1657.bai" does not exist locally.
# 
# 
# Submission Results
# Finished with 0 retries | 0
# Finished with 1 retry   | 0
# TOTAL                   | 0
# Published project. See logs/publish.log

➜ cat logs/publish.log | jq

Publish Logs ⚠️

{
  "timestamp": "2024-05-23T23:56:24.756550+00:00",
  "output": {
    "user": null,
    "files": [
      "/root/studies/gdc_lung/commits/b0b51a03-6744-5833-a6ce-6de90e4b2978/DocumentReference.ndjson",
      "/root/studies/gdc_lung/commits/b0b51a03-6744-5833-a6ce-6de90e4b2978/ResearchStudy.ndjson"
    ],
    "logs": [
      "HAS RESOURCE /programs/cbds/projects",
      "CAN CREATE: True",
      "INPUT DATA:  {'method': 'put', 'project_id': 'cbds-gdc_lung', 'push': {'commits': [{'commit_id': 'b0b51a03-6744-5833-a6ce-6de90e4b2978', 'exceptions': None, 'logs': None, 'manifest_sqlite_path': None, 'message': 'From g3t-git', 'meta_path': '_cbds-gdc_lung-20240523-165609_meta.zip', 'object_id': 'b0b51a03-6744-5833-a6ce-6de90e4b2978', 'path': None, 'resource_counts': None}], 'published_job': None, 'published_timestamp': None}}",
      "DOWNLOADED b0b51a03-6744-5833-a6ce-6de90e4b2978 /root/studies/gdc_lung/commits/b0b51a03-6744-5833-a6ce-6de90e4b2978",
      "COMMAND:  ['unzip', '-o', '-j', 'downloads/_cbds-gdc_lung-20240523-165609_meta.zip', '-d', '/root/studies/gdc_lung/commits/b0b51a03-6744-5833-a6ce-6de90e4b2978']",
      "UNZIPPED /root/studies/gdc_lung/commits/b0b51a03-6744-5833-a6ce-6de90e4b2978",
      "Study not Simplified. Simplifying Study...",
      "An Exception Occurred: 'NoneType' object has no attribute 'upper'", ⚠️
      "LOADED gdc_lung",
      "HAS RESOURCE /programs/cbds/projects/gdc_lung",
      "HAS SERVICE read-storage on resource /programs/cbds/projects/gdc_lung",
      "Uploaded /tmp/tmpz4ctjlkq/cbds-gdc_lung_20240523-235619_SNAPSHOT.zip to cbds d1306eba-8c77-53e0-b726-9d2ba0037a69"
    ],
    "snapshot": {
      "object_id": "d1306eba-8c77-53e0-b726-9d2ba0037a69"
    }
  }
}
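
Because the push reported "Published project" even though the server-side log contains an exception, a small check over the log record could surface such lines. A sketch that assumes the record has the shape shown above (`output.logs` as a list of strings) and uses the marker text from this log; `find_exceptions` is a hypothetical helper, not part of g3t:

```python
import json
from typing import List


def find_exceptions(publish_record: dict, marker: str = "An Exception Occurred") -> List[str]:
    """Return server-side log lines that contain `marker`."""
    output = publish_record.get("output") or {}
    logs = output.get("logs") or []
    return [line for line in logs if marker in line]


# Trimmed-down record with the same shape as the log shown above.
sample = json.loads("""
{
  "output": {
    "user": null,
    "logs": [
      "Study not Simplified. Simplifying Study...",
      "An Exception Occurred: 'NoneType' object has no attribute 'upper'",
      "LOADED gdc_lung"
    ]
  }
}
""")

for line in find_exceptions(sample):
    print("WARNING:", line)
```

A non-empty result could be used to print a warning or return a non-zero exit status instead of an unqualified success message.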

Portal Check

Arborist/Profile Page ✅

Explorer Page ⚠️


Let me know if any of the issues above may be due to local misconfigurations. Can also test with latest g3t versions!

lbeckman314 commented 3 months ago

Overview

All tests passing. LGTM! 👍

Testing importing files from a foreign bucket into Gen3 Commons:

Installing ✅

➜ gh pr checkout 77
# Switched to branch 'feature/git'
# Your branch is up to date with 'upstream/feature/git'.
# Already up to date.

➜ pip install -e .

➜ g3t --version
# g3t, version 0.0.4rc8

➜ g3t ping
# msg: 'Configuration OK: Connected using profile:local'
# endpoint: https://aced-training.compbio.ohsu.edu
# username: beckmanl@ohsu.edu

Cleaning/Deleting Project ✅

➜ g3t projects empty --project_id cbds-gdc_lung --confirm empty
# [2024-05-28 12:23:02,160][   INFO] job still running, waiting for 3 seconds...
# [2024-05-28 12:23:05,212][   INFO] {'uid': '4f6bc396-53e8-4bb5-b761-b9056b77184d', 'name': 'fhir-import-export-akqim', 'status': 'Running'}
# ...
# [2024-05-28 12:24:04,782][   INFO] Job is finished!
# output: '{"user":"beckmanl@ohsu.edu","files":[],"logs":["HAS RESOURCE /programs/cbds","HAS
#   RESOURCE /programs/cbds/projects","HAS SERVICE delete on resource /programs/cbds/projects/gdc_lung","EMPTIED
#   graph for cbds-gdc_lung","EMPTIED flat for cbds-gdc_lung","EMPTIED FHIR STORE for
#   cbds-gdc_lung"]}'
# msg: Emptied cbds-gdc_lung

➜ g3t projects rm --project_id cbds-gdc_lung
# endpoint: https://aced-training.compbio.ohsu.edu
# messages:
# - Deleted cbds-gdc_lung
# msg: OK

Creating Project ✅

➜ g3t init cbds-gdc_lung --approve
# msg: OK created /programs/cbds/projects/gdc_lung
# 
# approved:
# - policy_id: programs.cbds.projects.gdc_lung_writer
#   request_id: dda0da6e-9d66-4c49-b774-678fb372803f
#   status: SIGNED
#   username: beckmanl@ohsu.edu
# - policy_id: programs.cbds.projects.gdc_lung_reader
#   request_id: 2160183b-d03b-4810-96f5-2d5c3a627caf
#   status: SIGNED
#   username: beckmanl@ohsu.edu
# msg: OK

Adding Files ✅

➜ cat cedar-import-gdc-lung.sh | sed 's/g3t//' | xargs -L 1  -P 8 g3t

➜ git add MANIFEST/

➜ g3t meta init
# Updated 5 metadata files.
# resources={'summary': {'DocumentReference': 115, 'Specimen': 54, 'ResearchStudy': 1, 'ResearchSubject': 54, 'Patient': 54}} exceptions=[]

➜ g3t commit -m  "Add initial GDC Lung file"
# [main c5361b6] None
#  5 files changed, 278 insertions(+)
#  create mode 100644 META/DocumentReference.ndjson
#  create mode 100644 META/Patient.ndjson
#  create mode 100644 META/ResearchStudy.ndjson
#  create mode 100644 META/ResearchSubject.ndjson
#  create mode 100644 META/Specimen.ndjson
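
The `xargs` fan-out used at the start of this step could also be driven from Python. A rough sketch, assuming the script holds one `g3t add ...` command per line; `parse_g3t_commands` and `run_all` are illustrative names, and whether concurrent `g3t add` calls are safe in general is not established here:

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def parse_g3t_commands(script_text: str, program: str = "g3t") -> List[List[str]]:
    """Turn each `g3t ...` line of a shell script into an argv list."""
    commands = []
    for line in script_text.splitlines():
        # comments=True drops shell comments; blank lines yield an empty list.
        argv = shlex.split(line, comments=True)
        if argv and argv[0] == program:
            commands.append(argv)
    return commands


def run_all(commands: List[List[str]], max_workers: int = 8) -> List[int]:
    """Run the commands concurrently and return exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda argv: subprocess.run(argv).returncode, commands))
```

Collecting the exit codes makes it possible to report which lines failed, which the plain `xargs` pipeline does not show per command.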

Publishing File ✅

➜ g3t push
# Scanned new: 115, updated: 0 files
# Indexed 115 files.
# Checking 115 files for upload via gen3
# No files to upload to gen3 by gen3-client.
# Published project. See logs/publish.log

➜ cat logs/publish.log | jq

Publish Logs ✅

{
  "timestamp": "2024-05-28T19:42:40.352528+00:00",
  "output": {
    "user": "beckmanl@ohsu.edu",
    "files": [
      "/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/ResearchStudy.ndjson",
      "/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/ResearchSubject.ndjson",
      "/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/Specimen.ndjson",
      "/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/DocumentReference.ndjson",
      "/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/Patient.ndjson"
    ],
    "logs": [
      "HAS RESOURCE /programs/cbds",
      "HAS RESOURCE /programs/cbds/projects",
      "HAS SERVICE create on resource /programs/cbds/projects/gdc_lung",
      "CAN CREATE: True",
      "DOWNLOADED ad198f45-3772-5a3a-915f-e5888686c5cb /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb",
      "UNZIPPED /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb",
      "Simplifying study: /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb",
      "Writing to metadata-service",
      "Loaded discovery study cbds-gdc_lung",
      "Loaded <_io.TextIOWrapper name='/root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/extractions/ResearchStudy.ndjson' mode='r' encoding='UTF-8'>",
      "wrote /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/ResearchStudy.ndjson to elasticsearch/fhir",
      "wrote /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/ResearchSubject.ndjson to elasticsearch/fhir",
      "wrote /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/Specimen.ndjson to elasticsearch/fhir",
      "wrote /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/DocumentReference.ndjson to elasticsearch/fhir",
      "wrote /root/studies/gdc_lung/commits/ad198f45-3772-5a3a-915f-e5888686c5cb/Patient.ndjson to elasticsearch/fhir",
      "HAS RESOURCE /programs/cbds",
      "HAS RESOURCE /programs/cbds/projects",
      "HAS SERVICE read-storage on resource /programs/cbds/projects/gdc_lung",
      "wrote studies/gdc_lung/ResearchStudy.ndjson",
      "wrote studies/gdc_lung/ResearchSubject.ndjson",
      "wrote studies/gdc_lung/Specimen.ndjson",
      "wrote studies/gdc_lung/DocumentReference.ndjson",
      "wrote studies/gdc_lung/Patient.ndjson",
      "_get discovery study: {}",
      "Uploaded /tmp/tmpih1hqx1p/cbds-gdc_lung_20240528-194222_SNAPSHOT.zip to data-import-test 8f1f49ef-ddd2-551c-8c01-b629cb38e9ad"
    ],
    "snapshot": {
      "object_id": "8f1f49ef-ddd2-551c-8c01-b629cb38e9ad"
    }
  }
}

Portal Check

Explorer Page ✅

Observation Tab ✅

Correct number of Observations.

Patient Tab ✅

Correct number of Patients.

File Download Page ✅

Selecting the 'Download' button successfully started downloading the file.


Thanks for the update Walsh, everything's looking good!

matthewpeterkort commented 2 months ago

It would be nice to have some sort of foreign key check of reference ids to known data files outside of the supported FHIR DVC data formats.

Or, we should deprecate the server-side meta graph load, so that these errors don't block g3t uploads from users who are unaware that their FHIR dataset is incomplete, since the util no longer really warns them ahead of time.
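
One reading of the foreign key check suggested above is a client-side pass that flags references pointing at resources missing from the dataset. A minimal sketch, assuming relative FHIR references of the form `ResourceType/id` (absolute URLs and contained `#id` references would be reported as dangling here); the function names are hypothetical:

```python
import json
from typing import Dict, Iterable, Iterator, List, Set


def iter_references(node) -> Iterator[str]:
    """Yield every `reference` string found anywhere inside a FHIR resource."""
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str):
            yield ref
        for value in node.values():
            yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def dangling_references(resources: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each resource to the references it makes to unknown resources."""
    resources = list(resources)
    known: Set[str] = {f"{r['resourceType']}/{r['id']}" for r in resources}
    problems: Dict[str, List[str]] = {}
    for r in resources:
        missing = sorted({ref for ref in iter_references(r) if ref not in known})
        if missing:
            problems[f"{r['resourceType']}/{r['id']}"] = missing
    return problems


# Made-up ndjson content, one resource per line.
ndjson = """
{"resourceType": "Patient", "id": "p1"}
{"resourceType": "Specimen", "id": "s1", "subject": {"reference": "Patient/p1"}}
{"resourceType": "DocumentReference", "id": "d1", "subject": {"reference": "Specimen/s2"}}
"""
resources = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
print(dangling_references(resources))
```

Running a check like this before a push would let the user see an incomplete dataset locally rather than through a server-side load error.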

matthewpeterkort commented 2 months ago

Multiple definitions of the exact same function name in different directories can get very confusing to a new set of eyes looking at the code, e.g. the `ls()` function and others.

It would be easier on the eyes to have unique function names across the entire package.
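
A quick way to list such name clashes is to walk the package with the standard `ast` module. A sketch that only looks at top-level functions; `duplicate_function_names` is an illustrative helper and the package path passed to it is up to the caller:

```python
import ast
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def duplicate_function_names(package_root: Path) -> Dict[str, List[str]]:
    """Map each top-level function name defined more than once to its modules."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(package_root.rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in tree.body:  # top-level definitions only
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                seen[node.name].append(str(path.relative_to(package_root)))
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

The result maps each repeated name to the files that define it, which gives a concrete rename list to work from.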