Closed bwalsh closed 3 years ago
I think you have identified the key steps in making this workflow AnVIL-ready. I would hope that we would not be moving VCF files around (is this implied by data_client.download?). The "system.file" call does not load a file; it returns a detailed reference to the file, which is installed with the cgdv17 package. We query that file through the tabix index/subsetting/retrieval facilities and never load the whole file. Tabix-indexed VCF can be queried over HTTP, which allows us to query 1000 Genomes VCF in AWS S3 without downloading.
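To illustrate the point about querying without downloading, here is a minimal Python sketch. It assumes the third-party pysam package is installed and was built with remote (HTTP/S3) support, and that the `.tbi` index sits next to the VCF at the same URL. The helper names `tabix_region` and `fetch_remote_records` are hypothetical, and the URL in the usage note is a placeholder, not a real location.

```python
from typing import Iterator


def tabix_region(contig: str, start: int, end: int) -> str:
    """Format a 1-based, inclusive tabix region string, e.g. '17:100-200'."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return f"{contig}:{start}-{end}"


def fetch_remote_records(url: str, contig: str, start: int, end: int) -> Iterator[str]:
    """Yield only the VCF lines overlapping the region.

    Assumption: pysam can open `url` directly and finds the index at
    `url + '.tbi'`. Only the index and the needed blocks are transferred.
    """
    import pysam  # third-party; imported lazily so the helper above works without it

    tbx = pysam.TabixFile(url)
    try:
        yield from tbx.fetch(region=tabix_region(contig, start, end))
    finally:
        tbx.close()
```

A call might look like `for line in fetch_remote_records("https://example.org/sample.vcf.gz", "17", 100, 200): ...`, where the URL stands in for an indexed VCF in an object store.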
@vjcitn thanks for the update
Agreed, the client should take advantage of any available service to load the data.
The updated pseudo-code:
# query to find VCF files associated with program=X, project=Y
auth = Gen3Auth(endpoint, refresh_file=refresh_file)
client = Gen3Submission(endpoint, auth)
graphql = """{
  submitted_genomic_profile(data_format: "VCF", project_id: "DCF-CCLE") { id }
}"""
response = client.query(graphql)
# process
for vcf in response['data']['submitted_genomic_profile']:
    # Load the file from the object store ...
    url = client.get_presigned_url(vcf['id'])
    file = data_client.download(url)
    # ... or connect to a service.
    # ... todo
    # ... process file, create data frame
    # ...
    # ...
    # contribute derived data back to project
    new_file = as.data.frame(ans)
    new_file_id = data_client.upload(new_file)
    # associate the file with the project
    json = { ... }
    client.submit_record("DCF", "CCLE", json)
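The query and presigned-URL steps of the pseudo-code could look roughly like this in Python. This is a sketch under assumptions, not a tested integration: it assumes the gen3 Python SDK is installed, that presigned URLs are obtained through `Gen3File` rather than `Gen3Submission`, that the node exposes an `object_id` field holding the file GUID, and that `get_presigned_url` returns a mapping with a `url` key. The helper names are hypothetical.

```python
from typing import Any, Dict, List


def build_vcf_query(project_id: str, data_format: str = "VCF") -> str:
    """Build the GraphQL query text for VCF records in one project."""
    return (
        "{ submitted_genomic_profile("
        f'data_format: "{data_format}", project_id: "{project_id}"'
        ") { id object_id } }"
    )


def extract_object_ids(response: Dict[str, Any]) -> List[str]:
    """Collect file GUIDs from a query response, skipping records without one."""
    records = response["data"]["submitted_genomic_profile"]
    return [r["object_id"] for r in records if r.get("object_id")]


def presigned_urls(endpoint: str, refresh_file: str, project_id: str) -> List[str]:
    """Query the commons and return one presigned URL per VCF record."""
    # Third-party gen3 SDK, imported lazily so the helpers above run without it.
    from gen3.auth import Gen3Auth
    from gen3.file import Gen3File
    from gen3.submission import Gen3Submission

    auth = Gen3Auth(endpoint, refresh_file=refresh_file)
    response = Gen3Submission(endpoint, auth).query(build_vcf_query(project_id))
    files = Gen3File(endpoint, auth)
    # Assumption: the response of get_presigned_url carries the link under "url".
    return [files.get_presigned_url(guid)["url"] for guid in extract_object_ids(response)]
```

Keeping the query builder and the response parsing separate from the network calls means they can be checked without credentials or a running commons.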
Example use case from https://github.com/vjcitn/bcds#wdl-for-annotating-variants-in-trpv-genes-for-na06985
Expanding on two statements from this example:
Extrapolating a typical example into pseudo-code in the AnVIL/Gen3 context; let's assume it applies to [R, py, js].
Questions:
endpoint and refresh_file: passed to task?
data_client: doesn't exist; at the command line, the Go utility gen3-client fulfills this function.
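Since data_client does not exist, one possible stand-in for its download step is a plain HTTP fetch of the presigned URL. The sketch below uses only the Python standard library; the function name `download` mirrors the pseudo-code and is hypothetical, and this is not how gen3-client is implemented.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> Path:
    """Stream the content at `url` to `dest` without holding it all in memory."""
    target = Path(dest)
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return target
```

A presigned URL already carries its authorization in the query string, so no extra headers should be needed. This still moves the whole file, so it suits the "load from object store" branch rather than the "connect to service" branch.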