anvilproject / client-apis

Clients for Python, R, and JavaScript that interact with [Terra, Gen3, Galaxy, others]
Apache License 2.0

Example bioconductor use case #7

Closed bwalsh closed 3 years ago

bwalsh commented 5 years ago

Example use case from https://github.com/vjcitn/bcds#wdl-for-annotating-variants-in-trpv-genes-for-na06985

task doVariantWorkflow {
  command {
    R -e "BiocManager::install('variants', version = '3.9', update=TRUE, ask=FALSE); \
        library('variants'); \
        file <- system.file('vcf', 'NA06985_17.vcf.gz', package = 'cgdv17'); \
        genesym <- c('TRPV1', 'TRPV2', 'TRPV3'); \
        geneid <- select(org.Hs.eg.db, keys=genesym, keytype='SYMBOL', \
                 columns='ENTREZID'); \
        txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene; \
        seqlevelsStyle(txdb) = 'NCBI'; \
        txdb <- keepSeqlevels(txdb, '17'); \
        txbygene = transcriptsBy(txdb, 'gene'); \
        gnrng <- unlist(range(txbygene[geneid[['ENTREZID']]]), use.names=FALSE); \
        names(gnrng) <- geneid[['SYMBOL']]; \
        param <- ScanVcfParam(which = gnrng, info = 'DP', geno = c('GT', 'cPd')); \
        vcf <- readVcf(file, 'hg19', param); \
        ans = locateVariants(vcf, txdb, AllVariants()); \
        table(mcols(ans)[['LOCATION']]); \
                write.csv(as.data.frame(ans), 'trpvar.csv');"
  }
  runtime {
    docker: "bioconductor/devel_core2"
  }
}

Expanding on two statements from this example:

# loads a file from the vcf subdirectory
file <- system.file('vcf', 'NA06985_17.vcf.gz', package = 'cgdv17'); 
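A rough Python analogue of R's `system.file()` is `importlib.resources`, which resolves a path to a data file shipped inside an installed package. As a stand-in for a `cgdv17`-like data package (which has no Python equivalent here), this sketch uses the stdlib `encodings` package:

```python
from importlib.resources import files

# Resolve a file that ships inside an installed package, without hard-coding
# an install path. 'encodings'/'aliases.py' is a stdlib example standing in
# for something like cgdv17's 'vcf/NA06985_17.vcf.gz'.
resource = files("encodings").joinpath("aliases.py")
print(resource.is_file())  # prints True
```

As with `system.file()`, this only returns a reference to the file; nothing is read until the caller opens it.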

Extrapolating a typical example in the AnVIL/Gen3 context as pseudo-code; let's assume it applies for [R, py, js]:

#  query to find vcf files associated with program=X, project=Y, 
auth = Gen3Auth(endpoint, refresh_file=refresh_file)
client = Gen3Submission(endpoint, auth)

graphql = """
{
  submitted_genomic_profile(data_format: "VCF", project_id: "DCF-CCLE") { id }
}
"""
response = client.query(graphql)
# process each matching VCF
for vcf in response['data']['submitted_genomic_profile']:
  url = client.get_presigned_url(vcf['id'])
  file = data_client.download(url)
  # .... process file, create data frame
  # ...
  # ...
  # contribute derived data back to project
  new_file = as.data.frame(ans)  # (R) build a data frame from the annotation result
  new_file_id = data_client.upload(new_file)
  # associate the file into the project
  json = { ... }
  client.submit_record("DCF", "CCLE", json)
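One detail worth pinning down: in the gen3 Python SDK, `Gen3Submission.query()` takes the GraphQL query as a plain string. A minimal, hedged sketch of building that string (the `vcf_query` helper is hypothetical, not part of the SDK):

```python
def vcf_query(data_format, project_id):
    """Hypothetical helper: build the GraphQL query string used above."""
    return (
        '{ submitted_genomic_profile('
        f'data_format: "{data_format}", project_id: "{project_id}"'
        ') { id } }'
    )

query = vcf_query("VCF", "DCF-CCLE")
print(query)
```

The resulting string would then be passed to `client.query(query)`.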

questions

vjcitn commented 5 years ago

I think you have identified key steps in making this workflow anvil-ready. I would hope that we would not be moving VCF around (is this implied by data.client.download?) The "system.file" call does not load a file, but it does return a detailed reference to the file (which is installed with the cgdv17 package). We query that file through the tabix index/subsetting/retrieval facilities and never load the whole file. Tabix-indexed VCF can be queried over http and this allows us to query 1000 genomes VCF in AWS S3 without downloading.
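The "query without downloading" behavior rests on HTTP range requests: the tabix (`.tbi`) index maps genomic coordinates to byte offsets in the bgzipped VCF, and the client fetches only those byte ranges. A minimal sketch of the mechanism with the stdlib (the URL and byte offsets are placeholders, and a real tool like htslib computes the offsets from the index):

```python
import urllib.request

# Hypothetical remote location of a tabix-indexed VCF.
url = "https://example.org/NA06985_17.vcf.gz"

# Request only a byte slice of the file; a real client would derive the
# offsets from the .tbi index for the genomic region of interest.
req = urllib.request.Request(url, headers={"Range": "bytes=0-65535"})

# urlopen(req) would return a 206 Partial Content response containing just
# that slice; here we only show the request construction.
print(req.get_header("Range"))  # prints bytes=0-65535
```

This is why the whole-file `data_client.download(url)` step in the pseudo-code could often be replaced by indexed, in-place queries.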

bwalsh commented 5 years ago

@vjcitn thanks for the update

Agreed; the client should take advantage of any available service to load (or query in place) the data.

The updated pseudo-code:

#  query to find vcf files associated with program=X, project=Y, 
auth = Gen3Auth(endpoint, refresh_file=refresh_file)
client = Gen3Submission(endpoint, auth)

graphql = """
{
  submitted_genomic_profile(data_format: "VCF", project_id: "DCF-CCLE") { id }
}
"""
response = client.query(graphql)
# process each matching VCF
for vcf in response['data']['submitted_genomic_profile']:
  # Load the file from the object store ...
  url = client.get_presigned_url(vcf['id'])
  file = data_client.download(url)
  # ... or connect to a service that queries it in place.
  # ...  todo
  # .... process file, create data frame
  # ...
  # ...
  # contribute derived data back to project
  new_file = as.data.frame(ans)  # (R) build a data frame from the annotation result
  new_file_id = data_client.upload(new_file)
  # associate the file into the project
  json = { ... }
  client.submit_record("DCF", "CCLE", json)
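To exercise the control flow above without a live Gen3 endpoint, here is a hedged sketch against stand-in clients. `StubSubmission` and `StubDataClient` are hypothetical stubs that mirror the pseudo-code's surface (in the actual gen3 SDK, presigned URLs come from `Gen3File`, not `Gen3Submission`):

```python
class StubSubmission:
    """Hypothetical stand-in for the submission client in the pseudo-code."""
    def query(self, graphql):
        # Canned response shaped like the GraphQL result above.
        return {"data": {"submitted_genomic_profile": [{"id": "vcf-1"}, {"id": "vcf-2"}]}}

    def get_presigned_url(self, guid):
        return f"https://example.org/signed/{guid}"  # placeholder URL

    def submit_record(self, program, project, record):
        return {"program": program, "project": project, "record": record}


class StubDataClient:
    """Hypothetical stand-in for the data transfer client."""
    def download(self, url):
        return f"contents-of:{url}"

    def upload(self, data):
        return "derived-file-id"


client = StubSubmission()
data_client = StubDataClient()

query = '{ submitted_genomic_profile(data_format: "VCF", project_id: "DCF-CCLE") { id } }'
response = client.query(query)

submitted = []
for vcf in response["data"]["submitted_genomic_profile"]:
    url = client.get_presigned_url(vcf["id"])
    file = data_client.download(url)
    # ... process file, derive a data frame (elided) ...
    new_file_id = data_client.upload(file)
    # associate the derived file back into the project
    record = {"derived_from": vcf["id"], "file_id": new_file_id}
    submitted.append(client.submit_record("DCF", "CCLE", record))

print(len(submitted))  # prints 2, one record per VCF
```

Swapping the stubs for real `Gen3Submission`/`Gen3File` clients (plus real processing) would make this the concrete workflow.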