Publish to cloud tooling providers like Dockstore, AnVIL, etc

whaleyr commented 3 months ago

We want to make PharmCAT easily available on cloud genomics analysis platforms. We already publish a Docker image to Docker Hub so it should be relatively easy to make that image available to different cloud providers. For example, we want to enable access from AnVIL.

After doing some research it seems the best route is publishing a workflow through Dockstore. This will make it available through AnVIL but also DNAStack, DNAnexus, and others.

Current questions are:

How do we get this integrated with our release process?
How do we get stats on usage?
How can we test to ensure availability and success on all the downstream platforms?

markwoon commented 1 month ago

Right now, the PharmCAT-Pipeline workflow only works on a single VCF file and does not handle outside call files. The pipeline script relies on naming conventions for files in the same directory, but this doesn't work because there's no concept of a directory in the cloud.

Part 1: accept an outside_call_file parameter for an outside call file. This will limit us to single-sample VCFs.

check that this file is using the same basename as the VCF and uses .outside.tsv suffix
if yes, move to data dir
if no, move to data dir and rename
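
A minimal sketch of the Part 1 check, assuming the VCF basename is the file name with a .vcf, .vcf.gz, or .vcf.bgz extension removed (the pipeline docs are the authority on that). The function names and the data_dir argument are hypothetical, not part of the pipeline script. It copies rather than moves, since cloud-localized inputs are often read-only; copying to the expected name covers both the "yes" and "no" branches.

```python
import shutil
from pathlib import Path

# Assumed VCF extensions; check the pipeline docs for the authoritative list.
VCF_SUFFIXES = (".vcf.bgz", ".vcf.gz", ".vcf")
OUTSIDE_SUFFIX = ".outside.tsv"


def vcf_basename(vcf_path: Path) -> str:
    """Return the VCF file name with its (assumed) VCF extension removed."""
    name = vcf_path.name
    for suffix in VCF_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"not a recognized VCF file name: {name}")


def stage_outside_call_file(vcf_path: Path, outside_path: Path, data_dir: Path) -> Path:
    """Copy the outside call file into data_dir under the name the VCF implies.

    If the file already follows the convention, the name is unchanged;
    otherwise the copy is effectively a rename.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / (vcf_basename(vcf_path) + OUTSIDE_SUFFIX)
    shutil.copy2(outside_path, target)
    return target
```

For example, staging calls.tsv next to sample1.vcf.gz would produce sample1.outside.tsv in the data directory.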

Part 2: accept a file parameter

if vcf file, proceed as normal
if compressed file, uncompress and copy contents to data directory (will have to abide by file naming conventions)
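
The Part 2 branching could look roughly like this. It is only a sketch: it assumes "compressed file" means a zip or tar archive (a .vcf.gz is treated as a plain VCF), and stage_input and data_dir are hypothetical names. File naming conventions inside the archive are not checked here.

```python
import shutil
import tarfile
import zipfile
from pathlib import Path

# Assumed VCF extensions; a gzipped VCF counts as a VCF, not as an archive.
VCF_SUFFIXES = (".vcf", ".vcf.gz", ".vcf.bgz")


def stage_input(file_path: Path, data_dir: Path) -> list[Path]:
    """Place the input in data_dir and return the files that ended up there."""
    data_dir.mkdir(parents=True, exist_ok=True)
    if file_path.name.endswith(VCF_SUFFIXES):
        # Plain VCF: proceed as normal.
        target = data_dir / file_path.name
        shutil.copy2(file_path, target)
        return [target]
    # Archive: unpack everything. Only do this with archives you trust.
    if zipfile.is_zipfile(file_path):
        with zipfile.ZipFile(file_path) as archive:
            archive.extractall(data_dir)
    elif tarfile.is_tarfile(file_path):
        with tarfile.open(file_path) as archive:
            archive.extractall(data_dir)
    else:
        raise ValueError(f"neither a VCF nor a supported archive: {file_path.name}")
    return sorted(p for p in data_dir.rglob("*") if p.is_file())
```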

Users can provide either vcf_file and/or outside_call_file, OR the file parameter, but not both (maybe allow outside_call_file if file points to a VCF file? But that might just be an extra complication).
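
As a sketch of that rule, the check below rejects mixing file with the other two parameters and requires at least one input. The function name and the returned mode labels are made up for illustration; the stricter variant (no outside_call_file alongside file) is the one shown.

```python
from typing import Optional


def input_mode(
    vcf_file: Optional[str],
    outside_call_file: Optional[str],
    file: Optional[str],
) -> str:
    """Return "file" or "separate" depending on which inputs were given."""
    if file is not None:
        if vcf_file is not None or outside_call_file is not None:
            raise ValueError(
                "provide either file, or vcf_file/outside_call_file, not both"
            )
        return "file"
    if vcf_file is None and outside_call_file is None:
        raise ValueError("no input provided")
    return "separate"
```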

Documentation tasks:

Update docs for WDL to deal with all these cases.
Should link to pipeline script's docs on file naming conventions
Provide an example download that does this

Super bonus part 3: do we want to support URLs in addition to files? There is probably no good reason to do so from the user perspective, but it does mean that we can write tests for the WDL and it will get tested automatically (I think).

markwoon commented 1 month ago

Note: I messed up the Dockstore integration on the last release. It should be fixed for the next release though.

AndreRico commented 1 month ago

Proposal for Aligning and Simplifying the PharmCAT Pipeline

To simplify the maintenance of the PharmCAT_Pipeline and ensure it remains robust, I propose we keep the WDL focused on its core functionality of processing a single VCF file at a time. By doing this, we maintain clarity and ease of maintenance in the WDL itself, while offloading the complexity of file management to earlier workflow steps.

For handling issues like multiple files, compressed formats, and file naming conventions, we can delegate these tasks to upstream workflows within Terra or AnVIL. These workflows can manage tasks such as:

Decompressing files if needed.
Mapping multiple files for individual processing.
Renaming or organizing files according to required naming conventions.

By leveraging Terra and AnVIL’s ability to orchestrate custom workflows, users can create preprocessing steps that handle file management and preparation before invoking the PharmCAT_Pipeline for each individual file. This modular approach keeps the pipeline clean and focused while allowing flexibility for diverse file formats and workflows.
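
To illustrate the "mapping multiple files for individual processing" step, here is a rough sketch of how an upstream step might pair each VCF with its outside call file before invoking the pipeline once per sample. It assumes pairing by shared basename with a .outside.tsv suffix; plan_jobs and the dictionary keys are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed naming conventions; see the pipeline docs for the real rules.
VCF_SUFFIXES = (".vcf", ".vcf.gz", ".vcf.bgz")
OUTSIDE_SUFFIX = ".outside.tsv"


def plan_jobs(paths: list[str]) -> list[dict[str, Optional[str]]]:
    """Group paths (local or gs:// style) into one job per VCF basename."""
    vcfs: dict[str, str] = {}
    outside: dict[str, str] = {}
    for path in paths:
        name = PurePosixPath(path).name
        if name.endswith(OUTSIDE_SUFFIX):
            outside[name[: -len(OUTSIDE_SUFFIX)]] = path
            continue
        for suffix in VCF_SUFFIXES:
            if name.endswith(suffix):
                vcfs[name[: -len(suffix)]] = path
                break
        # Anything else (README files, indexes, ...) is ignored.
    return [
        {"sample": base, "vcf": vcfs[base], "outside_call_file": outside.get(base)}
        for base in sorted(vcfs)
    ]
```

Each returned entry would then correspond to one call of the single-VCF workflow.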

Next Suggested Steps:

Use Case Simulations: We can simulate a few use cases involving multiple files, compressed files, and naming conventions. Then, we’ll build workflows that manage these tasks before calling the PharmCAT_Pipeline. This will ensure the process is flexible and can handle different scenarios.

Comprehensive Documentation: We should document these workflows to guide users on how to set up file preprocessing workflows in Terra or AnVIL. This documentation will include examples of how to manage files and call them in the WDL one by one.

This modular approach will reduce the complexity within the pipeline itself, delegating file handling logic to other parts of the workflow, which simplifies both maintenance and usability across multiple platforms.

markwoon commented 1 month ago

Details on file inputs: https://pharmcat.org/using/Running-PharmCAT-Pipeline/#inputs

BinglanLi commented 1 month ago

This is the link to the PharmCAT tutorial. It includes some real-world VCFs and outside call files.

AndreRico commented 2 weeks ago

Hi all, apologies for the delay! I took some time to dive deeper into the PharmCAT_Pipeline code, and it’s clear that it isn’t fully optimized for cloud environments. You had mentioned this issue before, but it really hit home after reviewing the code more closely.

I’m currently working on creating individual WDLs for each of the 4 modules, trying to replicate the logic of the PharmCAT_Pipeline in AnVIL. I’m not entirely sure if we’ll be able to replicate it 100%, but I do think having these modules separated could be valuable for future use cases.

That said, what do you think about developing a version of PharmCAT_Pipeline specifically designed to work in cloud environments?

markwoon commented 2 weeks ago

Yes, the pipeline script is meant as a very simple wrapper around our main tools.

Using it was the quickest way to get going in Dockstore. You're welcome to create a better WDL script, but let's review because maybe we can then enable more functionality.

AndreRico commented 1 week ago

@markwoon, I created a new WDL (https://dockstore.org/workflows/github.com/AndreRico/PharmCAT_Dockstore/PharmCAT-VCF_Preprocessor:main?tab=files) with two tasks: one to convert the cloud environment into a Path environment, and a second to receive this path environment and run the vcf-preprocessor. I ran some tests using a txt file pointing to Google Cloud Storage, but I will need help testing other functionality of the VCF-Preprocessor. I believe we can replicate this for the full pipeline by adding this conversion task before calling the PharmCAT-Pipeline. I will keep you informed of the progress.
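
For readers unfamiliar with the "txt file pointing to Google Cloud Storage" idea, a sketch of reading such a manifest might look like this. The format is an assumption (one gs:// URI per line, blank lines and # comments skipped); the actual WDL may expect something different, and read_manifest is a hypothetical name.

```python
def read_manifest(text: str) -> list[str]:
    """Return the gs:// URIs listed in a manifest, one per line."""
    uris = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith("gs://"):
            raise ValueError(f"line {line_number}: expected a gs:// URI, got {line!r}")
        uris.append(line)
    return uris
```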

markwoon commented 1 week ago

"convert the cloud environment into a Path environment"

I assume this is cloud_reader.wdl. I'm not sure I understand why this is necessary. If the files are already in the cloud, then you can pass them directly to the WDL. We just need to accept a file array so users can select multiple files.

On the other hand, now that I'm thinking of this, this would also resolve the original problems I had with the PharmCAT_Pipeline.wdl...

PharmGKB / PharmCAT

Publish to cloud tooling providers like Dockstore, AnVIL, etc #188