Open whaleyr opened 3 months ago
Right now, the PharmCAT-Pipeline workflow only works on a single VCF file and does not handle an outside call file. The pipeline script relies on file naming conventions within a single directory, but this doesn't work in the cloud because there is no concept of a directory there.
Part 1: accept an `outside_call_file` parameter for an outside call file. This will limit us to single-sample VCFs. Outside call files use the `.outside.tsv` suffix.

Part 2: accept a `file` parameter. Users can either provide `vcf_file` and/or `outside_call_file`, OR the `file` parameter (maybe allow `outside_call_file` if `file` points to a VCF file? But that might just be extra complication).
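A minimal sketch of how the two proposed inputs might look in the WDL. The input names follow the proposal above; the workflow name and dispatch comment are assumptions for illustration, not the actual pipeline script:

```wdl
version 1.0

workflow pharmcat_pipeline {
  input {
    # Part 1: a single-sample VCF plus an optional outside call file
    File? vcf_file
    File? outside_call_file
    # Part 2: a generic input; its type would be inferred from the
    # suffix (e.g. `.outside.tsv` for outside call files)
    File? file
  }
  # validation/dispatch logic would go here; at least one of
  # vcf_file or file must be provided
}
```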
Documentation tasks:
Super bonus part 3: do we want to support URLs in addition to files? There is probably no good reason to do so from the user perspective, but it does mean that we can write tests for the WDL and it will get tested automatically (I think).
Note: I messed up Dockstore integration in the last release. It should be fixed for the next release, though.
Proposal for Aligning and Simplifying the PharmCAT Pipeline
To simplify the maintenance of the PharmCAT_Pipeline and ensure it remains robust, I propose we keep the WDL focused on its core functionality of processing a single VCF file at a time. By doing this, we maintain clarity and ease of maintenance in the WDL itself, while offloading the complexity of file management to earlier workflow steps.
For handling issues like multiple files, compressed formats, and file naming conventions, we can delegate these tasks to upstream workflows within Terra or AnVIL.
By leveraging Terra and AnVIL’s ability to orchestrate custom workflows, users can create preprocessing steps that handle file management and preparation before invoking the PharmCAT_Pipeline for each individual file. This modular approach keeps the pipeline clean and focused while allowing flexibility for diverse file formats and workflows.
Next Suggested Steps:
Use Case Simulations: We can simulate a few use cases involving multiple files, compressed files, and naming conventions. Then, we’ll build workflows that manage these tasks before calling the PharmCAT_Pipeline. This will ensure the process is flexible and can handle different scenarios.
Comprehensive Documentation: We should document these workflows to guide users on how to set up file preprocessing workflows in Terra or AnVIL. This documentation will include examples of how to manage files and call them in the WDL one by one.
This modular approach will reduce the complexity within the pipeline itself, delegating file handling logic to other parts of the workflow, which simplifies both maintenance and usability across multiple platforms.
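As a rough illustration of this modular approach, an upstream Terra/AnVIL workflow could scatter over an array of staged files and invoke the pipeline once per VCF. The import path and the `pharmcat_pipeline` workflow name are assumptions for this sketch:

```wdl
version 1.0

# hypothetical path to the single-file pipeline WDL
import "pharmcat_pipeline.wdl" as pharmcat

workflow preprocess_and_run {
  input {
    # files already staged in cloud storage (e.g. gs:// URIs in Terra)
    Array[File] vcf_files
  }

  # run the single-file pipeline once per input VCF
  scatter (vcf in vcf_files) {
    call pharmcat.pharmcat_pipeline {
      input:
        vcf_file = vcf
    }
  }
}
```

Any decompression or renaming steps would slot in as additional tasks before the scatter, keeping the PharmCAT_Pipeline itself unchanged.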
Details on file inputs: https://pharmcat.org/using/Running-PharmCAT-Pipeline/#inputs
This is the link to the PharmCAT tutorial. It includes some real-world VCFs and outside call files.
Hi all, apologies for the delay! I took some time to dive deeper into the PharmCAT_Pipeline code, and it’s clear that it isn’t fully optimized for cloud environments. You had mentioned this issue before, but it really hit home after reviewing the code more closely.
I’m currently working on creating individual WDLs for each of the 4 modules, trying to replicate the logic of the PharmCAT_Pipeline in AnVIL. I’m not entirely sure if we’ll be able to replicate it 100%, but I do think having these modules separated could be valuable for future use cases.
That said, what do you think about developing a version of PharmCAT_Pipeline specifically designed to work in cloud environments?
Yes, the pipeline script is meant as a very simple wrapper around our main tools.
Using it was the quickest way to get going in Dockstore. You're welcome to create a better WDL script, but let's review it together, since we may then be able to enable more functionality.
@markwoon, I created a new WDL https://dockstore.org/workflows/github.com/AndreRico/PharmCAT_Dockstore/PharmCAT-VCF_Preprocessor:main?tab=files with two tasks: one to convert the cloud environment into a Path environment, and a second to receive this path environment and run the vcf-preprocessor. I conducted some tests using a txt file pointing to Google Cloud Storage, but I will need help testing the other functionality of the VCF Preprocessor. I believe we can replicate this for the full pipeline by adding this conversion task before calling the PharmCAT-Pipeline. I will keep you informed of the progress.
> convert the cloud environment into a Path environment

I assume this is `cloud_reader.wdl`. I'm not sure I understand why this is necessary. If the files are already in the cloud, then you can pass them directly to the WDL. We just need to accept a file array and users can select multiple files.
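Accepting a file array could be as simple as changing the input declarations. This is a sketch; the actual input names in PharmCAT_Pipeline.wdl may differ:

```wdl
version 1.0

workflow pharmcat_pipeline {
  input {
    # instead of a single `File vcf_file`, accept multiple files;
    # users can then select several files at once in Terra/AnVIL
    Array[File] vcf_files
    # optional matching outside call files, empty by default
    Array[File] outside_call_files = []
  }
}
```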
On the other hand, now that I'm thinking of this, this would also resolve the original problems I had with the PharmCAT_Pipeline.wdl
...
We want to make PharmCAT easily available on cloud genomics analysis platforms. We already publish a Docker image to Docker Hub so it should be relatively easy to make that image available to different cloud providers. For example, we want to enable access from AnVIL.
After doing some research it seems the best route is publishing a workflow through Dockstore. This will make it available through AnVIL but also DNAStack, DNAnexus, and others.
Current questions are: