galaxyproteomics / tools-galaxyp

Galaxy Tool Shed repositories maintained and developed by the GalaxyP community
MIT License

Proteogenomics workflow #70

Closed chambm closed 7 years ago

chambm commented 7 years ago

Can someone (@PratikDJagtap) point me to the Galaxy-P proteogenomic workflow into which I should integrate my Omicron tools, e.g. CustomProDB and PSM2SAM? I checked the "Published workflows" section of the public Galaxy-P site and it's not there. We can discuss any design considerations for the fused workflow here.

I see there's a "Tool needed" label, which raises the question: why is there no "Workflow needed" label? Pinging @bgruening because I don't know who better to ask. ;)

bgruening commented 7 years ago

For me a workflow is just a higher level abstraction, based on tools. I added this label, but it would be nice to explicitly state which tools are needed :)

Thanks @chambm!!!

jj-umn commented 7 years ago

@chambm I generally build a workflow and test it in Galaxy. All the tools in the workflow should be retrievable from the same toolshed, preferably https://toolshed.g2.bx.psu.edu. To publish the workflow in the toolshed, it should have a repository_dependencies.xml that lists all the tool dependencies. @PratikDJagtap Perhaps you and Getiria can work on a demo workflow.
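For reference, a repository_dependencies.xml is essentially a list of Tool Shed repositories pinned by owner and changeset revision; the name, owner, and revision below are illustrative placeholders, not actual entries for this workflow:

```xml
<?xml version="1.0"?>
<!-- Declares the Tool Shed repositories this workflow repository depends on.
     Each entry pins a repository by owner and changeset revision.
     The values below are placeholders for illustration only. -->
<repositories description="Tools required by the proteogenomics workflow">
    <repository toolshed="https://toolshed.g2.bx.psu.edu"
                name="peptideshaker"
                owner="galaxyp"
                changeset_revision="0123456789ab" />
</repositories>
```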

chambm commented 7 years ago

Silly me, I didn't even search the toolshed for workflows. Actually I've never installed a workflow from the toolshed before. :open_mouth: This looks like a good candidate? https://toolshed.g2.bx.psu.edu/view/galaxyp/proteomics_rnaseq_sap_db_workflow/3a11830963e3

bgruening commented 7 years ago

Or this one? https://github.com/bgruening/galaxytools/tree/master/workflows/glimmer3 We accept PRs here :)

chambm commented 7 years ago

This work is aimed at human (possibly mouse) data, so we don't need to infer annotations. That's a whole other bag of genes!

PratikDJagtap commented 7 years ago

Hello Matt and Bjoern,

I will get back to you with answers by evening today.

Regards, Pratik

jj-umn commented 7 years ago

Here are the workflows we used for the GCC2016 tutorial: https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

bgruening commented 7 years ago

@chambm this was just an example of workflows in git and github with example data and so on :) It's all on the TS as well.

PratikDJagtap commented 7 years ago

Hello Matt,

I will work on workflow components from the workflow that we generated for GCC2016 (mentioned by JJ above). JJ, we can also look at workflows from ABRF2016.

https://github.com/galaxyproteomics/tools-galaxyp/tree/master/workflows/gcc2016_tutorial

We will need a few hours to come up with a hybrid workflow after discussion with the Galaxy-P team. If required, it will also be a good idea to have a telephone / google hangouts session.

Regards, Pratik

chambm commented 7 years ago

This is a well-written tutorial! https://netfiles.umn.edu/users/pjagtap/ABRF%202016/ABRF_2016_SW4_Galaxy_for_Multi-Omics.pdf

I look forward to hearing your ideas for a hybrid. I worry a bit when I look at the GCC workflow's complexity. Galaxy's job failure handling just isn't robust enough (yet) to handle failures in such a complex workflow. It was a pain to rerun subsets of the Omicron workflow when one step failed halfway through the workflow, and that workflow is about a tenth of the size.

PratikDJagtap commented 7 years ago

Hello Matt,

Yes - we will need to run these as sub-workflows, which is what we did for the workshop and what we also recommend users do when running it for their own projects.

As you might be aware, Galaxy also offers the ability to rerun the subsequent steps after a workflow failure (once the failed job's issue is taken care of), so that the user need not start from the beginning.

Looking forward to a hybrid OMicron-GalaxyP workflow!

Regards,

Pratik


chambm commented 7 years ago

I should clarify that it was failures in a dataset collection that caused the problem. Jobs on single input files could be easily rerun. But jobs where only a few files from a collection failed could not be rerun. Yet using collections reduces the history size considerably, e.g. 10-25x fewer history items.

PratikDJagtap commented 7 years ago

Interesting - dataset collections have worked well in our hands. It will be good to exchange notes as we proceed.


chambm commented 7 years ago

Ping @tjgriff1

bgruening commented 7 years ago

> Galaxy's job failure handling just isn't robust enough (yet)

There is a fix in the latest dev that makes this more robust, if I recall correctly.

chambm commented 7 years ago

Pratik and Tim: the Omicron version of CustomProDB and PSM2SAM depend on a few Galaxy data managers to download the genome FASTA, index it with Bowtie, and download gene annotations from UCSC Table Browser (done inside the CustomProDB R script).

Do you want to keep this architecture or look at some alternate implementation? Currently, the user needs to run these data manager "workflows" first, and refresh the local data tables manually because Galaxy doesn't update them automatically yet. I can see either moving all this reference data into the user's history so it can be a proper part of the workflow, or doing these steps in some kind of initialization step.

The Omicron approach uses Docker to get a custom Galaxy flavor running instantly, but it can't include the reference data without making the Docker image gigantic. So I let the user download this reference data from within Galaxy using the data managers, rather than with an initialization script which runs when the Docker container runs. With the script alternative, it will take hours before the Galaxy flavor is running, but when it does run, it's directly ready for the workflow.

PratikDJagtap commented 7 years ago

Hello Matt,

I am copying @getiria-onsongo, @jj-umn, @tmcgowan on this so that they can comment on which of the approaches would work best - or if any alternative would help to make this process more user-friendly.

Regards, Pratik


tjgriff1 commented 7 years ago

Hi - here are my thoughts. When using the Docker implementation, it seems reasonable to continue doing this the way you have in the past -- having the user download the needed reference data after the instance is up and running. That keeps the Docker image more agile. As long as we document the process users need to follow to download the reference data and get it processed correctly for DB generation, it should be an OK way to approach this.

I'm also envisioning that not all users will be dependent on the Docker image to gain access to these workflows. Some users may have their own Galaxy instance already running or be utilizing cloud infrastructure where instances are already in place (e.g. Jetstream). For these users, we can share the workflows they need to make the pipeline run. These workflows can include the steps needed to create the DB via CustomProDB.

In both cases it will be important to have the workflow documented, so users know what they need to do to make it work.

Make sense?


PratikDJagtap commented 7 years ago

My preference would be for the tool to have the option to use prebuilt indices and Shared Data Libraries for reference data.

That is the model that makes the most sense for institutions such as Galaxy Main or MSI that run persistent Galaxy instances.

Retaining the option to build these on the fly makes sense when one is using the Docker model and only using Galaxy as a workflow engine, rather than as a collaboration and data-sharing environment.

JJ

chambm commented 7 years ago

Does Galaxy have the necessary data types for indices to be able to share them in data libraries?

jj-umn commented 7 years ago

No. We use data managers for the index files (fai, 2bit, bowtie, bwa, hisat, etc.), which get added to the .loc files. We symlink the annotation files (GTF, VCF, etc.) for those references in as Shared Data Libraries. These are admin tasks.
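To make the two admin tasks concrete, here's a minimal sketch; the paths and the four-column .loc layout below are illustrative assumptions (the actual column layout varies per .loc file), not a description of a real server:

```python
import os

# 1. A data manager ultimately appends a tab-separated entry to a .loc
#    file. Shown here: a four-column layout (unique id, dbkey, display
#    name, index base path) as used by e.g. bowtie2 index .loc files.
#    All paths are hypothetical examples.
loc_entry = "\t".join(["hg19", "hg19", "Human (hg19)",
                       "/data/genomes/hg19/bowtie2_index/hg19"])
with open("bowtie2_indices.loc", "a") as loc:
    loc.write(loc_entry + "\n")

# 2. Annotation files (GTF, VCF, ...) are symlinked into a directory
#    backing a Shared Data Library, so users can access them without
#    duplicating large files.
src = "/data/annotations/hg19/refGene.gtf"
dst = "library_import/hg19_refGene.gtf"
os.makedirs(os.path.dirname(dst), exist_ok=True)
if not os.path.islink(dst):
    os.symlink(src, dst)
```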

tjgriff1 commented 7 years ago

Good point JJ - for the persistent instances it would be really good to have necessary files in the shared data library. Much easier on the user than downloading and/or uploading large files.


chambm commented 7 years ago

How coupled is SearchGUI/PeptideShaker to the MGF format? If we could keep things as mzML or mz5, preserving nativeID would be a big advantage for tracing downstream IDs back to their source spectra.

PratikDJagtap commented 7 years ago

Hello Matt,

Currently, MGF is the only input format that SearchGUI seems to support. We can open an issue on the SearchGUI and PeptideShaker GitHub site for this.

Regards, Pratik

chambm commented 7 years ago

@jj-umn In the GCC2016 workflow, there is a Select step filtering for `^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$`. This appears to be related to the Mouse pre-pro-B and pro-B FASTQ files, but the proteomic data I have is named like Mo_Tai_iTRAQ_f5. It looks like it's trying to filter on the protein column in the Peptide Shaker output, but my protein column looks like:

```
1 0, 1.503, 1.5436, 49.5419, 5.2189, 5.2784, 5.2833 GLLLYGPPGTGK 2 731.5188 SIYYITGESK
```

So I guess it has something to do with the input FASTA. Is it supposed to select only the non-reference accessions for each PSM?
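To sanity-check my reading of that regex, here's a quick sketch; the PSM rows below are invented examples with hypothetical `pr.B…` accessions, not real report output:

```python
import re

# The Select step's regex from the GCC2016 workflow: keep only PSM rows
# whose protein column (column 2) lists pr.B-style accessions exclusively.
pattern = re.compile(r"^\d+\tpr.B[^\t,]*(, pr.B[^\t,]*)*\t.*$")

# Hypothetical tab-separated PSM Report rows: row number, protein(s), peptide, charge.
rows = [
    "1\tpr.B00123, pr.B00456\tGLLLYGPPGTGK\t2",     # only novel accessions
    "2\tsp|P12345|REF_HUMAN\tSIYYITGESK\t2",        # reference accession
    "3\tpr.B00789, sp|P12345|REF_HUMAN\tELVISK\t3"  # mixed novel + reference
]
novel_only = [r for r in rows if pattern.match(r)]
# -> keeps only row 1; rows 2 and 3 are dropped because any
#    non-pr.B accession in the protein column breaks the match.
```

If that behavior is intended, then yes, it selects PSMs mapping exclusively to the non-reference (RNA-seq-derived) accessions.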

PratikDJagtap commented 7 years ago

Hello Matt,

If this is a filtering step for the PSM Report output from PeptideShaker - then we used the accession numbers from protein FASTA file (pre-pro-B and pro-B) to parse out the peptides from RNAseq data. Do you have accession numbers associated with these peptide identifications?

@jj-umn https://github.com/jj-umn might be able to provide more information.

Regards, Pratik



chambm commented 7 years ago

@jj-umn Is the regex above for selecting only PSMs that map to ONLY non-reference protein sequences?