NCBI-Hackathons / Scan2CNV

MIT License

Tentative plan for creating gtc pipeline #8

Closed ekarlins closed 7 years ago

ekarlins commented 7 years ago

Even with an existing R package that has similar functions, a working command line pipeline starting with idats (ideally) or gtc files could make this process a lot easier to run. Wrapping this in a workflow management system like Snakemake, which can submit jobs to an HPC cluster and track I/O, will make parallel processing easy, and run time should be really quick after the initial upfront cost of generating an egt file in GenomeStudio. Luckily, groups dealing with Illumina data will already be comfortable with Illumina files, so generating an egt and using a bpm should be pretty straightforward. Hopefully, after that the pipeline will run with essentially one command.

Another plus for this approach is that if we contact Illumina about our tool, as Ben suggested, they may like that we are using some of their existing software as part of it (like the gtc parsing library). He's right: if they mention our tool, it could mean a lot of exposure. In addition, I think this approach has a clearer implementation than the R approach. And since we need to get most of this project done tomorrow, that's appealing to me.

I have some experience with PennCNV, so I'll start with that. But adding other command line CNV callers should be pretty easy as well. And since they will all run in parallel, it doesn't add much cost (if you have the compute) to run more CNV callers. Since CNV calling in general has a fairly high FP rate, using multiple callers may be the way to go.
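One way the "multiple callers" idea could work is to keep only the calls that a second caller independently supports. The sketch below is not something the thread settles on. It assumes half-open coordinates and a 50% reciprocal-overlap rule, and the `Call` record, the function names, and the second caller name `callerB` are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record for one CNV call; the field names are illustrative only.
Call = namedtuple("Call", "sample chrom start end kind caller")


def reciprocal_overlap(a, b):
    """Overlap length divided by the longer call's length (0.0 if not comparable)."""
    if (a.sample, a.chrom, a.kind) != (b.sample, b.chrom, b.kind):
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)


def supported_calls(calls, min_callers=2, min_overlap=0.5):
    """Keep calls that at least `min_callers` distinct callers agree on."""
    kept = []
    for call in calls:
        callers = {call.caller}
        for other in calls:
            if other.caller != call.caller and reciprocal_overlap(call, other) >= min_overlap:
                callers.add(other.caller)
        if len(callers) >= min_callers:
            kept.append(call)
    return kept


if __name__ == "__main__":
    calls = [
        Call("s1", "chr1", 100, 200, "del", "PennCNV"),
        Call("s1", "chr1", 110, 210, "del", "callerB"),  # hypothetical second caller
        Call("s1", "chr2", 500, 600, "dup", "PennCNV"),  # no support from callerB
    ]
    for call in supported_calls(calls):
        print(call)
```

The pairwise comparison is quadratic in the number of calls, which is fine for a per-sample check but would need an interval index for large cohorts.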

Though a lot of the code will be Python, this will also involve a lot of testing of command line programs: both CNV callers that we download, like PennCNV, and scripts that we write. So knowing how to write Python code is not the only thing needed to get this pipeline working.

Here's a rough idea of what the pipeline would look like:

wine AutoConvert.exe /path/to/idats /path/to/out/dir manifest.bpm clusterFile.egt ##hopefully this works

        |
        |
       \/
  gtc files
        |
        |
       \/

"scripts/gtc2PennCNV.py" ##I've already written this script, tested it, and pushed it to this repo

        |
        |
       \/

run PennCNV ##There will also be generation of a couple of PennCNV-specific reference files, which is another upfront cost.

        |
        |
       \/

There are some existing scripts for post-processing on PennCNV output that we can try if we have time.
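As a rough illustration of the first step in the sketch above, here is how a wrapper might assemble the AutoConvert command and work out which gtc files to expect, which is the kind of input/output bookkeeping a workflow manager needs. The argument order is copied from the sketch. The assumptions are that idats come in `<sample>_Grn.idat` / `<sample>_Red.idat` pairs and that each pair yields `<sample>.gtc`; the function names are hypothetical, and nothing is executed.

```python
import shlex
from pathlib import PurePosixPath


def autoconvert_cmd(idat_dir, out_dir, bpm, egt):
    """Build the AutoConvert call, with arguments in the order used in the sketch."""
    return ["wine", "AutoConvert.exe", str(idat_dir), str(out_dir), str(bpm), str(egt)]


def paired_samples(idat_names):
    """Return sample IDs that have both a green and a red idat file.

    ASSUMPTION: files are named <sample>_Grn.idat and <sample>_Red.idat.
    """
    grn = {n[: -len("_Grn.idat")] for n in idat_names if n.endswith("_Grn.idat")}
    red = {n[: -len("_Red.idat")] for n in idat_names if n.endswith("_Red.idat")}
    return sorted(grn & red)


def gtc_targets(idat_names, out_dir):
    """List the gtc files we expect, ASSUMING one <sample>.gtc per idat pair."""
    return [str(PurePosixPath(out_dir) / f"{s}.gtc") for s in paired_samples(idat_names)]


if __name__ == "__main__":
    cmd = autoconvert_cmd("/path/to/idats", "/path/to/out/dir", "manifest.bpm", "clusterFile.egt")
    print(shlex.join(cmd))  # the command is only printed here, not run
    names = ["a_R01C01_Grn.idat", "a_R01C01_Red.idat", "b_R01C02_Grn.idat"]
    print(gtc_targets(names, "/path/to/out/dir"))
```

Samples with only one of the two idat files are skipped rather than reported, which a real pipeline would probably want to flag.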

ngiangre commented 7 years ago

I agree that having a simple pipeline is more appealing, and having Illumina advocate it is a huge plus. I’m gonna get through what I can tonight and then we can decide tomorrow where to focus our efforts. I think for this project having something that takes advantage of HPC clusters and can scale is imperative, and R unfortunately isn’t the most scalable environment. However, it is great for data science-y/visualization methods, so, like you were saying for downstream things, I think R would come in handy there. Maybe we can make it I/O compatible, i.e. make the Python output loadable into R.
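One simple way to get that I/O compatibility would be to have the Python side write plain tab-separated files with a header row, which R can read directly. This is a minimal sketch only. The column names and the `write_calls_tsv` helper are hypothetical, not a format the pipeline has agreed on.

```python
import csv
import io

# Hypothetical column layout for CNV calls; adjust to whatever the pipeline emits.
FIELDS = ["sample", "chrom", "start", "end", "copy_number"]


def write_calls_tsv(rows, handle):
    """Write CNV call dicts as a tab-separated table with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    rows = [{"sample": "s1", "chrom": "chr1", "start": 100, "end": 200, "copy_number": 1}]
    buf = io.StringIO()  # stands in for an open file handle
    write_calls_tsv(rows, buf)
    print(buf.getvalue(), end="")
```

On the R side, a file written this way (say `cnv_calls.tsv`, a made-up name) should load with `read.delim("cnv_calls.tsv")`.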

ekarlins commented 7 years ago

@ngiangre, Snakemake can take R code directly, though I think behind the scenes it's calling it with the python package rpy2. R code can definitely be part of this "gtc" pipeline if certain steps make the most sense in R. And submitting small R jobs to the cluster is a good way to use R and parallelization, since, as you mention, R itself is not very scalable.

ngiangre commented 7 years ago

@ekarlins, after reviewing Snakemake, PennCNV and gsrc, I think this pipeline is our best shot at ultimately getting a streamlined, testable raw CNV --> CNV calling and downstream analysis workflow.

DCGenomics commented 7 years ago

What do you mean by 'this'? Fwiw, we've built several successful projects based on snakemake.

Cheers!

Ben

ngiangre commented 7 years ago

I’m advocating Snakemake. What I mean is that using Snakemake to tie together the gtc -> PennCNV input file conversion, PennCNV itself, maybe a gsrc CNV calling wrapper, and other command line tools would give us the most streamlined and possibly parallelizable program.

ngiangre commented 7 years ago

I’ve been looking at the Snakemake functionality and it looks really elegant.

DCGenomics commented 7 years ago

Cool.

ekarlins commented 7 years ago

I'll keep this ticket open for general comments about the pipeline. We're going to move forward with this Snakemake pipeline and hopefully add some modules (Snakemake rules) using the R package "gsrc" as well. Let's try to split up features into as many issues as we can so we can clearly divide and conquer. To help get stuff done, I'll assign each issue to whoever I think makes sense to work on it. But feel free to let me know if an issue is not in your comfort zone or if there are other issues you want to work on.

slsevilla commented 7 years ago

https://www.gliffy.com/go/publish/11884987

slsevilla commented 7 years ago

updated

https://www.gliffy.com/go/publish/11884987