broadinstitute / viral-ngs

Viral genomics analysis pipelines

Automate recipes in taxon_filter #7

Closed iljungr closed 9 years ago

dpark01 commented 9 years ago

How close are you to completing most of the necessary steps in the human & contaminant read depletion pipeline? (skipping the novoalign steps that we won't use).

I ask because I just received a set of data files (input, intermediates, and final output) from Kristian for two samples run through his depletion steps, and we can try to compare how close our outputs are to his. In theory, assuming all steps are deterministic, some of these files might look exactly the same run through his steps and our steps, but I think we'd have to be prepared for the possibility that the files may be mostly the same and we'd need to do an analysis of whether it's "close enough". In particular, maybe just compare read counts at each stage (beginning, post-bmtagger, post-blast-and-concatenation, etc.) and see how close they are between his runs and ours.

@iljungr, would you be up for this kind of an analysis project or would you prefer to keep coding new tools and I can find someone else to validate this?

For now, it would involve running each command manually. If the regression test looks good, then I can introduce a pipelining framework that's currently our leading option for stringing this all together in an automated way. This depletion pipeline is probably the best place to start trying that framework, since it's a small, well-defined unit that can be thought of separately from the rest of the analyses.

iljungr commented 9 years ago

Here's what I've done and what remains to complete the read depletion pipeline. You had suggested in your earlier email that the desired steps are something like:

1) bmtagger to pull aside microbial reads for later use.
2) bmtagger to deplete human reads.
3) m-vicuna for duplicate removal.
4) blastn for more thorough human depletion.
5) Add back the microbial reads from step 1, fix anything that needs fixing, and convert to bam.
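To make the intended flow concrete, here is a minimal sketch of that sequence as shell-outs from Python. Every command name, flag, and file name below is a placeholder for illustration, not a final interface:

```python
import subprocess

def run(cmd):
    # Execute one pipeline step; raise immediately if it exits nonzero.
    subprocess.check_call(cmd)

# Placeholder command spellings throughout -- not the real interfaces.
# Step 1: bmtagger vs. a microbial database; set matching reads aside.
run(["taxon_filter.py", "partition_bmtagger",
     "in.1.fastq", "in.2.fastq", "microbial_db", "keep", "rest"])
# Step 2: bmtagger vs. the human database; keep only non-matching reads.
run(["taxon_filter.py", "partition_bmtagger",
     "rest.1.fastq", "rest.2.fastq", "human_db", "discard", "nohuman"])
# Step 3: m-vicuna duplicate removal.
run(["mvicuna", "nohuman.1.fastq", "nohuman.2.fastq", "dedup"])
# Step 4: blastn for a second, more thorough human depletion pass.
run(["taxon_filter.py", "deplete_blastn",
     "dedup.1.fastq", "dedup.2.fastq", "human_blast_db", "clean"])
# Step 5: add back the step-1 microbial reads and convert to bam.
run(["taxon_filter.py", "merge_to_bam", "clean", "keep", "final.bam"])
```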

What is done so far is the partition_bmtagger utility to be used for steps 1 and 2, and the installation of blastn for step 4. I think I can do the rest in a few days except for the following issues:

a) I don't know what to do about installation of m-vicuna. All I have is a path at the Broad -- I can't find anything public. It is compiled C++, not just some wrapper script around Vicuna. For our internal testing I can use the Broad path, but I don't know what to do for Travis or the released product. I think we should contact Xiao and ask about it. Will you do that, or should I?

b) For blastn there is the issue of managing the use of multiple nodes. It should be easy enough to start by implementing the 3 pieces: split, blastn one piece, merge (a rough sketch follows below). I was also thinking of putting in 3 higher-level utilities to bsub the jobs, check the progress of those jobs, and merge the results. We would use those internally and offer them to anyone who has bsub available, but they couldn't be tested on Travis. For an initial test run, that part could be done manually.

c) Of course I have no idea what will come up in the "fix anything that needs fixing" part of step 5.
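Here is the rough sketch of the split and merge pieces mentioned in (b), assuming FASTA-format chunks for the blastn jobs and plain tabular hit lists; the chunk sizes and file-name patterns are arbitrary illustration choices:

```python
def split_fasta(in_fasta, records_per_chunk, out_prefix):
    # Split a FASTA file into fixed-size chunks so each can be
    # blasted as a separate (bsub-able) job.
    chunk, n_records = 0, 0
    out = open("%s.%04d.fasta" % (out_prefix, chunk), "w")
    with open(in_fasta) as f:
        for line in f:
            if line.startswith(">"):
                if n_records and n_records % records_per_chunk == 0:
                    out.close()
                    chunk += 1
                    out = open("%s.%04d.fasta" % (out_prefix, chunk), "w")
                n_records += 1
            out.write(line)
    out.close()
    return chunk + 1  # number of chunk files written

def merge_hits(hit_files, out_file):
    # Concatenate the per-chunk blastn hit lists into a single file.
    with open(out_file, "w") as out:
        for path in hit_files:
            with open(path) as f:
                out.writelines(f)
```

The "blastn one piece" step is then just a single blastn invocation per chunk, which is what the bsub wrapper would submit.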

As to matching the old intermediate steps, we would have to use the old sequence of steps, with novoalign. I don't see any particular problem with this for in-house use; it would test our low-level utilities but not the pipeline. For now, it would test the bmtagger tool and perhaps filter_lastal and trimmomatic if they are part of the same workflow.

I'm fine with working on that if that's the next thing to do.


dpark01 commented 9 years ago

Here's what I have quick answers for:

A. M-Vicuna, I've learned, was a side project that did not get fully developed and released. However, it was developed by Xiao at the Broad while he was in our group, and the IP belongs to the Broad. As such, once I find the C++ source code, I think it is totally fine to just push the code into some subdir of this repository, along with some makefile, and distribute it that way. (it's not even "re"-distributing, since it was never publicly released, and we're the primary owners anyway). So that's how we can deal with that.

B. Don't worry about the management of multiple nodes and the automation of the end-to-end process for blastn. Just write up each component piece separately (split, run, merge). We'll glue it together with the pipelining framework separately. These wrapper scripts will feel far more granular than the other ones, but that's a bit unavoidable.

iljungr commented 9 years ago

The source for M-Vicuna is here: /gsap/garage-viral/viral/analysis/xyang/programs/M-Vicuna/src

One level up is compile.txt, which calls "use .gcc-4.7.2" and then make. That won't work on a Mac. I'll see if it compiles on my Mac.


iljungr commented 9 years ago

It does NOT compile on my Mac (even when I changed the COMPILER variable in src/makefile to my default gcc): it can't find omp.h. Stack Overflow suggests installing gcc-4.9 to get omp.h. How far do we want to take this?


iljungr commented 9 years ago

Actually, why not distribute the binary instead of building from source? We can distribute a Linux binary, and if we can get it to build on some Mac, we can distribute that too, without forcing our end users to install a particular compiler.


dpark01 commented 9 years ago

Sounds good to me. How about we have an InstallMethod which attempts to determine whether the distributed Linux binary executes properly. If not, fall back to an OSX-compiled binary; if that fails too, try to build from source (which should work if they have a more recent GCC/glibc installed, but that will probably be hit-or-miss). The first method (a pre-distributed Linux binary, perhaps even statically linked against libraries if we're paranoid about it) should work most of the time, even on Travis.
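A sketch of that fallback chain, with the caveat that the actual tools.InstallMethod API may want this structured differently, and the probe flag is a guess:

```python
import subprocess

def first_working_binary(candidates):
    # Try each pre-built binary in turn and keep the first one that
    # actually executes on this platform.
    for path in candidates:
        try:
            # Trivial invocation just to prove the binary runs here;
            # "--version" is a placeholder probe, not a known flag.
            subprocess.check_call([path, "--version"])
            return path
        except (OSError, subprocess.CalledProcessError):
            continue
    return None  # caller falls through to a build-from-source method

binary = first_working_binary([
    "binaries/mvicuna-linux64",  # pre-distributed (ideally static) Linux build
    "binaries/mvicuna-osx",      # OSX build, if we manage to produce one
])
```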

Also, most of M-Vicuna is not used by Kristian's recipes... just one of the tools in it. So it's possible that we don't have to bloat this too much.

Danny

dpark01 commented 9 years ago

Ok, after wrapping up M-Vicuna (and it looks like you're already wrapping the noBlastHits_v3 and mergeShuffledFastqSeqs scripts) and exposing them as runnable commands, we probably have most of the component steps done for this pipeline.

I might try to start implementing the pipelining framework soon then.

I also have someone else in our lab who has a few cycles to do the more nitty gritty regression analysis (how "close enough" is our output to Kristian's old output).

Oh, there are also the blastn pieces (split, run, merge, and filter fastqs based on blast hits).
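That last filtering piece is essentially what the noBlastHits_v3 script does. A minimal Python equivalent, assuming tabular blastn output with the query ID in the first column (an assumption about the output format, not a statement about the script itself):

```python
def filter_fastq_by_hits(in_fastq, hits_file, out_fastq):
    # Drop any read whose ID appears in the blastn hit list
    # (tabular output, query ID in the first column -- an assumption).
    with open(hits_file) as f:
        hit_ids = set(line.split("\t")[0] for line in f if line.strip())
    with open(in_fastq) as f, open(out_fastq, "w") as out:
        while True:
            rec = [f.readline() for _ in range(4)]  # 4-line fastq records
            if not rec[0]:
                break
            read_id = rec[0][1:].split()[0]  # strip '@', take first token
            if read_id not in hit_ids:
                out.writelines(rec)
```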

dpark01 commented 9 years ago

Incidentally, who is working on blastn (#24) and samtools (#23), @hannaml or @iljungr?

iljungr commented 9 years ago

I initially created the tools.Tool subclasses for those. I think Hanna created those issues for making some changes to them. I am planning to use the tools in automating taxon_filter.

Irwin

iljungr commented 9 years ago

OK. I will try to define interfaces for all of the runnable commands and submit them before actually implementing them. That way you'll know what's coming as you define the pipelining framework.

iljungr commented 9 years ago

Naming convention for paired-end fastq files:

In most of the pieces of the taxon_filter pipeline, the input is two fastq files representing paired-end reads, and the output includes one or more pairs of fastq files. Requiring the caller to name every input and output file would make the interface cumbersome. Instead, we could require that paired files are always of the form SOMETHING.1.fastq and SOMETHING.2.fastq (or, perhaps, to be more general, SOMETHING1.fastq and SOMETHING2.fastq, without requiring the period before the digit), and just have the user tell us what SOMETHING is.

Here are three options:

1) Have the user name both input files and both (or more) output files. This is the most general, but the most cumbersome.

2) Have the user name both input files but always output .1.fastq and .2.fastq. This is how partition_bmtagger is currently set up. It allows for handling input files that came from somewhere else and don't follow our convention, but then funnels the user into our convention with the output files.

3) Require all input and output files to be .1.fastq and .2.fastq. This makes for the least cumbersome interface. If necessary, the user can rename files or create links before or after the call.

My recommendation is to go with option 3 and do it consistently for all of our tools that work with paired end reads. Opinions?
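Under option 3, the interface would only ever take the SOMETHING prefixes; a trivial sketch of the convention:

```python
def paired_fastqs(prefix):
    # Expand a prefix into the conventional paired-end file names.
    return prefix + ".1.fastq", prefix + ".2.fastq"

in1, in2 = paired_fastqs("sample")             # sample.1.fastq, sample.2.fastq
out1, out2 = paired_fastqs("sample_depleted")  # sample_depleted.1.fastq, ...
```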

dpark01 commented 9 years ago

Regarding the naming of fastq files, let me get back to you on this tomorrow, since this might depend a bit on how I'm going to wrap this all together. So ... maybe do something for now, but don't get too married to it quite yet.

That being said, what do you think of implementing option 1 at the Python method level (i.e., have the partition_bmtagger method take explicitly named filenames for input and output, making no assumptions about naming conventions) but option 3 at the command-line level (i.e., the parser_partition_bmtagger and main_partition_bmtagger methods assume a naming convention for input/output filenames, break them out, and then feed explicitly named files into the call to partition_bmtagger)? This maintains a certain amount of generality / code reusability, but the human interface is potentially cleaner, and we can change it later if we have to.
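A sketch of that division of labor, using hypothetical argument lists (the real partition_bmtagger signature differs, and the parser wiring is simplified):

```python
import argparse

def partition_bmtagger(in1, in2, db, match1, match2, nomatch1, nomatch2):
    # Method level (option 1): every input and output file is named
    # explicitly; no naming convention is assumed here.
    ...

def parser_partition_bmtagger():
    # Command-line level (option 3): the user supplies prefixes only.
    parser = argparse.ArgumentParser()
    for arg in ("in_prefix", "db", "match_prefix", "nomatch_prefix"):
        parser.add_argument(arg)
    return parser

def main_partition_bmtagger(args):
    # Expand each prefix into PREFIX.1.fastq / PREFIX.2.fastq, then feed
    # explicitly named files into the general-purpose method.
    expand = lambda p: (p + ".1.fastq", p + ".2.fastq")
    in1, in2 = expand(args.in_prefix)
    m1, m2 = expand(args.match_prefix)
    n1, n2 = expand(args.nomatch_prefix)
    partition_bmtagger(in1, in2, args.db, m1, m2, n1, n2)
```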

dpark01 commented 9 years ago

Looks like @hannaml's two open branches on samtools and blastn mostly have to do with adding test coverage?