broadinstitute / viral-ngs

Viral genomics analysis pipelines

tolerate single-end reads throughout pipeline #238

Open dpark01 opened 8 years ago

dpark01 commented 8 years ago

Throughout the pipeline, there are places that assume paired-end reads, and the problem is mostly in how we wrap the underlying tools, probably not the underlying tools themselves. For example, BAM->fastq conversion for bmtagger, Trinity, subsampling, etc. all assumes a pair of fastqs in the output. The final alignment and refine steps (Novoalign, Picard, GATK) are likely fine, since they are entirely BAM-based; it's the steps with fastq intermediates that are the most likely to be broken. I think we even have a place where we explicitly toss out unpaired reads; we may need to make that an optional boolean flag.

We should tolerate BAMs with single-end reads throughout as well as BAMs that have a mixture of both single and paired end reads. It would make analyses in the field (where sequencing runs might fail halfway through) more robust.
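One way to relax the paired-output assumption is to partition reads at conversion time: paired reads go to the two mate fastqs and any unpaired reads go to a third output instead of being discarded. A minimal sketch of that routing logic, using plain tuples as stand-ins for BAM records (this is hypothetical illustration, not the actual viral-ngs conversion code, which would operate on pysam/Picard records):

```python
# Hypothetical sketch: route reads from a mixed BAM into paired/unpaired bins.
# Each "read" is modeled as (name, is_paired, is_read1); real code would use
# pysam.AlignedSegment objects and write fastq files instead of lists.

def partition_reads(reads):
    """Split reads into (read1s, read2s, unpaired) so downstream tools that
    require two fastqs still work, while unpaired reads survive the step."""
    read1s, read2s, unpaired = [], [], []
    for name, is_paired, is_read1 in reads:
        if not is_paired:
            unpaired.append(name)
        elif is_read1:
            read1s.append(name)
        else:
            read2s.append(name)
    return read1s, read2s, unpaired

reads = [("a", True, True), ("a", True, False), ("b", False, False)]
r1, r2, un = partition_reads(reads)
print(r1, r2, un)  # ['a'] ['a'] ['b']
```

Tools that genuinely cannot accept a third fastq could then be given an explicit choice between dropping the unpaired bin and processing it in a second pass.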

This is a lower priority enhancement for later down the road.

dpark01 commented 7 years ago

@biocyberman I'm a bit surprised that you ran into issues with unpaired read data on the refine_assembly steps (but not at all surprised that there were errors on deplete_human). To better assist, do you have some (publicly shareable) example read data? Perhaps a small BAM file of reads that we could use for debugging. We can generate fake data ourselves pretty easily if you don't have something readily available, but it doesn't hurt to try something more realistic.

dpark01 commented 7 years ago

Hm... looks like https://github.com/broadinstitute/viral-ngs/blob/master/tools/novoalign.py#L180 ought to be changed from BAMPE to BAM. I think this may be a somewhat recent addition to Novoalign and greatly simplifies its use on mixed data (it doesn't seem to exist in the v3.00 manual). That should solve the refine_assembly bits. As far as I can guess, all of the remaining problems with single-end data should be limited to the stuff in taxon_filter.py.
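For the novoalign wrapper, the relevant change is just which `-F` value gets passed. A hedged sketch of how the flag could be chosen (`build_novoalign_format` is a hypothetical helper, not viral-ngs code; per the comment above, newer Novoalign versions accept `-F BAM` for mixed data, so always passing `BAM` may be the simpler fix):

```python
# Hypothetical helper: pick the Novoalign input-format flag.
# "-F BAMPE" expects strictly paired BAM input, while "-F BAM" (in newer
# Novoalign versions) also tolerates mixed or unpaired reads.

def build_novoalign_format(has_unpaired_reads):
    # Prefer plain BAM whenever any unpaired reads are present.
    return ["-F", "BAM" if has_unpaired_reads else "BAMPE"]

print(build_novoalign_format(True))   # ['-F', 'BAM']
print(build_novoalign_format(False))  # ['-F', 'BAMPE']
```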

@biocyberman if you have example data, we can make the change and codify a unit test around it.

dpark01 commented 7 years ago

I've made some changes to the dp-unpaired branch that should address issues in the lastal filtering step as well as the refine_assembly step. This does not fix any of the bmtagger/blast human depletion steps yet. It is also untested, pending some test case data.

biocyberman commented 7 years ago

@dpark01 Here are some ion torrent files. I will leave them online for about a week: https://drive.google.com/drive/u/1/folders/0ByYppRKDpYO_WGlrV3FRR3l3SDA

I will be able to code some parts of the adjustment to accommodate unpaired reads. How should we coordinate so that we don't duplicate each other's work?

By the way, the kraken and probably diamond subcommands of metagenome.py also need to be adjusted, and assembly.py as well. The assembly.py one is off the top of my head, because I couldn't use it last time; I had to run Trinity independently.

dpark01 commented 7 years ago

I wouldn't worry too much about duplicated effort: it sounds like you need to hack it quickly, and I want it to work for the general case. The main thing would be to flag the steps that are problematic for single-end reads as you find them. For example, I'm surprised you ran into any problems in assembly.py. Where in assembly.py?

dpark01 commented 7 years ago

Oh, sorry, I realize you said it was the Trinity step. I'll try to look at that more closely.

biocyberman commented 7 years ago

I will be happy to flag issues :) At the same time, you have the sample data, so you can see the problems easily by running it through the pipeline.

biocyberman commented 7 years ago

@tomkinsc I am adding taxon_filter.py deplete_bmtagger_bam to the list of adjustments. Currently this subcommand does not work with unpaired reads, and furthermore it does not actually support searching across multiple bmtagger databases, although the documentation describes that it does.

dpark01 commented 7 years ago

I'm aware of bmtagger as an issue and something I was going to look at today. However, why do you say it does not support searching across multiple databases?

biocyberman commented 7 years ago

@dpark01 Oh, that was an incorrect speculation from reading the source code of deplete_bmtagger_bam alone. I had not seen the multi_db_deplete_bam function in place.
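For reference, multi-database depletion is naturally a chain: the output of depleting against one database becomes the input for the next. A toy sketch of that pattern (here a "BAM" is just a set of read names and `deplete` stands in for a single-database depletion step; this is not the actual multi_db_deplete_bam implementation, which works on BAM files on disk):

```python
# Toy sketch of chained multi-database depletion: each pass removes reads
# matching one database and feeds the survivors into the next pass.

def multi_db_deplete(reads, databases, deplete_fn):
    for db in databases:
        reads = deplete_fn(reads, db)  # deplete the previous pass's output
    return reads

# Hypothetical databases mapping a name to the read names they would match.
dbs = {"human": {"r1"}, "contaminants": {"r3"}}
deplete = lambda reads, db: reads - dbs[db]

print(multi_db_deplete({"r1", "r2", "r3"}, ["human", "contaminants"], deplete))
# {'r2'}
```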

dpark01 commented 7 years ago

@biocyberman the dp-unpaired branch is getting pretty close, I'd recommend using this as a starting point for unpaired-read hacking.

biocyberman commented 7 years ago

@dpark01 I pulled all the new commits in the dp-unpaired branch and merged them with my current branch. I then could finish this command without an error:

```
assembly.py assemble_trinity IonXpress_001_rawlib.cutad.last.bam /resources/viralngs/contaminants-afmd.fasta IonXpress_001_rawlib.cutad.last.astri.fa
```

However, the output IonXpress_001_rawlib.cutad.last.astrim.fa is empty because most of the reads were thrown away during the trimming step. So I went in and modified the parameters of the call to the execute method; specifically, I lowered sliding_window_q_cutoff to 15 (it was 25 by default). Some suggestions:

  1. It would be more granular and straightforward to take the trimming part out of assemble_trinity (or any other assembler later) and let users call trimming explicitly if they want to.
  2. Trimming parameters should be customizable. If there are too many command-line parameters to manage, they could go into a run config in YAML format or something like that. This does not hamper reproducibility.
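Exposing the cutoff could be as simple as an optional argparse flag with the current default preserved. A sketch of the idea (the flag name is assumed for illustration, not the actual viral-ngs CLI):

```python
import argparse

# Sketch: expose the sliding-window quality cutoff as an optional flag,
# keeping 25 (the current hard-coded value) as the default.
parser = argparse.ArgumentParser(description="assemble_trinity (sketch)")
parser.add_argument("--slidingWindowQCutoff", type=int, default=25,
                    help="Trimmomatic sliding-window quality cutoff (default: 25)")

args = parser.parse_args(["--slidingWindowQCutoff", "15"])
print(args.slidingWindowQCutoff)  # 15
```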
dpark01 commented 7 years ago

@biocyberman I agree with exposing more optional parameters (on the command line and in the config file) as use cases turn up that show the need for it. These parameters were originally tuned on data we've spent a lot of time with, and it's helpful to know where that's different from what others are seeing.

That being said, aren't you concerned if most of your bases are coming in below Q25? We would normally consider that to be a suspect run, but perhaps the IonTorrent behavior is different, or perhaps we're converting IonTorrent quality scores incorrectly?

The philosophy is to be pretty conservative on what's going into de novo assembly via Trinity, since errors that get introduced there are not always easy to smooth out later on during refinement. There are downstream opportunities (during refine_assembly) for all of the reads to fill out and improve the assembly (and those steps get proper statistics from the aligners and callers at that point), but for Trinity, we give it pretty aggressive trimming and duplicate removal.
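For context, sliding-window trimming (as in Trimmomatic's SLIDINGWINDOW step) scans the read from the 5' end and cuts once the average quality inside a window drops below the cutoff, which is why a high cutoff on a low-quality run can discard nearly everything. A simplified illustration of the idea (not Trimmomatic's exact implementation):

```python
def sliding_window_trim(quals, window, cutoff):
    """Return the kept read length: scan windows from the 5' end and cut at
    the start of the first window whose mean quality falls below the cutoff."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < cutoff:
            return i
    return len(quals)

quals = [30, 30, 28, 26, 10, 8, 7, 6]
# With window=4, a cutoff of 25 cuts the read almost immediately, while a
# cutoff of 15 keeps a few more bases.
print(sliding_window_trim(quals, 4, 25))  # 1
print(sliding_window_trim(quals, 4, 15))  # 3
```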

The other thing that highly affects assembly success (other than the success of your wet lab work) is whether you got your lastal filtration database (and scaffolding reference genome) genetically close enough to what you sequenced. If you're working on HCV (which ranks up there in genomic diversity), this can be tricky and may require experimentation.

biocyberman commented 7 years ago

Oh yes, I am concerned about the quality of the reads; it is not normally this low. I am trying to make the best of the data and bring the results to our meeting next week. I will also generate a FastQC report.

I selected HCV reads via taxon_filter.py filter_lastal_bam against about 1000 HCV genomes. In the end, however, I still got non-HCV contigs (human mitochondrion, some bacteria, some ribosomal sequences), so it may be that the lastal search is too loose. I then also ran the data through bmtagger databases of the human genome, human transcriptome, and metagenome contaminant V3, and still got non-HCV contigs.

Thanks for sharing your thoughts on wet lab work and data analysis.