cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Put all of HIV on a single seed #366

Closed by donkirkby 7 years ago

donkirkby commented 8 years ago

We solved issue #285 by putting PR, RT and INT all on the HIV-pol seed. However, we still see some coverage drop-offs at the edges of other HIV regions. For example, sample 44841AD4-HLA-B-E9253012800pico-PR-RT_S67: HIV from the 18 Jun 2014 run drops off at the right end of HIV1B-gag. If we put all of the HIV regions on a single seed, as we do for HCV, that would probably solve the drop-off problems at region boundaries.

Was there a reason that we didn't do it with a single seed reference? Is it because V3LOOP uses consensus B as a seed while everything else uses HXB2? Should we just use consensus B as the seed for everything?

Conclusion

Following discussion with Art, Chanson, and Winnie, we will switch to using the HIV consensus B reference as a seed reference for the full HIV genome. We will continue to use the HXB2 references as coordinate references.

Update

Many V3-LOOP samples have fewer than 100 reads successfully mapped in the prelim_map step, and then map many more in the remap step. For example, sample 68068 in the 6 Jan 2017 run didn't map any reads. When I ran prelim_map with bowtie2's --local option, it mapped just fine.

Experiment with using several HIV references to see if V3-LOOP could map more successfully. Another option is changing bowtie2's options for the prelim_map step, but I think that caused too many HCV genotypes to be chosen.
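To make the --local comparison above concrete, here is a minimal sketch of the two bowtie2 invocations. It is not MiCall's actual prelim_map code: the `build_bowtie2_command` helper, the `hiv_seeds` index prefix, and the file names are all hypothetical. Only the bowtie2 flags themselves (`--local`, `-p`, `-x`, `-1`, `-2`, `-S`) are real.

```python
def build_bowtie2_command(index_prefix, fastq1, fastq2, sam_path,
                          local=False, threads=1):
    """Build a bowtie2 argument list for paired-end reads.

    Hypothetical helper for illustration only.
    local=False leaves bowtie2 in its default end-to-end mode.
    local=True adds --local, which lets bowtie2 soft clip read ends.
    """
    command = ['bowtie2']
    if local:
        command.append('--local')
    command += ['-p', str(threads),   # number of alignment threads
                '-x', index_prefix,   # prefix of the bowtie2 index
                '-1', fastq1,         # forward reads
                '-2', fastq2,         # reverse reads
                '-S', sam_path]       # SAM output file
    return command


# Hypothetical index and file names, not taken from the pipeline.
end_to_end_cmd = build_bowtie2_command(
    'hiv_seeds', 'sample_R1.fastq', 'sample_R2.fastq', 'end_to_end.sam')
local_cmd = build_bowtie2_command(
    'hiv_seeds', 'sample_R1.fastq', 'sample_R2.fastq', 'local.sam',
    local=True)

print(' '.join(end_to_end_cmd))
print(' '.join(local_cmd))
```

Either list could be passed to `subprocess.run` to produce two SAM files for the same sample, one per alignment mode.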

Remaining tasks:

ArtPoon commented 8 years ago

I think having separate seed references for the HIV genes/regions was a decision that preceded our use of coordinate reference sequences. I can't think of a good reason why we shouldn't use reference genomes as seeds instead. One scenario that comes to mind is if the lab is running amplicons from different samples covering different regions (say, PR-RT and V3) using the same indices, but the coordinate references should be able to pop these out into separate streams.

donkirkby commented 7 years ago

The V3LOOP reference we currently use matches the HIV-env Consensus B reference that I downloaded from the curated alignments page at lanl.gov. I used the following parameters to find it:

Unfortunately, the 2004 data doesn't include a full genome region, so I have to go back to 2002 to get that. I'll experiment with the 2002 data for now, but we may want to look at using a set of sequences from the HIV Sequence Database Compendium as our seed references. We could continue to use HXB2 and Consensus B as our coordinate references for aligning the reports.

donkirkby commented 7 years ago

Using the full genome from the Consensus B reference from 2002 solved the boundary drop-off problem in sample 44841AD4-HLA-B-E9253012800pico-PR-RT_S67: HIV from the 18 Jun 2014 run.

However, the samples described in issue #370 with poor V3LOOP mapping didn't map at all. Other samples from the same run didn't change their mapping: exactly the same number of reads mapped in the prelim_map step. The samples were 68088A and 95815A-HLA-B-96069A-V3-2-V3LOOP.
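A comparison like this one, checking whether the same number of reads mapped before and after a seed change, can be sketched by counting mapped records per reference in the SAM output. This is only an illustration under stated assumptions: `count_mapped_reads`, the `HIV-seed` reference name, and the example records are made up, and MiCall's own counting may differ. The sketch relies on the SAM format itself, where column 2 is the flag, column 3 is the reference name, and flag bit 0x4 marks an unmapped read.

```python
from collections import Counter

UNMAPPED = 0x4          # SAM flag bit: read is unmapped
NOT_PRIMARY = 0x900     # SAM flag bits: secondary (0x100) or supplementary (0x800)


def count_mapped_reads(sam_lines):
    """Count primary mapped records per reference name.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith('@'):
            continue  # skip header lines
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue  # skip malformed lines
        flag = int(fields[1])
        rname = fields[2]
        if flag & UNMAPPED or rname == '*':
            continue  # read did not map
        if flag & NOT_PRIMARY:
            continue  # avoid counting the same read twice
        counts[rname] += 1
    return counts


# Made-up records: one mapped pair and one unmapped pair.
example_sam = [
    '@SQ\tSN:HIV-seed\tLN:9000',
    'read1\t99\tHIV-seed\t100\t42\t4M\t=\t200\t104\tACGT\tIIII',
    'read1\t147\tHIV-seed\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII',
    'read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII',
    'read2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII',
]
mapped_counts = count_mapped_reads(example_sam)
print(dict(mapped_counts))
```

Running the same function over the SAM output from each seed set gives per-seed counts that can be compared directly.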

The next thing to try is the 199 sequences from the compendium.