donkirkby closed this issue 4 years ago
@jeff-k has been using de novo assembly to look at several samples that got strange results with the current MiCall pipeline. It looks like one advantage of the technique will be that we can distinguish between these two scenarios:
With the current MiCall pipeline, both of those scenarios just look like lousy mapping with gaps in coverage.
We propose this plan for a full MiCall pipeline that includes de novo assembly:
prelim_map and remap as we combine the reads today

As you can see, this just affects the prelim_map and remap steps. Because they only run on a small number of contigs, they should be much faster.
One risk is that the de novo assembly step might be much slower in some cases than the remap step is currently.
We currently use bowtie2 to map reads to a large set of reference sequences. For most samples, it works well. However, we have had some problems with reference drift (#290), calling HCV subtypes (#436), insertion and deletion positions (#398), and samples that produce different results when you rerun the mapping (#405).
We'd like to experiment with using de novo assembly instead of mapping.
Smith-Waterman-Gotoh to align all the contigs onto the references
use Smith-Waterman to align all the primers onto all the contigs (moved to #478)
cascade.csv
use denovo pipeline as a backup for denovo combined pipeline in Kive watcher
nuc_detail.csv combine (not needed after embedding contigs in ref)
amino_detail.csv and nuc_detail.csv by seed groups, not by seeds. For example, sample 1693-1IN2C2-HIV_S16 from the 09-Aug-2019.M01841 run.
contig_coverage files to genome_coverage, and produce them from both the denovo version and the mapped version
genome_coverage.svg files in a separate folder from the other coverage maps
de novo assembly is very slow for some samples (IVA seems better than savage)
should G2P continue to use merged reads, or should it switch to aligned reads? (moved to #481)
should we try to report V3LOOP overlap again? (moved to #481)
micall_basespace.py
should we make the contig coverage diagram match what we used to cut up the gene regions? 73051ANS5A1-HCV-NS5a_S89 from 15-Jul-2016.M01841 is an example where they don't match. (moved to #479)
use BLAST results to assemble contigs into a full reference? Haven't found any clear cases where they should be combined. HIV3428P100IN200-C19-HIV-S51 from the 20 Sep 2019 run is the closest, but it looks like one contig has a primer at the end. Samples HIV0887-P2D21-HIV_S3 and HIV0887-P2C12-HIV_S32 from 30 Aug 2019 look even better, but have very little overlap. Some of the HCV samples look more promising: 73060A-HCV_S46 from 15 Jul 2016, for example. (moved to #484)
amino_details.csv and combination into amino.csv
deal with HIV references that don't reach 5' and 3' ends or bring back the refs we removed (moved to #484)
check for similar problems with other seed groups (moved to #484)
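
For readers unfamiliar with the Gotoh step mentioned in the task about aligning contigs onto the references, here is a minimal sketch of local alignment scoring with affine gap penalties. It is only an illustration, not MiCall's implementation: the function name `gotoh_local_score` and the scoring parameters (match, mismatch, gap open, gap extend) are assumptions chosen for the example.

```python
def gotoh_local_score(query, ref, match=2, mismatch=-1,
                      gap_open=-5, gap_extend=-1):
    """Best local alignment score of query against ref with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions.
    """
    neg_inf = float("-inf")
    cols = len(ref) + 1
    prev_h = [0] * cols        # best score of any alignment ending at each cell
    prev_f = [neg_inf] * cols  # best score ending with a gap in the reference
    best = 0
    for i in range(1, len(query) + 1):
        cur_h = [0] * cols
        cur_f = [neg_inf] * cols
        e = neg_inf            # best score ending with a gap in the query
        for j in range(1, cols):
            # Either open a new gap from the previous cell or extend one.
            e = max(cur_h[j - 1] + gap_open, e + gap_extend)
            cur_f[j] = max(prev_h[j] + gap_open, prev_f[j] + gap_extend)
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            # Local alignment: never let a score drop below zero.
            cur_h[j] = max(0, prev_h[j - 1] + sub, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best


# Hypothetical contig and reference; the reference carries three extra bases.
contig = "AAAAAAAACCCCCCCC"
reference = "AAAAAAAAGGGCCCCCCCC"
score = gotoh_local_score(contig, reference)
```

The affine penalty is what separates this from plain Smith-Waterman with a linear gap cost: one long gap is cheaper than several short ones, which matters when a contig carries a real insertion or deletion relative to the reference.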