Refine HCV genotype/subtype assignments

ArtPoon commented 8 years ago

As a result of the meeting on March 17, 2016, we are going to add a script to the pipeline that refines genotype assignments by collapsing the reference set to the sequences that map the most reads per genotype/subtype. This should make the bowtie2 map quality score more interpretable with respect to ambiguous mapping of a short read across genotypes or subtypes.

ArtPoon commented 8 years ago

I have created a new branch incorporating the "random primer" pipeline from Kive (mixed-hcv v1.0-9) and am drilling down on sample D56932 (S2). The first thing I noticed was that a substantial number of the pairs for which the first read mapped to genotype 3a had a garbled second read. For example, from the SAM output under v1.0-9:

M01841:182:000000000-AJHNJ:1:1101:14258:2161    89  HCV-3a-NC_009824    2205    2   17S157M77S  =   2205    0
M01841:182:000000000-AJHNJ:1:1101:14258:2161    133 HCV-3a-NC_009824    2205    0   *   =   2205    0

indicates that this read pair was mapped to a 3a reference, even though the second read is not recognizably HCV. I could not even find any matches to the nr database via megablast or discontiguous megablast. Even though this is not true for all pairs that mapped to a 3a reference, it does indicates that assigning pairs to reference genotypes on the basis of the first read only is risky.

So far I am taking the following steps

require both reads to map to HCV and the same genotype
experimenting with cutadapt for trimming adapter sequences from reads
stop writing reads that map to human genome (hg38) to SAM output, this is time- and disk space-consuming

veratai commented 8 years ago

FYI, unless the script has changed, it is already ignoring pairs of read that map to different references.

Really good idea to trim adapters from reads before mapping. (are you going to do this for Micall?)

On Apr 4, 2016, at 2:30 PM, Art Poon notifications@github.com wrote:

I have created a new branch incorporating the "random primer" pipeline from Kive (mixed-hcv v1.0-9) and am drilling down on sample D56932 (S2). The first thing I noticed was that a substantial number of the pairs for which the first read mapped to genotype 3a had a garbled second read. For example, from the SAM output under v1.0-9:

M01841:182:000000000-AJHNJ:1:1101:14258:2161 89 HCV-3a-NC_009824 2205 2 17S157M77S = 2205 0 M01841:182:000000000-AJHNJ:1:1101:14258:2161 133 HCV-3a-NC_009824 2205 0 * = 2205 0 indicates that this read pair was mapped to a 3a reference, even though the second read is not recognizably HCV. I could not even find any matches to the nr database via megablast or discontiguous megablast. Even though this is not true for all pairs that mapped to a 3a reference, it does indicates that assigning pairs to reference genotypes on the basis of the first read only is risky.

So far I am taking the following steps

require both reads to map to HCV and the same genotype experimenting with cutadapt for trimming adapter sequences from reads stop writing reads that map to human genome (hg38) to SAM output, this is time- and disk space-consuming — You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub

cfe-lab / mixed-hcv

Refine HCV genotype/subtype assignments #8