Open ArtPoon opened 8 years ago
I have created a new branch incorporating the "random primer" pipeline from Kive (mixed-hcv v1.0-9) and am drilling down on sample D56932 (S2). The first thing I noticed was that a substantial number of the pairs for which the first read mapped to genotype 3a had a garbled second read. For example, from the SAM output under v1.0-9:
M01841:182:000000000-AJHNJ:1:1101:14258:2161 89 HCV-3a-NC_009824 2205 2 17S157M77S = 2205 0
M01841:182:000000000-AJHNJ:1:1101:14258:2161 133 HCV-3a-NC_009824 2205 0 * = 2205 0
indicates that this read pair was mapped to a 3a reference, even though the second read is not recognizably HCV. I could not even find any matches to the nr database via megablast or discontiguous megablast. Even though this is not true for all pairs that mapped to a 3a reference, it does indicates that assigning pairs to reference genotypes on the basis of the first read only is risky.
So far I am taking the following steps
cutadapt
for trimming adapter sequences from readsFYI, unless the script has changed, it is already ignoring pairs of read that map to different references.
Really good idea to trim adapters from reads before mapping. (are you going to do this for Micall?)
On Apr 4, 2016, at 2:30 PM, Art Poon notifications@github.com wrote:
I have created a new branch incorporating the "random primer" pipeline from Kive (mixed-hcv v1.0-9) and am drilling down on sample D56932 (S2). The first thing I noticed was that a substantial number of the pairs for which the first read mapped to genotype 3a had a garbled second read. For example, from the SAM output under v1.0-9:
M01841:182:000000000-AJHNJ:1:1101:14258:2161 89 HCV-3a-NC_009824 2205 2 17S157M77S = 2205 0 M01841:182:000000000-AJHNJ:1:1101:14258:2161 133 HCV-3a-NC_009824 2205 0 * = 2205 0 indicates that this read pair was mapped to a 3a reference, even though the second read is not recognizably HCV. I could not even find any matches to the nr database via megablast or discontiguous megablast. Even though this is not true for all pairs that mapped to a 3a reference, it does indicates that assigning pairs to reference genotypes on the basis of the first read only is risky.
So far I am taking the following steps
require both reads to map to HCV and the same genotype experimenting with cutadapt for trimming adapter sequences from reads stop writing reads that map to human genome (hg38) to SAM output, this is time- and disk space-consuming — You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub
As a result of the meeting on March 17, 2016, we are going to add a script to the pipeline that refines genotype assignments by collapsing the reference set to the sequences that map the most reads per genotype/subtype. This should make the bowtie2 map quality score more interpretable with respect to ambiguous mapping of a short read across genotypes or subtypes.