Report unmatched reads in sam2aln

cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C

https://cfe-lab.github.io/MiCall

GNU Affero General Public License v3.0


Report unmatched reads in sam2aln #196

Status: Closed (donkirkby closed this issue 9 years ago)

donkirkby commented 9 years ago

If a read is rejected for low quality in the remap stage, its paired read never gets matched in the sam2aln stage. Currently, that is not reported, but unmatched reads could easily be added to the failed_read.csv file.

One example is sample 60780A-HLA-B_S72 from the 1 Apr 2015 run.

[x] Is this particular scenario important?
[x] Report qname and reason, but not sequence or quality.
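
As a rough illustration of the idea above, unmatched reads can be found by counting how often each qname appears among the reads that reach the pairing step: a qname seen only once has lost its mate. This is a minimal sketch, not MiCall's actual code. The function names, the `qname`/`cause` column layout, and the `'unmatched'` cause label are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def find_unmatched_reads(qnames):
    """Return the qnames that appear exactly once (their mate is missing).

    Hypothetical helper: assumes each surviving read contributes one qname.
    """
    counts = Counter(qnames)
    return [qname for qname, n in counts.items() if n == 1]


def write_unmatched(qnames, out_file, cause='unmatched'):
    """Write one row per unmatched read: qname and reason only,
    no sequence or quality. The column names are an assumption."""
    writer = csv.DictWriter(out_file, ['qname', 'cause'], lineterminator='\n')
    writer.writeheader()
    for qname in find_unmatched_reads(qnames):
        writer.writerow({'qname': qname, 'cause': cause})


if __name__ == '__main__':
    report = io.StringIO()
    # r2 lost its mate, so it is the only read reported.
    write_unmatched(['r1', 'r1', 'r2', 'r3', 'r3'], report)
    print(report.getvalue())
```

Leaving out sequence and quality keeps each row small, which matters if many reads in a run lose their mates.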

donkirkby commented 9 years ago

This happened to all the forward reads in the 23 Mar 2015 run. Should we limit how much data we report in the failed_read.csv file? A typical FASTQ file in this run was about 50 MB, so the collated failed_read.csv file would be roughly 500 MB.
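
One possible way to bound the file size, sketched here purely as an assumption and not as a decision made in this issue, is to write only the first N failures and count the rest. The function name, the `max_rows` parameter, and the column names are hypothetical.

```python
import csv
import io


def write_failed_reads_capped(rows, out_file, max_rows=1000):
    """Write at most max_rows (qname, cause) rows to out_file.

    Returns (written, skipped) so the caller can still report how many
    failures were left out of the file.
    """
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['qname', 'cause'])  # assumed column names
    written = skipped = 0
    for qname, cause in rows:
        if written < max_rows:
            writer.writerow([qname, cause])
            written += 1
        else:
            skipped += 1
    return written, skipped


if __name__ == '__main__':
    failures = [('r%d' % i, 'unmatched') for i in range(5)]
    report = io.StringIO()
    written, skipped = write_failed_reads_capped(failures, report, max_rows=2)
    print(written, skipped)
```

The skipped count could be reported separately, so the total number of failures is still visible even when the individual rows are not.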

donkirkby commented 9 years ago

Would it be helpful to extend the collated_counts.csv file through more steps? That way you could see where reads got lost along the way. We'd probably want to consistently report read pairs. Some ideas for entries in the list:

raw
preliminary map to reference X
remap to reference X
unmapped
pair mapped to different references
pair with unmapped mate
pair failed to align with consensus
pairs in consensus that failed to align with all coordinate references
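
To make a few of the entries above concrete, here is a minimal sketch of counting read pairs by category. It covers only the categories that can be decided from the two reads' mapped references. The function names, the use of `None` for an unmapped read, and the `type`/`count` columns are assumptions for illustration, not the real collated_counts.csv layout.

```python
import csv
import io
from collections import Counter


def classify_pair(ref1, ref2):
    """Classify a read pair by the reference each read mapped to.

    None stands for an unmapped read (an assumption of this sketch).
    """
    if ref1 is None and ref2 is None:
        return 'unmapped'
    if ref1 is None or ref2 is None:
        return 'pair with unmapped mate'
    if ref1 != ref2:
        return 'pair mapped to different references'
    return 'remap to reference ' + ref1


def count_pairs(pairs):
    """Count read pairs per category, plus a 'raw' total of all pairs."""
    counts = Counter()
    for ref1, ref2 in pairs:
        counts['raw'] += 1
        counts[classify_pair(ref1, ref2)] += 1
    return counts


def write_counts(counts, out_file):
    """Write one row per category, sorted by category name."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['type', 'count'])  # assumed column names
    for category in sorted(counts):
        writer.writerow([category, counts[category]])


if __name__ == '__main__':
    pairs = [('HIV', 'HIV'), ('HIV', None), (None, None),
             ('HIV', 'HCV'), ('HIV', 'HIV')]
    report = io.StringIO()
    write_counts(count_pairs(pairs), report)
    print(report.getvalue())
```

Because every pair lands in exactly one category besides `raw`, the category counts add up to the raw total, which makes it easy to see where pairs were lost.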