I looked at the results with this new setting and the results are... different.
Interestingly, for some samples the total number of mapped reads declines, as I expected. However, for other samples, the total number of mapped reads ends up higher.
Moreover, I see just now that previously contigs of 250+ nt were considered, while in this re-analysis only contigs of 500+ nt are reported. And in one sample, the fewer but longer contigs (in the re-analysis) have more reads mapping to them than the more numerous, shorter contigs did.
I am not so sure how to handle this now... I did not expect that counting reads would become this complicated.
I'm a bit lost about that too. You have more mapped reads when you tell it to ignore secondary alignments? That's really weird.
About the >=250 nt and >=500 nt contigs: are you testing this in the `master` or the `dev` branch? In the `dev` branch @florianzwagemaker is working on a `mode` option which changes Jovian's stringency. Currently we have a strict `mode` setting, the default in the `master` branch, with a minimum contig length of 500 nt and a Phred >30 requirement for raw reads. We know that is too strict (no false positives, but many false negatives for high-Ct/low-titer viruses). Therefore, a 'relaxed' mode is included that lowers the minimum contig length (indeed, to >=250 nt) and the minimum allowed Phred score of raw reads (to >=20). So please check that, because it influences the SPAdes assembly and might explain the weird results.
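(For illustration only: the two cut-offs amount to a simple length filter on the assembled contigs. A minimal sketch, assuming the SPAdes output is called `scaffolds.fasta` and using `seqkit`, which is mentioned below; this is not necessarily how Jovian applies the filter internally.)

```bash
# strict mode: keep only contigs of >=500 nt (illustrative file names)
seqkit seq -m 500 scaffolds.fasta > scaffolds_ge500nt.fasta
# relaxed mode: keep contigs of >=250 nt
seqkit seq -m 250 scaffolds.fasta > scaffolds_ge250nt.fasta
```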
I have been counting lots and lots of reads. It has been a very confusing and frustrating experience...
I have counted reads in fastq files and bam files, with seqkit and samtools. The problem lies in the read mapping with BWA and the subsequent counting of those reads. It looks like some samples contain many multi-mappers. I have tried:

- filtering unique reads by taking read IDs (`samtools view ... | cut -f 1 | sort | uniq | wc -l`),
- filtering on mapping quality (`samtools view -q 1 ...`),
- filtering on the "SA:" tag, which marks supplementary/chimeric alignments (`samtools view ... | grep "SA:" | sort | uniq | wc -l`),
- checking the numbers with `samtools flagstat`,
- various counts from samtools itself (e.g. `samtools view -c -f 4 ...`).
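To make the discrepancies concrete, here is a minimal sketch (the BAM path is illustrative) of three 'read counts' that can all differ on the same file once secondary/supplementary alignments are present:

```bash
bam=data/scaffolds_filtered/sample1_L001_sorted.bam  # illustrative path

samtools view -c "$bam"                            # every alignment record, incl. secondary/supplementary
samtools view -c -F 0x900 "$bam"                   # primary alignments only (excludes FLAGs 256 and 2048)
samtools view "$bam" | cut -f 1 | sort -u | wc -l  # distinct read names (mates share a name, so pairs count once)
```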
My conclusions:
(The 'closest' I think I have come is in a sample with 3,684,278 total reads (*pR1.fq + *pR2.fq = the input for BWA), where I counted 3,836,826 mapped + unmapped reads. In other words, the read count is not off by a few reads, but by a few hundred thousand reads, which is roughly 1-15% of the total number of reads in the sample. This number is also reflected by the `Sample_composition_graph.html` (shown as percentages), where I have seen bars up to ~110%.)
So much for my counting reports; I am giving up for now.
(_N.B. the error is not in the visualisation by the Python script. The error is in the way mapped reads are counted, i.e. by `pileup.sh` in the 'scaffoldanalyses' environment. I have not found any better solution._)
edit: the contig minimum-length filter was set to >=500 nt; the trimming quality (SLIDINGWINDOW) was set to Phred 30.
Together with @DennisSchmitz I found the right way to count the mapped reads, so we can fix this bug at last.
The right way to count the mapped reads is as follows (shell commands):

```bash
# Count primary alignments only: exclude secondary (FLAG 256) and supplementary (FLAG 2048) records.
# Mapped pairs have both mates mapped (-F 4 -F 8); mapped singletons have an unmapped mate (-f 8).
mapped_paired=$(samtools view -c -F 4 -F 8 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
mapped_single=$(samtools view -c -F 4 -f 8 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
mapped_total=$((mapped_paired + mapped_single))
unmapped=$(samtools view -c -f 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
```
Note that this still has to be implemented for each sample (the sample name should go where `${id}` is now; the only required tool is `samtools`), for example with a loop like the sketch below.
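A minimal sketch of such a per-sample wrapper (the glob pattern and the output format are my own assumptions, not necessarily what the eventual implementation should look like):

```bash
#!/usr/bin/env bash
# For every sorted BAM, report: sample ID, total mapped reads, unmapped reads.
for bam in data/scaffolds_filtered/*_L001_sorted.bam; do
    id=$(basename "${bam}" _L001_sorted.bam)
    mapped_paired=$(samtools view -c -F 4 -F 8 -F 256 -F 2048 "${bam}")
    mapped_single=$(samtools view -c -F 4 -f 8 -F 256 -F 2048 "${bam}")
    unmapped=$(samtools view -c -f 4 -F 256 -F 2048 "${bam}")
    printf '%s\t%d\t%d\n' "${id}" "$((mapped_paired + mapped_single))" "${unmapped}"
done
```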
_These numbers should replace the `pileup.sh` numbers and overwrite them in `results/all_taxClassified.tsv`._
_edit: maybe these numbers should not overwrite the output we generate now, but can be written to a new table that `draw_heatmaps` and `quantify_profiles` may use for quantification. A new rule to make this file still seems to be a good idea._
edit2: of course you can also count the total number of mapped reads in one line:

```bash
mapped_total=$(samtools view -c -F 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
```
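Since the `-F` exclusion bits form a bitmask, the three values can equivalently be combined into a single flag (4 + 256 + 2048 = 2308):

```bash
mapped_total=$(samtools view -c -F 2308 data/scaffolds_filtered/${id}_L001_sorted.bam)
```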
To add numbers per contig, we can use something like:

```bash
samtools view -F 4 -F 8 -F 256 -F 2048 data/scaffolds_filtered/{sample}_L00*_sorted.bam | cut -f 3 | sort | uniq -c
```

This pulls the contig names out of the bam file (column 3) and counts the unique occurrences. If you want to count forward and reverse reads separately, also filter on FLAG 16 (i.e. do a `-F 16` and a `-f 16` for forward and reverse, respectively), as in the sketch below.
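That strand split would look like this (same illustrative paths as above):

```bash
# Forward-strand reads per contig (FLAG 16 unset).
samtools view -F 4 -F 8 -F 256 -F 2048 -F 16 data/scaffolds_filtered/{sample}_L001_sorted.bam | cut -f 3 | sort | uniq -c
# Reverse-strand reads per contig (FLAG 16 set).
samtools view -F 4 -F 8 -F 256 -F 2048 -f 16 data/scaffolds_filtered/{sample}_L001_sorted.bam | cut -f 3 | sort | uniq -c
```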
I made a new script, `bin/count_mapped_reads.sh`, and updated the `Snakefile` and `bin/quantify_profiles.py` accordingly. The problem should be solved now: I have no more >100% bars in my test dataset. (See commits https://github.com/DennisSchmitz/Jovian/commit/61c823ba5e9e9efa12a8bbc0dfd7c3c0021502f9 and https://github.com/DennisSchmitz/Jovian/commit/2fbf87278b4e2ab510df5dfb532a0b794b506eeb.)
In some cases, the total number of reads mapped back to (classified) contigs is higher than the total number of (raw) reads. This should not be possible and is probably caused by counting 'multi-mappers', i.e. reads that map to several contigs. It looks like these reads currently count toward every contig they map to, and if there are many such reads, the total number of mapped reads ends up higher than the number of raw reads.
This also leads to confusing quantifications by `quantify_output` and >100% bars in `results/Sample_composition_graph.html`. (So it seems that in these cases the counting of the numbers actually went right, but the numbers provided to it are fishy.) The issue may be fixed by changing the 'secondary' parameter of the `pileup.sh` tool in the rule `Generate_contigs_metrics`. I am currently testing this and will push the change if it works.
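For reference, a sketch of what that change could look like in the rule's shell command (assuming BBMap's `pileup.sh`; the file names here are illustrative, not necessarily the ones Jovian uses):

```bash
# secondary=f tells pileup.sh to ignore secondary alignments when
# computing per-contig read counts and coverage (illustrative paths).
pileup.sh in=data/scaffolds_filtered/${id}_L001_sorted.bam \
    out=data/scaffolds_filtered/${id}_perScaffold.stats \
    secondary=f
```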