DennisSchmitz / Jovian_archive

Metagenomics/viromics pipeline that focuses on automation, user-friendliness and a clear audit trail. Jovian aims to empower classical biologists and wet-lab personnel to do metagenomics/viromics analyses themselves, without bioinformatics expertise.
GNU Affero General Public License v3.0

Number of reads mapped to contigs > total reads #56

Closed · samnooij closed this issue 5 years ago

samnooij commented 5 years ago

In some cases, the total number of reads mapped back to (classified) contigs is higher than the total number of (raw) reads. This should not be possible and is probably caused by counting 'multi-mappers', i.e. reads that map to several contigs. It looks like these reads currently count toward every contig they map to, and if there are many such reads, the total number of mapped reads ends up higher than the number of raw reads.
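A quick way to check whether secondary alignments are the likely culprit (the BAM path is illustrative, following the paths used later in this thread):

```bash
# Count secondary alignments (SAM flag 256) in the sorted BAM; a large number
# here suggests multi-mapped reads are being counted more than once.
samtools view -c -f 256 data/scaffolds_filtered/${id}_L001_sorted.bam
```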

This also leads to confusing quantifications by quantify_output and >100% bars in results/Sample_composition_graph.html. (So in these cases the downstream arithmetic actually went right; it is the input numbers that are fishy.)

The issue may be fixed by changing the 'secondary' parameter of the pileup.sh tool in the rule Generate_contigs_metrics. I am currently testing this and will push the change if it works.
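For reference, a minimal sketch of that change, assuming the rule invokes BBMap's pileup.sh on the sorted BAM (the file names here are illustrative, not Jovian's actual outputs):

```bash
# secondary=f makes pileup.sh ignore secondary alignments when computing
# per-contig coverage and read counts, so each read is counted at most once.
pileup.sh in=data/scaffolds_filtered/${id}_L001_sorted.bam \
    out=contig_stats.txt \
    secondary=f
```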

samnooij commented 5 years ago

I looked at the results with this new setting and the results are... different.

Interestingly, for some samples the total number of mapped reads declines, as I expected. However, for other samples, the total number of mapped reads ends up higher.

Moreover, I only see now that previously contigs of 250+ nt were considered, while in this re-analysis only contigs of 500+ nt are reported. And in one sample, the fewer but longer contigs (in the re-analysis) have more reads mapping to them than the more numerous, shorter contigs did.

I am not so sure how to handle this now... I did not expect counting reads would become this complicated.

DennisSchmitz commented 5 years ago

I'm a bit lost about that too. You have more mapped reads when you tell it to ignore secondary alignments? That's really weird.

About the >=250 nt versus >=500 nt contigs: are you testing this on the master or the dev branch? In the dev branch @florianzwagemaker is working on a mode option that changes Jovian's stringency. Currently we have a strict mode, the default in the master branch, with a minimum contig length of 500 nt and a Phred >=30 requirement for raw reads. We know that is too strict (no false positives, but many false negatives for high-Ct/low-titre viruses). Therefore a relaxed mode is being added, which lowers the minimum contig length to >=250 nt and the minimum allowed Phred score of raw reads to >=20. So please check that, because it influences the SPAdes assembly and might explain the weird results.

samnooij commented 5 years ago

I have been counting lots and lots of reads. It has been a very confusing and frustrating experience...

I have counted reads in FASTQ and BAM files, with seqkit and samtools. The problem is in the read mapping with BWA and in how those reads are counted afterwards. It looks like some samples contain many multi-mappers. I have tried several approaches (the first is spelled out below):

- filtering unique read IDs: `samtools view ... | cut -f 1 | sort | uniq | wc -l`
- filtering on mapping quality: `samtools view -q 1 ...`
- filtering on the "SA:" tag (supplementary alignments): `samtools view ... | grep "SA:" | sort | uniq | wc -l`
- checking the numbers with `samtools flagstat`
- various flag-based counts, e.g. `samtools view -c -f 4 ...`
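The unique-read-ID approach from the first bullet, written out in full (the BAM path is illustrative, matching the paths used below):

```bash
# Count distinct read names among mapped records (-F 4 drops unmapped reads);
# each multi-mapped read is counted once, however many alignments it has.
samtools view -F 4 data/scaffolds_filtered/${id}_L001_sorted.bam \
    | cut -f 1 | sort | uniq | wc -l
```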

My conclusions:

(The 'closest' I think I have come is in a sample with 3,684,278 total reads (*pR1.fq + *pR2.fq = the input for BWA) where I counted 3,836,826 mapped + unmapped reads. In other words, the read count is not off by a few reads, but by 3,836,826 - 3,684,278 = 152,548 reads, about 4% for this sample; across samples I have seen discrepancies of roughly 1-15% of the total number of reads. This is also reflected in Sample_composition_graph.html (shown as percentages), where I have seen bars up to ~110%.)

So far my counting reports. I am giving up for now.

(_N.B. the error is not in the visualisation by the Python script. The error is in the way mapped reads are counted, i.e. pileup.sh in the 'scaffoldanalyses' environment. I have not found a better solution yet._)

edit: the contig minimum length filter was set to >=500 nt and the trimming quality (SLIDINGWINDOW) to Phred 30.
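(For context, such a quality filter corresponds to a Trimmomatic step along these lines; the window size of 4 and the file names are assumptions for illustration, not necessarily what Jovian uses:)

```bash
# SLIDINGWINDOW:<window>:<quality> clips a read once the average quality in
# the window drops below the threshold; here the threshold is Phred 30.
trimmomatic PE input_R1.fq input_R2.fq \
    out_pR1.fq out_uR1.fq out_pR2.fq out_uR2.fq \
    SLIDINGWINDOW:4:30
```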

samnooij commented 5 years ago

Together with @DennisSchmitz I found the right way to count the mapped reads, so we can fix this bug at last.

The right way to count mapped reads is as follows (shell commands; ${id} stands for the sample name):

```bash
# Only primary alignments: -F 256 excludes secondary records, -F 2048
# excludes supplementary records, so each read is counted at most once.
# Reads mapped in a pair: read mapped (-F 4) and mate mapped (-F 8).
mapped_paired=$(samtools view -c -F 4 -F 8 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
# Reads mapped as singletons: read mapped (-F 4) but mate unmapped (-f 8).
mapped_single=$(samtools view -c -F 4 -f 8 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
mapped_total=$((mapped_paired + mapped_single))
# Unmapped reads: -f 4, again excluding secondary and supplementary records.
unmapped=$(samtools view -c -f 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
```

Note that this still has to be run for each sample. (The sample name goes where '${id}' is now; the only tool required is samtools.)
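A hypothetical wrapper that applies the commands above to every sample (the glob and the _L001 suffix follow the example paths in this thread and may need adjusting):

```bash
for bam in data/scaffolds_filtered/*_L001_sorted.bam; do
    id=$(basename "${bam}" _L001_sorted.bam)
    mapped_paired=$(samtools view -c -F 4 -F 8 -F 256 -F 2048 "${bam}")
    mapped_single=$(samtools view -c -F 4 -f 8 -F 256 -F 2048 "${bam}")
    unmapped=$(samtools view -c -f 4 -F 256 -F 2048 "${bam}")
    # Sample name, total mapped reads, unmapped reads, tab-separated.
    printf '%s\t%d\t%d\n' "${id}" "$((mapped_paired + mapped_single))" "${unmapped}"
done
```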

_These numbers should replace the pileup.sh numbers and overwrite them in results/all_taxClassified.tsv._

_edit: maybe these numbers should not overwrite the output we generate now, but can be made in a new table that draw_heatmaps and quantify_profiles may use to quantify. A new rule to make this file still seems to be a good idea._

edit2: of course you can also count the total number of mapped reads in one line: `mapped_total=$(samtools view -c -F 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)` (dropping the flag-8 filter counts paired and singleton reads together).
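A quick sanity check, under the assumption that the *pR1.fq/*pR2.fq files mentioned above are exactly what went into BWA (paths illustrative): mapped plus unmapped should now equal the FASTQ read count.

```bash
mapped_total=$(samtools view -c -F 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
unmapped=$(samtools view -c -f 4 -F 256 -F 2048 data/scaffolds_filtered/${id}_L001_sorted.bam)
# FASTQ records are four lines each, so divide the line count by 4.
fastq_lines=$(cat ${id}*pR1.fq ${id}*pR2.fq | wc -l)
echo "mapped+unmapped: $((mapped_total + unmapped)), FASTQ reads: $((fastq_lines / 4))"
```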

samnooij commented 5 years ago

To add numbers per contig, we can use something like:

```bash
# Column 3 of a SAM record is the reference (contig) name; count how many
# primary alignments of fully mapped pairs hit each contig.
samtools view -F 4 -F 8 -F 256 -F 2048 data/scaffolds_filtered/{sample}_L00*_sorted.bam | cut -f 3 | sort | uniq -c
```

This extracts the contig names from the BAM file and counts their unique occurrences. If you want to count forward and reverse reads separately, also filter on flag 16, i.e. -F 16 for forward and -f 16 for reverse reads, respectively (see the sketch below).
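Spelled out, the strand-split counts would look like this (path illustrative):

```bash
# Forward-strand reads per contig: flag 16 (read reverse strand) must be unset.
samtools view -F 4 -F 8 -F 256 -F 2048 -F 16 data/scaffolds_filtered/${id}_L001_sorted.bam | cut -f 3 | sort | uniq -c
# Reverse-strand reads per contig: flag 16 must be set.
samtools view -F 4 -F 8 -F 256 -F 2048 -f 16 data/scaffolds_filtered/${id}_L001_sorted.bam | cut -f 3 | sort | uniq -c
```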

samnooij commented 5 years ago

I made a new script, bin/count_mapped_reads.sh, and updated the Snakefile and bin/quantify_profiles.py accordingly. The problem should be solved now: I no longer see >100% bars in my test dataset. (See commits https://github.com/DennisSchmitz/Jovian/commit/61c823ba5e9e9efa12a8bbc0dfd7c3c0021502f9 and https://github.com/DennisSchmitz/Jovian/commit/2fbf87278b4e2ab510df5dfb532a0b794b506eeb)
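(The linked commits contain the authoritative script; purely as a sketch, a bin/count_mapped_reads.sh built from the commands in this thread could look roughly like this:)

```bash
#!/usr/bin/env bash
# Sketch only; see the linked commits for the real bin/count_mapped_reads.sh.
# Usage: count_mapped_reads.sh <sorted.bam> <sample_id>
set -euo pipefail

bam="$1"
id="$2"

# Primary alignments only (-F 256 -F 2048), split into mapped and unmapped.
mapped=$(samtools view -c -F 4 -F 256 -F 2048 "${bam}")
unmapped=$(samtools view -c -f 4 -F 256 -F 2048 "${bam}")

printf '%s\t%d\t%d\n' "${id}" "${mapped}" "${unmapped}"
```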