AdamaJava / adamajava

Other
14 stars 5 forks source link

qmotif - strange results when no includes are defined #312

Closed holmeso closed 2 years ago

holmeso commented 2 years ago

Describe the bug

When qmotif is run with the include_only flag set to false and an INCLUDES section defined in the ini file, all is well. When it is run with the include_only flag set to false and NO INCLUDES section defined in the ini file, all is not well.

To Reproduce

Steps to reproduce the behavior:

qmotif ini file with no INCLUDES:/mnt/lustre/working/genomeinfo/transient/20211015_ojh_qmotif_meso/qmotif_all.ini ini file with INCLUDES:/software/genomeinfo/configs/qmotif/qmotif.ini Both ini files haveincludes_only=false

java -jar qmotif.jar --ini <ini_file> -bam <bam> -bai <index> -o <output> -log <log>

Process will complete successfully, but outputs will differ (see below).

Expected behavior If the includes_only flag is set to true, then qmotif will only run against the reads that fall within the INCLUDES section as defined in the ini file. If the INCLUDES section is not present in the ini file then an error should be thrown.

If the includes_only flag is set to false, then qmotif will run against the whole BAM file. If an INCLUDES section exists, reads that fall in those regions will be categories accordingly. If they don't, then they fall into the GENOMIC category. If no INCLUDES section is defined, all (mapped) reads should fall into the GENOMIC category.

Screenshots

<counts>
<totalReadsInThisAnalysis count="1783473739"/>
<noOfMotifs count="119833"/>
<rawUnmapped count="10"/>
<rawIncludes count="121831"/>
<rawGenomic count="32"/>
<scaledUnmapped count="5"/>
<scaledIncludes count="68311"/>
<scaledGenomic count="17"/>
<bases_containing_motifs count="14083986"/>
</counts>

and

<totalReadsInThisAnalysis count="1783473739"/>
<noOfMotifs count="126"/>
<rawUnmapped count="10"/>
<rawIncludes count="0"/>
<rawGenomic count="66"/>
<scaledUnmapped count="5"/>
<scaledIncludes count="0"/>
<scaledGenomic count="37"/>
<bases_containing_motifs count="3798"/>

Email from JP 11/3/2022:

So both the totalReads and the rawUnmapped reads match so that is a good sign.

The bad sign is that the rawGenomic count in the second whole-bam report *should* be the sum of the rawIncludes and rawGenomic from the first report, i.e. it is 66 but should be 32+121823=121853.

If I look at the regions report and just concentrate on the type=“genome” regions, I should find every genomic region from the with-includes exactly reproduced in the full-bam BUT the full-bam should have a heap of extra genomic regions because, for it, all of the reads that would be in includes should have been found and placed into genomic regions.

This turns out to be somewhat true:

the good - every single “genomic” region from the with-includes report is in the full-bam report and is exactly the same, so … the BAM is getting scanned
the bad - the full-bam report only has a tiny number of extra READs (34) when it should have the whole 121823 reads from the incudes scattered across extra genomic regions.
the ugly - I guess the conclusion is that there is some sort of bug in qmotif in non-include mode.

It’s almost like the full-BAM run somehow skipped the includes. And I know the includes were not even in the full-bam INI so skipping them is impossible but something like that is going on.

Very odd.
holmeso commented 2 years ago

OK, so it looks like this is a feature! The GENOMIC RegionType does not look at records that are mapped.