apcamargo / pycoverm

Simple Python interface to CoverM's fast coverage estimation functions
GNU General Public License v3.0
7 stars 2 forks source link

Lower depth recorded with min_identity=0.0 than with 0.1 #7

Open jakobnissen opened 2 years ago

jakobnissen commented 2 years ago

Here is a minimal working example. I am using the attached file head2.bam with only 4 reads and 1 reference. Note that in order to upload to GitHub, the files must be gzipped - simply unzip the attached file to get the original BAM file.

Note that the depth INCREASES when increasing min_identity. Perhaps it somehow defaults min_identity=0.0 to min_identity=0.97?

>>> import pycoverm
>>> pycoverm.get_coverages_from_bam(["head2.bam"], min_identity=0.0)[1]
array([[0.]], dtype=float32)
>>> pycoverm.get_coverages_from_bam(["head2.bam"], min_identity=0.1)[1]
array([[0.00310749]], dtype=float32)

head2.bam.gz

apcamargo commented 2 years ago

Thanks for the report! I couldn't find anything obvious just by looking at the code. It doesn't seem like vanilla CoverM will default min_identity=0.0 to something higher. I'll try to look into this soon

apcamargo commented 2 years ago

I think I know what is going on. CoverM is checking whether the minimum identity is greater than 0.

I think the easiest solution here would be to set min_identity to something extremely small if the user sets it to zero. Would that work to you, @jakobnissen?

apcamargo commented 2 years ago

I couldn't figure out what CoverM does when the minimum identity is 0. The results it provides are correct, though. Pinging @wwood

jakobnissen commented 2 years ago

It's no rush for me to get it fixed - I set min_identity to 0.001 if the user sets it to 0. But it would be good to know why pycoverm behaves this way.

wwood commented 2 years ago

Hi,

Sorry I'm not really clear on the specifics here, but some notes

CoverM doesn't give those answers,

(coverm-dev)b2:20220505:~/git/coverm-local$ cargo run -- contig -b head2.bam 
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/coverm contig -b head2.bam`
[2022-05-05T05:53:39Z INFO  bird_tool_utils::clap_utils] CoverM version 0.6.1
[2022-05-05T05:53:39Z INFO  coverm] Using min-covered-fraction 0%
[2022-05-05T05:53:40Z INFO  coverm::contig] In sample 'head2', found 4 reads mapped out of 4 total (100.00%)
Contig  head2 Mean
S5C6    0.0031074875

(coverm-dev)b2:20220505:~/git/coverm-local$ cargo run -- contig -b head2.bam --min-read-percent-identity 0.1
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/coverm contig -b head2.bam --min-read-percent-identity 0.1`
[2022-05-05T05:54:15Z INFO  bird_tool_utils::clap_utils] CoverM version 0.6.1
[2022-05-05T05:54:15Z INFO  coverm] Using min-read-percent-identity 10%
[2022-05-05T05:54:15Z INFO  coverm] Using min-covered-fraction 0%
[2022-05-05T05:54:15Z INFO  coverm::contig] In sample 'head2', found 4 reads mapped out of 4 total (100.00%)
Contig  head2 Mean
S5C6    0.0031074875

(coverm-dev)b2:20220505:~/git/coverm-local$ cargo run -- contig -b head2.bam --min-read-percent-identity 0.0
    Finished dev [unoptimized + debuginfo] target(s) in 0.06s
     Running `target/debug/coverm contig -b head2.bam --min-read-percent-identity 0.0`
[2022-05-05T05:57:26Z INFO  bird_tool_utils::clap_utils] CoverM version 0.6.1
[2022-05-05T05:57:26Z INFO  coverm] Using min-read-percent-identity 0%
[2022-05-05T05:57:26Z INFO  coverm] Using min-covered-fraction 0%
[2022-05-05T05:57:26Z INFO  coverm::contig] In sample 'head2', found 4 reads mapped out of 4 total (100.00%)
Contig  head2 Mean
S5C6    0.0031074875

Maybe the default is being put in here? Not very familiar with the python side sorry. https://github.com/apcamargo/pycoverm/blob/81e81d9722cd75db5a0d116535df4b9b3532fc31/src/lib.rs#L87

ben

apcamargo commented 2 years ago

Thanks, @wwood. I also noticed that CoverM returns the correct coverage, I'm still not sure what is going on. Might be something on the Python/Maturin side of things, but it's very strange that it only happens when min_identity is 0.