bingmann / cobs

COBS - Compact Bit-Sliced Signature Index (for Genomic k-Mer Data or q-Grams)
https://panthema.net/cobs
MIT License
83 stars 15 forks source link

Possible false negatives in query #5

Closed hmusta closed 4 years ago

hmusta commented 4 years ago

Hi,

I've been using COBS in a pipeline I'm working on, but I've noticed what appear to be false negatives in COBS' querying results.

I've used the following script to build compressed COBS and Mantis indexes for the attached input sequences

PREFIX=ERR1218773
PREFIX2=ERR1217061
NAME=small

# COBS
cobs compact-construct -k 31 -f 0.01 -T 1 --num-hashes 7 inputs/ $NAME.cobs_compact

# Mantis
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX.ser inputs/$PREFIX.unitigs.fq
squeakr count -e -k 31 -t 1 -o inputs/$PREFIX2.ser inputs/$PREFIX2.unitigs.fq
echo inputs/$PREFIX.ser > inputs.txt
echo inputs/$PREFIX2.ser >> inputs.txt
mantis build -s 25 -i inputs.txt -o mantis

# query
cobs query -i $NAME.cobs_compact -f queries/query.fa -T 1 -t 0.0 --load-complete
mantis query -1 -p mantis/ -o /dev/stdout queries/query.seq.txt

When I query with Mantis, I get

[2019-10-24 17:46:48.174] [mantis_console] [info] Reading colored dbg from disk.
[2019-10-24 17:46:48.297] [mantis_console] [info] Read colored dbg with 9205186 k-mers and 3 color classes
[2019-10-24 17:46:48.297] [mantis_console] [info] Reading query kmers from disk.
[2019-10-24 17:46:48.298] [mantis_console] [info] Total k-mers to query: 831
[2019-10-24 17:46:48.298] [mantis_console] [info] Querying the colored dbg.
-----------------------------------------
| Query time  | Time = 277.25 us
-----------------------------------------
0       831
inputs/ERR1218773.ser   642
inputs/ERR1217061.ser   4
[2019-10-24 17:46:48.298] [mantis_console] [info] Writing done.

Whereas when I query the file with COBS, I get

Reading complete index
Read 33.331 MiB / 33.331 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1217061      10
ERR1218773      1
TIMER info=search hashes=0.000268999 io=0.000221758 and rows=2.1032e-05 sort results=1.396e-06 total=0.000513185

If I exclude PREFIX2 (the second input file), I get the following result in COBS

Reading complete index
Read 20.586 MiB / 20.586 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 1
ERR1218773      5
TIMER info=search hashes=0.000157673 io=0.000260737 and rows=2.2831e-05 sort results=1.326e-06 total=0.000442567

So it seems like the addition of extra samples leads to a reduction in the number of reported matches. I observe the same behavior if I construct a classic index as well. I've also done some tests with larger data sets where no matches are reported in cases where Mantis reports several.

Overall, the reported numbers are much lower than those reported by Mantis, so I'm not sure how to interpret these results.

inputs.tar.gz queries.tar.gz

Please let me know if there's any other info I can provide to help look into this.

Best, Harun

bingmann commented 4 years ago

thanks for the long bug report, I did find two important errors: the FastQ reader didnt work with your lines > 64K length. commit 2e79c3090a9b4a02565a3a2521c8b2b8d13fc23f And I fixed the DNA canonicalization. in commit b1720a1ff41a61933b1067e66250b325ea847ba9

bingmann commented 4 years ago

Hope the new version works.

hmusta commented 4 years ago

Thanks for your help! I'll let you know how things go

hmusta commented 4 years ago

Since commit b1720a1 was only applied to the classic index, can I assume that my compact indexes are correct?

bingmann commented 4 years ago

Yes, compacts are built out of classic indexes.

hmusta commented 4 years ago

I reran my script and it seems like the results are much closer, but still a bit off. Now, when I query, I get the following

Reading complete index
Read 384.157 MiB / 384.157 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1218773      546
ERR1217061      7
TIMER info=search hashes=0.000107381 io=0.000480277 and rows=2.0661e-05 sort results=1.128e-06 total=0.000609447

The greater number of matches for ERR1217061 (7 instead of 4) can probably be explained by false positives, but I'm still not sure why there are only 546 matches to ERR1218773 instead of 642.

bingmann commented 4 years ago

Did you add the --canonicalize flag for cobs compact-construct?

I saw mantis mirrors lexicographically larger k-mers. COBS doesnt by default atm.

hmusta commented 4 years ago

Ok, that does indeed fix the problem, I had forgotten to re-enable it during my testing. Thanks for your help!

I'll close this issue then!