Closed: hmusta closed this issue 5 years ago
Thanks for the long bug report. I did find two important errors: the FastQ reader didn't work with your lines longer than 64K (see commit 2e79c3090a9b4a02565a3a2521c8b2b8d13fc23f), and I fixed the DNA canonicalization in commit b1720a1ff41a61933b1067e66250b325ea847ba9.
Hope the new version works.
Thanks for your help! I'll let you know how things go
Since commit b1720a1 was only applied to the classic index, can I assume that my compact indexes are correct?
Yes, compacts are built out of classic indexes.
I reran my script and the results are much closer, but still a bit off. Now, when I query, I get the following:
Reading complete index
Read 384.157 MiB / 384.157 MiB - 100%
Index loaded into RAM.
*gb|HQ845196|+|0-861|ARO:3001109|SHV-52 2
ERR1218773 546
ERR1217061 7
TIMER info=search hashes=0.000107381 io=0.000480277 and rows=2.0661e-05 sort results=1.128e-06 total=0.000609447
The greater number of matches for ERR1217061 (7 instead of 4) can probably be explained by false positives, but I'm still not sure why there are only 546 matches to ERR1218773 instead of 642.
Did you add the --canonicalize flag for cobs compact-construct?
I saw that Mantis mirrors lexicographically larger k-mers, i.e. it replaces them with their reverse complement. COBS doesn't do this by default at the moment.
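To illustrate why a missing canonicalization step can look like false negatives, here is a minimal Python sketch. It is not COBS or Mantis code: the helper names (`reverse_complement`, `canonical`, `count_matches`) and the toy sequences are made up for illustration, and the convention of keeping the lexicographically smaller of a k-mer and its reverse complement is an assumption based on the description above.

```python
# Illustrative sketch only; not taken from COBS or Mantis.
# Assumption: the canonical form of a k-mer is the lexicographically
# smaller of the k-mer and its reverse complement.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_list(seq: str, k: int, canonicalize: bool) -> list:
    """List all k-mers of seq in order, optionally in canonical form."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [canonical(km) for km in kmers] if canonicalize else kmers


def count_matches(query: str, sample: str, k: int, canonicalize: bool) -> int:
    """Count the query k-mers that also occur in the sample."""
    indexed = set(kmer_list(sample, k, canonicalize))
    return sum(1 for km in kmer_list(query, k, canonicalize) if km in indexed)


if __name__ == "__main__":
    sample = "ACGTTGCAAGGC"             # hypothetical indexed sequence
    query = reverse_complement(sample)  # same fragment, opposite strand
    print("with canonicalization:   ", count_matches(query, sample, 5, True))
    print("without canonicalization:", count_matches(query, sample, 5, False))
```

When the query comes from the opposite strand, every k-mer matches only if both the index and the query are canonicalized; without it, most k-mers are missed even though the sequence is present. That is the kind of gap a tool that canonicalizes by default will not show.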
OK, that does indeed fix the problem. I had forgotten to re-enable it during my testing. Thanks for your help!
I'll close this issue then!
Hi,
I've been using COBS in a pipeline I'm working on, but I've noticed what appear to be false negatives in COBS' querying results.
I've used the following script to build compressed COBS and Mantis indexes for the attached input sequences
When I query with Mantis, I get
Whereas when I query the file with COBS, I get
If I exclude PREFIX2 (the second input file), I get the following result in COBS
So it seems like the addition of extra samples leads to a reduction in the number of reported matches. I observe the same behavior if I construct a classic index as well. I've also done some tests with larger data sets where no matches are reported in cases where Mantis reports several.
Overall, the reported numbers are much lower than those reported by Mantis, so I'm not sure how to interpret these results.
inputs.tar.gz queries.tar.gz
Please let me know if there's any other info I can provide to help look into this.
Best, Harun