Phelimb / BIGSI

BItsliced Genomic Signature Index - Efficient indexing and search in very large collections of WGS data
http://www.bigsi.io
MIT License

Updating an existing database returns results for only the first bloom inserted. #28

Closed by rpetit3 6 years ago

rpetit3 commented 6 years ago

When updating an existing database, a search query returns hits for only the first bloom filter that was inserted. At the same time, the new sample is recognized as already existing in the database, so it cannot be re-added.

Below is an example using the test data. I'm running BIGSI at the latest commit (cabebc7857e99d34549fc7b78eceae62c349884e) on an Ubuntu 16.04 machine with Python 3.6.3. Please let me know if you need any more details.

$ bigsi init test-bigsi --k 31 --m 1000 --h 1
INFO:bigsi.graph.probabilistic:Initialising BIGSI at test-bigsi
{'k': '31', 'm': '1000', 'h': '1', 'db': 'test-bigsi'}

$ bigsi build test-bigsi test1.bloom test2.bloom -s s1 -s s2
{'result': 'success'}

$ bigsi search -o tsv --db test-bigsi -s CGGCGAGGAAGCGTTAAATCTCTTTCTGACG
gene_name   sample_id   kmer_coverage_percent   time
CGGCGAGGAAGCGTTAAATCTCTTTCTGACG s1  100 0.00031876564025878906
CGGCGAGGAAGCGTTAAATCTCTTTCTGACG s2  100 0.00031876564025878906

$ bigsi build test-bigsi test3.bloom -s s3
{'result': 'success'}

$ bigsi search -o tsv --db test-bigsi -s CGGCGAGGAAGCGTTAAATCTCTTTCTGACG
gene_name   sample_id   kmer_coverage_percent   time
CGGCGAGGAAGCGTTAAATCTCTTTCTGACG s1  100 0.00028705596923828125

$ bigsi build test-bigsi test3.bloom -s s3
Traceback (most recent call last):
  File "/home/rpetit/.pyenv/versions/general/bin/bigsi", line 11, in <module>
    load_entry_point('bigsi==0.1.6', 'console_scripts', 'bigsi')()
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/__main__.py", line 227, in main
    API.cli()
  File "hug/api.py", line 381, in hug.api.CLIInterfaceAPI.__call__
  File "hug/interface.py", line 439, in hug.interface.CLI.__call__
  File "hug/interface.py", line 118, in hug.interface.Interfaces.__call__
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/__main__.py", line 143, in build
    return build(graph=BIGSI(db), bloomfilter_filepaths=bloomfilters, samples=samples)
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/cmds/build.py", line 27, in build
    graph.build(bloomfilters, samples)
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/graph/probabilistic.py", line 117, in build
    [self._add_sample(s) for s in samples]
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/graph/probabilistic.py", line 117, in <listcomp>
    [self._add_sample(s) for s in samples]
  File "/home/rpetit/.pyenv/versions/general/lib/python3.6/site-packages/bigsi-0.1.6-py3.6.egg/bigsi/graph/probabilistic.py", line 346, in _add_sample
    raise ValueError("%s already exists in the db" % sample_name)
ValueError: s3 already exists in the db
Phelimb commented 6 years ago

Hi @rpetit3,

Thanks for the report! Sorry for the delay in getting it fixed.

build rebuilds the index from scratch, so

bigsi build test-bigsi test3.bloom -s s3

should be replaced with either:

bigsi build test-bigsi test1.bloom test2.bloom test3.bloom -s s1 -s s2 -s s3 --force 1

or:

bigsi insert test-bigsi test3.bloom s3

The insert command wasn't working at cabebc7 but I've pushed a fix here: https://github.com/Phelimb/BIGSI/commit/d4b6f8e44693312ace3461592cd432be643776f1

rpetit3 commented 6 years ago

Hey @Phelimb

Thanks for the update! Tested everything out, and insert is working like a charm now! By any chance, do you have any recommendations for creating a 10k+ sample database? It will be a single-species database, so there won't be much variation in k-mers.

Currently I'm testing a build, then inserting samples one at a time. I haven't tested inserting multiple samples at a time. Thought I would ask before I go further.

Thanks again for the update!

Phelimb commented 6 years ago

So, it depends slightly on your compute resources. Provided these samples are bacterial and ~5 Mbp in size, the default parameters will work. You'll need ~280GB of memory to build the index (http://www.wolframalpha.com/input/?i=10,000*3.5MB*8; the factor of 8 comes from numpy representing each bit as a full byte), or ~35GB if you replace the transpose method in https://github.com/Phelimb/BIGSI/blob/master/bigsi/matrix/transpose.py with the one that doesn't use numpy (it's slower but uses less memory).
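
To make that trade-off concrete, here is a minimal sketch of the two transpose strategies (illustrative only, not BIGSI's actual transpose.py; the function names and the bitarray-based representation are assumptions):

```python
# Hypothetical sketch, not BIGSI's real code. Each sample's bloom filter
# is assumed to be a bitarray of length m; building the index means
# transposing the (num_samples x m) bit matrix into m rows of
# num_samples bits each.
import numpy as np
from bitarray import bitarray

def transpose_numpy(bloomfilters):
    # Fast but memory-hungry: numpy stores each bool as a full byte, so
    # the dense matrix costs num_samples * m bytes (hence the 8x factor).
    matrix = np.array([bf.tolist() for bf in bloomfilters], dtype=bool)
    for column in matrix.T:
        row = bitarray()
        row.extend(column.tolist())
        yield row  # one index row of num_samples bits per bloom position

def transpose_low_mem(bloomfilters):
    # Slower but lean: only the bit-packed inputs plus one output row are
    # ever held in memory, staying close to the raw ~35GB of bloom filters.
    m = len(bloomfilters[0])
    n = len(bloomfilters)
    for i in range(m):
        row = bitarray(n)
        for j, bf in enumerate(bloomfilters):
            row[j] = bf[i]
        yield row
```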

If you don't have a machine with ~300GB (or ~40GB) of memory, I would suggest building in chunks as large as possible and then merging the resulting indexes. The merge command is currently not working, unfortunately, but it simply iterates through all the rows in the BerkeleyDB and concatenates them.
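
Conceptually, that merge looks something like the sketch below (hypothetical code, not the real merge command; the bsddb3 calls and row-key scheme are assumptions, and it only works cleanly if the first index's sample count is a multiple of 8 so the byte-packed rows stay aligned):

```python
# Hypothetical sketch of the row-concatenation merge described above;
# the key scheme and bsddb3 usage are assumptions, not BIGSI's actual
# storage layout.
import bsddb3
from bitarray import bitarray

def merge_indexes(path_a, path_b, out_path, m):
    db_a = bsddb3.hashopen(path_a, "r")
    db_b = bsddb3.hashopen(path_b, "r")
    out = bsddb3.hashopen(out_path, "c")
    for i in range(m):
        key = str(i).encode()  # assumed scheme: one key per bit position
        row = bitarray()
        row.frombytes(db_a[key])
        tail = bitarray()
        tail.frombytes(db_b[key])
        # Append b's columns after a's columns; correct only if a's
        # sample count is a multiple of 8 (no pad bits in the middle).
        row.extend(tail)
        out[key] = row.tobytes()
    out.sync()
```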

I have my PhD viva this week so I'll fix the merge command next week and write up a tutorial on how to build the index with the split/merge approach.