DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
707 stars 271 forks source link

Issue with `k2 add-to-library` #859

Open ChillarAnand opened 1 month ago

ChillarAnand commented 1 month ago

After adding ~3000 files or so, k2 add-to-library fails with this error

+ k2 add-to-library --db k2mdb --file /avilpage.com/anand/GCF_001347155.1_7054_6_6_genomic.fna
Calculating MD5 sum for GCF_001347155.1_7054_6_6_genomic.fna.
MD5 sum is 69bd16c04c88ce08b57bb515e9d1c4cc
Generating prelim_map.txt...done.
Masking low-complexity regions of added library.Could not file masked sequence in input
Please delete masked_sequences.txt before retrying.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2337, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2266, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 558, in add_to_library
    mask_files(
  File "/usr/local/bin/k2", line 518, in mask_files
    masker(fin, i + 1 == number_of_files)
  File "/usr/local/bin/k2", line 494, in masker
    shutil.copyfileobj(input_file, p.stdin)
  File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
    fdst_write(buf)
BrokenPipeError: [Errno 32] Broken pipe

Using kraken2-build --add-to-library works fine though.

ch4rr0 commented 1 month ago

Thank you for reporting this issue. I am currently putting together some changes to help speed up k2 as well as iron out bugs such as these. This issue will be addressed in the next commit.

ChillarAnand commented 1 month ago

Thanksm @ch4rr0.

One more performance fix needs to be pushed to k2.

Currently kraken2-build is using shell commands for sequence to tax id map.

k2 is using python script which is slightly slower. Only this step can be delegated to a shell script which will improve k2 build performance.

In all other steps, performance of k2 is at par with kraken2-build.

I already have few other changes in my repo. https://github.com/DerrickWood/kraken2/compare/master...AvilPage:kraken2:master

ChillarAnand commented 1 month ago

For kraken-db-builder, I switched back to kraken2-build and using a custom md5 check for now.

https://github.com/AvilPage/kraken-db-builder

ch4rr0 commented 1 month ago

Upon further investigation I discovered that this was not a issue with the k2 script.

I added a feature to k2mask which would enable it to continue the masking process where it left off. If the fasta file being masked is the same between runs then this should proceed with issue. If the file gets modified I.e. a sequence that was previously masked got added/removed, k2mask will throw the error you received. This feature should have been disabled by default, but was implicitly enabled due to an error on my part. I have since pushed a commit to disable the feature.

I have also pushed some updates to the k2 script to improve download reliability and speed with some other performance improvements added. Please let me know if you find any issues with the updates.

ChillarAnand commented 1 month ago

Thanks for the quick update. I will test it out and let you know.

Wouldn't be faster to use the previous shell commands for assigning taxids? This is much faster with kraken2-build.

ch4rr0 commented 1 month ago

I'm using multiprocessing so it should be significantly faster now. My ethos is to reduce reliance on shell commands for portability and stability reasons.

ChillarAnand commented 1 month ago
❯ k2 add-to-library --db k2_protozoa --files /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna
Calculating MD5 sum for /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna.
MD5 sum of /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna is 1ca8e6f7ee3b5287443a578e40d0fe57
Generating prelim_map.txt.
Finished generating prelim_map.txt.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2505, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2428, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 559, in add_to_library
    [destination], destination + ".masked", protein=args.protein, threads=args.threads
AttributeError: 'Namespace' object has no attribute 'threads'

k2 is failing with error now.

ch4rr0 commented 1 month ago

Should be fixed in latest commit

ch4rr0 commented 1 month ago

I realized that the performance improvement applied only to download commands. I have gone ahead and applied to add-to-library also. You should now see significant time savings when using this workflow.

Please let me know if you encounter any issues.

ChillarAnand commented 1 month ago

Thanks for the quick patch.

This is working fine when running it on a single thread.

But when I run it using parallel, it still throws error.

$ find /AvilPage/refseq -name '*.fna' | head -n 10000 | parallel -j96 'k2 add-to-library --db k2_test --files {}'
Masking low-complexity regions of added library.Could not file masked sequence in input
Please delete masked_sequences.txt before retrying.
Traceback (most recent call last):
  File "/usr/local/bin/k2", line 2346, in <module>
    k2_main()
  File "/usr/local/bin/k2", line 2275, in k2_main
    add_to_library(args)
  File "/usr/local/bin/k2", line 557, in add_to_library
    mask_files(
  File "/usr/local/bin/k2", line 518, in mask_files
    masker(fin, i + 1 == number_of_files)
  File "/usr/local/bin/k2", line 494, in masker
    shutil.copyfileobj(input_file, p.stdin)
  File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
    fdst_write(buf)
BrokenPipeError: [Errno 32] Broken pipe
ChillarAnand commented 1 month ago

Another important factor to consider is "resuming" the process when the process stops in the middle. Since building a decent db takes ~2 days, there are high chances that the process gets interrupted.

With kraken-db-builder, if the process stops when files are added to the library, it can almost resume instantly as it stores md5 hash at the original file location.

If there are ~500,000 files, k2 takes ~8 hours just to resume at the midpoint. This needs to be improved as well.

Even for build_db, we need to provide an option to incrementally update the db.

ch4rr0 commented 1 month ago

The quality of life improvements that you mentioned are being worked on. Right now I am trying to make sure that downloading is as reliable as possible since that's where most users are having issues.

ChillarAnand commented 1 month ago

I maintain additional scripts just for these. If you are okay with these changes, I can raise separate PRs for these.