Open ChillarAnand opened 3 months ago
Thank you for reporting this issue. I am currently putting together some changes to help speed up k2 as well as iron out bugs such as these. This issue will be addressed in the next commit.
Thanksm @ch4rr0.
One more performance fix needs to be pushed to k2.
Currently kraken2-build is using shell commands for sequence to tax id map.
k2 is using python script which is slightly slower. Only this step can be delegated to a shell script which will improve k2 build performance.
In all other steps, performance of k2 is at par with kraken2-build.
I already have few other changes in my repo. https://github.com/DerrickWood/kraken2/compare/master...AvilPage:kraken2:master
For kraken-db-builder, I switched back to kraken2-build and using a custom md5 check for now.
Upon further investigation I discovered that this was not a issue with the k2 script.
I added a feature to k2mask which would enable it to continue the masking process where it left off. If the fasta file being masked is the same between runs then this should proceed with issue. If the file gets modified I.e. a sequence that was previously masked got added/removed, k2mask will throw the error you received. This feature should have been disabled by default, but was implicitly enabled due to an error on my part. I have since pushed a commit to disable the feature.
I have also pushed some updates to the k2 script to improve download reliability and speed with some other performance improvements added. Please let me know if you find any issues with the updates.
Thanks for the quick update. I will test it out and let you know.
Wouldn't be faster to use the previous shell commands for assigning taxids? This is much faster with kraken2-build.
I'm using multiprocessing so it should be significantly faster now. My ethos is to reduce reliance on shell commands for portability and stability reasons.
❯ k2 add-to-library --db k2_protozoa --files /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna
Calculating MD5 sum for /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna.
MD5 sum of /home/anand/.cache/kdb/refseq/protozoa/GCF_900002385.2/GCF_900002385.2_GCA_900002385_genomic.fna is 1ca8e6f7ee3b5287443a578e40d0fe57
Generating prelim_map.txt.
Finished generating prelim_map.txt.
Traceback (most recent call last):
File "/usr/local/bin/k2", line 2505, in <module>
k2_main()
File "/usr/local/bin/k2", line 2428, in k2_main
add_to_library(args)
File "/usr/local/bin/k2", line 559, in add_to_library
[destination], destination + ".masked", protein=args.protein, threads=args.threads
AttributeError: 'Namespace' object has no attribute 'threads'
k2
is failing with error now.
Should be fixed in latest commit
I realized that the performance improvement applied only to download commands. I have gone ahead and applied to add-to-library also. You should now see significant time savings when using this workflow.
Please let me know if you encounter any issues.
Thanks for the quick patch.
This is working fine when running it on a single thread.
But when I run it using parallel, it still throws error.
$ find /AvilPage/refseq -name '*.fna' | head -n 10000 | parallel -j96 'k2 add-to-library --db k2_test --files {}'
Masking low-complexity regions of added library.Could not file masked sequence in input
Please delete masked_sequences.txt before retrying.
Traceback (most recent call last):
File "/usr/local/bin/k2", line 2346, in <module>
k2_main()
File "/usr/local/bin/k2", line 2275, in k2_main
add_to_library(args)
File "/usr/local/bin/k2", line 557, in add_to_library
mask_files(
File "/usr/local/bin/k2", line 518, in mask_files
masker(fin, i + 1 == number_of_files)
File "/usr/local/bin/k2", line 494, in masker
shutil.copyfileobj(input_file, p.stdin)
File "/usr/lib/python3.10/shutil.py", line 198, in copyfileobj
fdst_write(buf)
BrokenPipeError: [Errno 32] Broken pipe
Another important factor to consider is "resuming" the process when the process stops in the middle. Since building a decent db takes ~2 days, there are high chances that the process gets interrupted.
With kraken-db-builder, if the process stops when files are added to the library, it can almost resume instantly as it stores md5 hash at the original file location.
If there are ~500,000 files, k2
takes ~8 hours just to resume at the midpoint. This needs to be improved as well.
Even for build_db, we need to provide an option to incrementally update the db.
The quality of life improvements that you mentioned are being worked on. Right now I am trying to make sure that downloading is as reliable as possible since that's where most users are having issues.
I maintain additional scripts just for these. If you are okay with these changes, I can raise separate PRs for these.
After adding ~3000 files or so,
k2 add-to-library
fails with this errorUsing
kraken2-build --add-to-library
works fine though.