Open tillea opened 5 years ago
Hi @tillea I've submitted your patch as a pull-request (#142).
@tillea I've had to reopen this issue and revert the patch due to the bug it introduced in #166.
I'm very hesitant to make Bioperl a dependency for Kraken 2. Can you explain what in the provided examples (which I can't find) is causing the parsing issue?
Hmmm, sorry that my patch broke some other things. :-( Bioperl has well tested routines to read and write sequences. It should be able to read any valid file format easily.
I don't doubt that bioperl is fairly well-tested, but introducing a dependency to Kraken 2 carries with it new problems (e.g., ensuring bioperl is installed on the user's computer in such a way that the K2 scripts can use it). Given that we've already got manual parsing in the C++ code (which probably has to stay given the interaction with multithreading), I don't think manually parsing fasta should be a problem. And given that there's no actual spec for the fasta format, I'm open to making a change to the Perl parsing if I understand the change I'm making. But I don't, at present, understand what is breaking in your current examples, so I don't know how to make a safe change to the code.
Can you point me to a file with the examples you're having problems with?
On Thu, Oct 03, 2019 at 06:49:37AM -0700, Derrick Wood wrote:
Can you point me to a file with the examples you're having problems with? The CI test suite of the Debian package has a sample data set:
https://salsa.debian.org/med-team/kraken/tree/master/debian/tests/test_data
This is processed with this script:
https://salsa.debian.org/med-team/kraken/blob/master/debian/tests/run-unit-test
Its basically identical for kraken2 (both version are using the same perl code). Hmmm, somehow there was a problem with kraken2 as I see commented text:
https://salsa.debian.org/med-team/kraken2/blob/master/debian/tests/run-unit-test
May be you can comment on this. Its quite some time since I dealt with this and its not my daily business. Kind regards, Andreas.
Hi Andreas,
Looking at the code and the example files, it does not appear that this is a FASTA parsing issue, but rather an issue with parsing individual items of the sequence ID header and their suitability within Kraken. It looks as if the FASTA headers only have GI numbers in them. These are no longer acceptable for use in Kraken (or Kraken 2) for aiding taxonomy lookups, due to NCBI's move away from GI numbers. The patch you used caused the sequence ID to become only a number (e.g., "441431932"), which was interpreted as a taxid by the kraken2lib::check_seqid() subroutine. Had the test actually tried to classify the reads, it would have found the taxids to be incorrect.
The error I'm seeing when I try to use the scan_fasta_file.pl script to examine your test FASTAs is:
scan_fasta_file.pl: unable to determine taxonomy ID for sequence gi|441431932|
The sequence ID is being parsed correctly by the script, but with the move away from GI numbers, the sequence ID now lacks any viable token for aiding taxonomy ID lookup. Acceptable replacements are either an explicit taxid in the sequence ID (e.g., >9606
or >humanseq|kraken:taxid|9606
) or an accession number (e.g., >NC_230938.1
).
In short, the test is failing because it is no longer appropriate. The Kraken 2 test should also have the --minimizer-len 5
removed because minimizer behavior is different in Kraken 2 vs. Kraken 1. In K1, the length of minimizers governed the size of the database.idx
file, which would be rather large (8 GB) by default, so changing the minimizer length for a test like this made sense. In Kraken 2, no such index exists. Removing the --minimizer-len 5
should allow the K2 test to work without needing to comment out the build/classification commands.
Thanks a lot for the explanation. I have forwarded it to the people working with it. However, I think one change for kraken[2] should be done: Print an appropriate error message explaining the issue. Kind regards, Andreas.
Hi, the Debian packaged kraken has added a simple test suite that used to work with version 0.10.5 but stoped working since version 1.0. Some investigation has shown that the fasta examles (for instance
can not be read an more due to a to simple parsing algorithm in file _scripts/scan_fastafile.pl.
I've added a very simple patch to the Debian package which does not cover all valid fasta header lines but at least works with the provided examples. Please provide a more robust method to parse fasta files. I'd strongly recommend using bioperl for this kind of tasks since it has solidly tested and robust code to deal with all kind of sequence formats. Kind regards, Andreas.