Problems opening usual fasta files

tillea commented 5 years ago

Hi, the Debian packaged kraken has added a simple test suite that used to work with version 0.10.5 but stoped working since version 1.0. Some investigation has shown that the fasta examles (for instance

>gi|441431932| Acartia tonsa copepod circovirus isolate 154_D11, complete genome

can not be read an more due to a to simple parsing algorithm in file _scripts/scan_fastafile.pl.

I've added a very simple patch to the Debian package which does not cover all valid fasta header lines but at least works with the provided examples. Please provide a more robust method to parse fasta files. I'd strongly recommend using bioperl for this kind of tasks since it has solidly tested and robust code to deal with all kind of sequence formats. Kind regards, Andreas.

dfornika commented 5 years ago

Hi @tillea I've submitted your patch as a pull-request (#142).

DerrickWood commented 4 years ago

@tillea I've had to reopen this issue and revert the patch due to the bug it introduced in #166.

I'm very hesitant to make Bioperl a dependency for Kraken 2. Can you explain what in the provided examples (which I can't find) is causing the parsing issue?

tillea commented 4 years ago

Hmmm, sorry that my patch broke some other things. :-( Bioperl has well tested routines to read and write sequences. It should be able to read any valid file format easily.

DerrickWood commented 4 years ago

I don't doubt that bioperl is fairly well-tested, but introducing a dependency to Kraken 2 carries with it new problems (e.g., ensuring bioperl is installed on the user's computer in such a way that the K2 scripts can use it). Given that we've already got manual parsing in the C++ code (which probably has to stay given the interaction with multithreading), I don't think manually parsing fasta should be a problem. And given that there's no actual spec for the fasta format, I'm open to making a change to the Perl parsing if I understand the change I'm making. But I don't, at present, understand what is breaking in your current examples, so I don't know how to make a safe change to the code.

Can you point me to a file with the examples you're having problems with?

tillea commented 4 years ago

On Thu, Oct 03, 2019 at 06:49:37AM -0700, Derrick Wood wrote:

Can you point me to a file with the examples you're having problems with? The CI test suite of the Debian package has a sample data set:

https://salsa.debian.org/med-team/kraken/tree/master/debian/tests/test_data

This is processed with this script:

https://salsa.debian.org/med-team/kraken/blob/master/debian/tests/run-unit-test

Its basically identical for kraken2 (both version are using the same perl code). Hmmm, somehow there was a problem with kraken2 as I see commented text:

https://salsa.debian.org/med-team/kraken2/blob/master/debian/tests/run-unit-test

May be you can comment on this. Its quite some time since I dealt with this and its not my daily business. Kind regards, Andreas.

DerrickWood commented 4 years ago

Hi Andreas,

Looking at the code and the example files, it does not appear that this is a FASTA parsing issue, but rather an issue with parsing individual items of the sequence ID header and their suitability within Kraken. It looks as if the FASTA headers only have GI numbers in them. These are no longer acceptable for use in Kraken (or Kraken 2) for aiding taxonomy lookups, due to NCBI's move away from GI numbers. The patch you used caused the sequence ID to become only a number (e.g., "441431932"), which was interpreted as a taxid by the kraken2lib::check_seqid() subroutine. Had the test actually tried to classify the reads, it would have found the taxids to be incorrect.

The error I'm seeing when I try to use the scan_fasta_file.pl script to examine your test FASTAs is:

scan_fasta_file.pl: unable to determine taxonomy ID for sequence gi|441431932|

The sequence ID is being parsed correctly by the script, but with the move away from GI numbers, the sequence ID now lacks any viable token for aiding taxonomy ID lookup. Acceptable replacements are either an explicit taxid in the sequence ID (e.g., >9606 or >humanseq|kraken:taxid|9606) or an accession number (e.g., >NC_230938.1).

In short, the test is failing because it is no longer appropriate. The Kraken 2 test should also have the --minimizer-len 5 removed because minimizer behavior is different in Kraken 2 vs. Kraken 1. In K1, the length of minimizers governed the size of the database.idx file, which would be rather large (8 GB) by default, so changing the minimizer length for a test like this made sense. In Kraken 2, no such index exists. Removing the --minimizer-len 5 should allow the K2 test to work without needing to comment out the build/classification commands.

tillea commented 4 years ago

Thanks a lot for the explanation. I have forwarded it to the people working with it. However, I think one change for kraken[2] should be done: Print an appropriate error message explaining the issue. Kind regards, Andreas.

DerrickWood / kraken2

Problems opening usual fasta files #141