DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

scan_fasta_file.pl: unable to determine taxonomy ID for sequence x #372

Open fanhuan opened 3 years ago

fanhuan commented 3 years ago

Hi,

I was trying to build a kraken2_nt database. First I tried kraken2-build --download-library nt --db $DBNAME, but the download failed with this error:

Downloading nt database from server... rsync: read error: Connection reset by peer (104)
rsync error: error in socket IO (code 10) at io.c(794) [receiver=3.1.2]
rsync: connection unexpectedly closed (627 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [generator=3.1.2]

Therefore I downloaded nt in FASTA format from NCBI, hoping to add it to the database. When I tried to do that via kraken2-build --add-to-library nt.fa --db kraken2_nt, I got the error message:

scan_fasta_file.pl: unable to determine taxonomy ID for sequence 4W1Z_7

When I looked up 4W1Z_7 in nt.fa, the description line of that record is:

>X57170.1 B.taurus 5S rRNA gene4W1Z_7 Chain 7, Structure Of The Mammalian 60s Ribosomal Subunit (this Entry Contains The Large Ribosomal Subunit Rna)4W21_7 Chain 7, Structure Of The 80s Mammalian Ribosome Bound To Eef2 (this Entry Contains The Large Ribosomal Subunit Rna)4W24_7 Chain 7, Structure Of The Idle Mammalian Ribosome-sec61 Complex (this Entry Contains The Large Ribosomal Subunit Rna)4W26_7 Chain 7, Structure Of The Translating Mammalian Ribosome-sec61 Complex (this Entry Contains The Large Ribosomal Subunit Rna)

So the accession number of this record should be X57170.1, not 4W1Z_7. I checked scan_fasta_file.pl, which says:

Headers are OK if a taxonomy ID is found (as either the entire sequence ID or as part of a "kraken:taxid" token), or if something looking like an accession number is found.

But I don't know why X57170.1 wasn't recognized as an accession number.

Any help would be appreciated.

CuypersBart commented 2 years ago

@fanhuan did you find any solution? I am having exactly the same issue with X57170.1.

rjsorr commented 2 years ago

likewise! any solution?

fanhuan commented 2 years ago

Hi, it has been a while and I don't remember what I did... I do have a working kraken2-nt database, so I'm guessing I just deleted that entry from the fasta file... sorry.

rjsorr commented 2 years ago

I gave up! I deleted the offending sequence from a 260 GB fasta file, though I could see nothing wrong with it, only to rerun and get an error for a new ID. This is obviously a bug in the program that needs fixing. The solution would be either to ignore sequences with an undetermined taxonomy ID, or to list all offenders so a user can fix them.

max-mapper commented 2 years ago

I am also experiencing this issue. It seems the nt FASTA uses ^A (aka \01) as a delimiter to put multiple accessions on one header line. I replaced the ^A with \t and it fixed the above error:

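# Download the nt FASTA from NCBI
wget http://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
# cat -v renders each Ctrl-A byte as the two characters "^A"; sed then turns those into a tab
pv nt.gz | zcat | cat -v | sed 's/\^A/\t/g' > nt.fasta
# Build the database from the converted file
kraken2-build --download-taxonomy --db nt
kraken2-build --add-to-library ./nt.fasta --db nt
kraken2-build --build --threads 6 --db nt

max-mapper commented 2 years ago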

Actually, never mind, that didn't work. I now get errors like unable to determine taxonomy ID for sequence 3BBV_z, which complain that seqid 3BBV_z can't be found in the taxonomy db. 3BBV_z corresponds to taxid 274, but I'm not sure why it's not in the accession2taxid maps. The nt.fasta is (as of today) 9,919,957,516 lines long and the first "seqid can't be found" error occurs at line 250,738,815:

3BBV_z Chain z, tRNA(Phe)    4V4I_z Chain z, P-site PHE-tRNA 2OW8_z Chain z, P-site PHE-tRNA

So this line has PDB accession ids but not NCBI ones I guess?
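
One way to check this by hand is sketched below: list every ID that appears in the merged headers, then report the ones that are absent from an accession2taxid map. This is only a sketch under assumptions. It assumes extra entries in a header are separated by ^A (byte \001), as described earlier in the thread, and that the map is tab-separated with the accession in column 1 and accession.version in column 2 (the layout of NCBI's nucl_gb.accession2taxid). The function names and the paths in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are hypothetical and not part of Kraken 2.

# Print the first token of every entry in each FASTA header line.
# Assumes extra entries in a header are separated by Ctrl-A (\001).
list_header_ids() {
    awk '/^>/ { sub(/^>/, ""); print }' "$1" \
        | tr '\001' '\n' \
        | awk 'NF { print $1 }'
}

# Read IDs from stdin (one per line) and print those absent from the map in $1.
# Assumes a non-empty, tab-separated map:
#   column 1 = accession, column 2 = accession.version
ids_missing_from_map() {
    awk -F'\t' '
        NR == FNR      { known[$1] = 1; known[$2] = 1; next }
        !($0 in known) { print }
    ' "$1" -
}

# Usage (paths are hypothetical):
#   list_header_ids nt.fasta | sort -u \
#       | ids_missing_from_map taxonomy/nucl_gb.accession2taxid
```

On a file the size of nt, the map lookup holds every accession in memory, so trying it on a small excerpt of the headers first is probably wise.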

max-mapper commented 2 years ago

I think the easiest way to fix this is to make scan_fasta_file.pl print a warning for missing sequence IDs instead of throwing an error, and then continue on, skipping the missing ones.

Edit: I figured out how to do the above: you have to pass --lenient to scan_fasta_file.pl. This is done when downloading nr or nt, but when building a custom library you have to edit the source code of scan_fasta_file.pl and add --lenient yourself.

TheSallyGardens commented 2 years ago

How can I fix it? Edit: I edited the add_to_library.sh file and added the --lenient parameter. Perfectly solved. Thanks!

rjsorr commented 2 years ago

@TheSallyGardens can you expand on how you did this? Could you show your edits/commands, or even attach your edited files?

TheSallyGardens commented 2 years ago

First, find the add_to_library.sh file. Then add --lenient to the scan_fasta_file.pl call, as shown in the figure.
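
The edit described above could be applied with a one-line patch, sketched here. It assumes add_to_library.sh invokes scan_fasta_file.pl followed by a space and its arguments; the exact line may differ between Kraken 2 versions, so check your copy first. The helper name and the path in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: add --lenient to the scan_fasta_file.pl call in add_to_library.sh.
# Assumes the call looks like `scan_fasta_file.pl <args>`; verify this in your copy.
add_lenient_flag() {
    local script="$1"
    # Keep a backup as <script>.bak; leave lines that already carry --lenient alone.
    sed -i.bak '/--lenient/!s/scan_fasta_file\.pl /scan_fasta_file.pl --lenient /' "$script"
    # Print the resulting call so the change can be checked by eye.
    grep -n 'scan_fasta_file\.pl' "$script"
}

# Usage (path is hypothetical; the script lives wherever Kraken 2 is installed):
#   add_lenient_flag /path/to/kraken2/add_to_library.sh
```

Going by the thread, with --lenient in place sequences whose taxonomy ID cannot be determined are skipped rather than stopping the run, so they will not end up in the built database.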

rjsorr commented 2 years ago

add --lenient to scan_fasta_file.pl,

Cheers @TheSallyGardens !!! Seems to be working perfectly. Not throwing any errors as yet.

Russell

fgvieira commented 1 year ago

I'm having the same issue, and it seems to be related to #50 (which suggests the same solution as @TheSallyGardens). However, it would be nice to have an explicit option to enable the --lenient behavior, rather than changing source files.