Open fanhuan opened 3 years ago
@fanhuan did you find any solution? I am having exactly the same issue with X57170.1.
likewise! any solution?
Hi this has been a while and I don't remember what I did... I do have a working kraken2-nt database so I'm guessing I might have just deleted that entry from the fasta file... sorry.
I gave up! I deleted the offending sequence from a 260 GB fasta file, though I could see nothing wrong with it, only to rerun and get an error for a new ID. This is obviously a bug in the program that needs fixing. The solution is either to ignore sequences with an undetermined taxonomy ID, or to list all offenders so the user can fix them.
I am also experiencing this issue. It seems like the NT fasta uses ^A (aka \01) as a delimiter to put multiple accessions on one line. I replaced the ^A with \t and it fixed the above error:
wget http://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
pv nt.gz | zcat | cat -v | sed 's/\^A/\t/g' > nt.fasta
kraken2-build --download-taxonomy --db nt
kraken2-build --add-to-library ./nt.fasta --db nt
kraken2-build --build --threads 6 --db nt
Actually, never mind, that didn't work. I now get errors like "unable to determine taxonomy ID for sequence 3BBV_z", which complain that seqid 3BBV_z can't be found in the taxonomy db. 3BBV_z corresponds to taxid 274, but I'm not sure why it's not in the accession2taxid maps. The nt.fasta is (as of today) 9,919,957,516 lines long and the first "seqid can't be found" error occurs at line 250,738,815:
3BBV_z Chain z, tRNA(Phe) 4V4I_z Chain z, P-site PHE-tRNA 2OW8_z Chain z, P-site PHE-tRNA
So this line has PDB accession ids but not NCBI ones I guess?
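To see every accession that shares one header line, a small Python sketch like the following can help. It assumes the raw header still contains the ^A (\x01) delimiter described above; the function name split_fasta_header is made up for illustration, and the example header is reconstructed from the line quoted above, not copied from nt itself.

```python
from typing import List, Tuple


def split_fasta_header(header: str) -> List[Tuple[str, str]]:
    """Split a merged header into (seqid, description) pairs.

    Assumes entries are separated by the ^A control character (\x01),
    as observed in this thread for the NT fasta.
    """
    entries = []
    for part in header.lstrip(">").split("\x01"):
        part = part.strip()
        if not part:
            continue
        # The sequence ID is everything up to the first space.
        seqid, _, description = part.partition(" ")
        entries.append((seqid, description))
    return entries


# Reconstructed example header (assumption: \x01 sits between the entries).
header = (
    ">3BBV_z Chain z, tRNA(Phe)"
    "\x014V4I_z Chain z, P-site PHE-tRNA"
    "\x012OW8_z Chain z, P-site PHE-tRNA"
)

for seqid, description in split_fasta_header(header):
    print(seqid, "|", description)
```

Each seqid printed this way can then be looked up by hand in the accession2taxid maps to check which ones are actually missing.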
I think the easiest way to fix this is to make scan_fasta_file.pl print a warning for missing sequence ids instead of throwing an error, and just continue on, skipping the missing ones.
Edit: I figured out how to do the above: you have to pass --lenient to scan_fasta_file.pl. This is done when downloading nr or nt, but when building a custom library you have to edit the source code of scan_fasta_file.pl and add --lenient yourself.
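For anyone who would rather pre-filter their own fasta than patch the scripts, the "warn and skip" idea can be sketched in Python roughly as below. This is not how scan_fasta_file.pl is implemented; filter_known, the toy accession name, and the in-memory accession-to-taxid dictionary are all assumptions made for illustration.

```python
import sys
from typing import Dict, Iterable, Iterator, Tuple


def filter_known(
    records: Iterable[Tuple[str, str]],
    acc2taxid: Dict[str, int],
    warn=sys.stderr,
) -> Iterator[Tuple[str, int, str]]:
    """Yield (seqid, taxid, sequence) for records with a known taxid.

    Records whose seqid is missing from acc2taxid produce a warning
    and are skipped instead of aborting the whole run.
    """
    for seqid, sequence in records:
        taxid = acc2taxid.get(seqid)
        if taxid is None:
            print(f"warning: no taxonomy ID for {seqid}, skipping", file=warn)
            continue
        yield seqid, taxid, sequence


# Toy data (hypothetical): one accession is mapped, one is not.
acc2taxid = {"toy_acc_1": 1}
records = [("toy_acc_1", "ACGT"), ("3BBV_z", "GGCC")]

kept = list(filter_known(records, acc2taxid))
print(kept)
```

The skipped IDs go to stderr, so redirecting stderr to a file gives a list of all offenders in one pass rather than one error per rerun.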
How can I fix it? Edit: I edited the add_to_library.sh file and added the --lenient parameter. Perfectly solved. Thanks!
@TheSallyGardens can you expand on how you did this? show your editing/commands? or even attach your edited files?
First, find the add_to_library.sh file. Then add --lenient to the scan_fasta_file.pl call, as shown in the figure.
[Figure: add --lenient to scan_fasta_file.pl]
Cheers @TheSallyGardens !!! Seems to be working perfectly. Not throwing any errors as yet.
Russell
Having the same issue, and it seems to be related to #50 (which suggests the same solution as @TheSallyGardens). However, it would be nice to have an explicit option to enable --lenient behavior, rather than changing source files.
Hi,
I was trying to build a kraken2_nt database. First I tried kraken2-build --download-library nt --db $DBNAME, but I was not able to download the database. The error message is:
Therefore I downloaded the fasta format of nt from NCBI and was hoping to add it to the database. When I tried to do that via kraken2-build --add-to-library nt.fa --db kraken2_nt, I got the error message:
When I looked up 4W1Z_7 in nt.fa, the description line of that record is:
This means the accession number of this record should be X57170.1 instead of 4W1Z_7. I checked scan_fasta_file.pl. It says that
But I don't know why X57170.1 wasn't recognized as an accession number.
Any help would be appreciated.