DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

taxonomy id doesn't exist! #117

Open shlomobl opened 6 years ago

shlomobl commented 6 years ago

Hi,

While I'm building my index with centrifuge-build, I got these errors:

1) Warning: Encountered reference sequence with only gaps

2) Warning: taxomony id doesn't exists for NZ_FWVH01000246.AA!
Warning: taxomony id doesn't exists for NGGTT!
Warning: taxomony id doesn't exists for NZ_FWQG010ACCGATCAGCAGCACCAGCAGCAGGCAGGCCATTACCGCCCCCAGCGA!
Warning: taxomony id doesn't exists for NZ_FWDZ010001TTATTATTATGCCAACCATTGGTTTTA!

Does this mean something is wrong with the input files? Where should I look for errors?

mourisl commented 6 years ago

It's normal for a few sequences not to be found in the taxonomy tree. If you want them in the index, you need to add entries for them to the seqid_to_taxid file. For example, you need to add a taxonomy id for NZ_FWVH01000246.AA. You also need to make sure that taxonomy id is present in the nodes.dmp and names.dmp files.
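Both conditions can be checked for a single sequence with something like the sketch below. It assumes the conversion table is a two-column, tab-separated file (sequence ID, then taxonomy ID) and that nodes.dmp and names.dmp use the NCBI dump layout, with the taxonomy ID as the first tab-delimited field. The function names are hypothetical helpers, not part of Centrifuge.

```shell
# Hypothetical helpers (not part of Centrifuge).

# Print the taxonomy ID recorded for a sequence ID in a two-column,
# tab-separated conversion table; prints nothing if the ID is absent.
taxid_for_seqid() {
    awk -F '\t' -v id="$1" '$1 == id { print $2; exit }' "$2"
}

# Succeed if the taxonomy ID is the first tab-delimited field of some
# line in an NCBI-style dump file such as nodes.dmp or names.dmp.
taxid_in_dump() {
    cut -f1 "$2" | grep -Fqx -- "$1"
}

# Example usage (file names are assumptions; adjust to your layout):
#   taxid=$(taxid_for_seqid NZ_LT969520.1 seqid2taxid.map)
#   taxid_in_dump "$taxid" taxonomy/nodes.dmp && echo "in nodes.dmp"
#   taxid_in_dump "$taxid" taxonomy/names.dmp && echo "in names.dmp"
```

An empty result from the first helper means the sequence has no row in the conversion table at all; a non-zero status from the second means the row points at a taxonomy ID the dump does not contain.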

shlomobl commented 6 years ago

Thanks. What about the second error?

Warning: taxomony id doesn't exists for NZ_FWVH01000246.AA!
Warning: taxomony id doesn't exists for NGGTT!
Warning: taxomony id doesn't exists for NZ_FWQG010ACCGATCAGCAGCACCAGCAGCAGGCAGGCCATTACCGCCCCCAGCGA!
Warning: taxomony id doesn't exists for NZ_FWDZ010001TTATTATTATGCCAACCATTGGTTTTA!

It looks like something went wrong with the reference file, but I can't find it...

mourisl commented 6 years ago

My previous answer was about the second warning.

The "only gaps" warning might be about the sequences with low complexity.

shlomobl commented 6 years ago

Isn't it strange that the beginning of the sequence was joined to the accession number (NZ_FWQG010ACCGAT...)? Perhaps this is what causes the error message?

mourisl commented 6 years ago

Yes. But I'm not sure what causes the concatenation.

shlomobl commented 6 years ago

At least I found that it happens during the "cat" step when generating the input reference files. The original *.fna file downloaded seems to be OK. It's strange because I can't find a pattern for this error, say, every X entries.
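One way to locate the damaged records in the concatenated file is to flag header lines whose ID does not look like an accession, plus any sequence line that contains a `>`. The sketch below assumes RefSeq-style IDs (two letters, an underscore, alphanumerics, then a dot and a numeric version); the function name is hypothetical, and the pattern would need adjusting for other naming schemes.

```shell
# Hypothetical helper: print "line_number: line" for FASTA lines that
# look like the result of a bad concatenation.
find_suspect_headers() {
    awk '
        /^>/ {
            # The ID is the first whitespace-delimited token without ">".
            id = substr($1, 2)
            # Assumed accession shape: XX_ALNUM.VERSION
            if (id !~ /^[A-Z][A-Z]_[A-Z0-9]+\.[0-9]+$/) print NR ": " $0
            next
        }
        # A ">" in the middle of a sequence line means a header was
        # appended to unfinished sequence data.
        />/ { print NR ": " $0 }
    ' "$1"
}

# Example usage (file name is an assumption):
#   find_suspect_headers input-sequences.fna | head
```

Comparing the reported line numbers against the original per-genome files should show which neighbouring files were spliced together.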

mourisl commented 6 years ago

Oh, I see. Could you please run "ls | grep ".fna$" | xargs cat >> ..." to concat the files?

shlomobl commented 6 years ago

Hmmm, that's what I did, but do you mean without the -n and -P options?

mourisl commented 6 years ago

Yes, without them.
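A likely explanation, offered as an assumption and not confirmed in this thread: with -P, xargs runs several cat processes at once, and their writes to the shared output file can interleave, which would splice sequence fragments into header lines. A minimal sketch of a serial alternative follows. The function name is hypothetical, and `awk 1` stands in for cat so that a file lacking a final newline cannot run into the next header.

```shell
# Hypothetical helper: concatenate every *.fna file under a directory,
# one file after another, into a single output file.
# Keep the output file outside src_dir so find does not pick it up.
concat_fna() {
    src_dir=$1
    out_file=$2
    # "-exec ... {} +" batches file names but runs the batches one at a
    # time, so output from different files never interleaves.
    # "awk 1" prints each input line followed by a newline, which also
    # terminates a last line that had no trailing newline.
    find "$src_dir" -type f -name '*.fna' -exec awk 1 {} + > "$out_file"
}

# Example usage (paths are assumptions; adjust to your layout):
#   concat_fna library input-sequences.fna
```

The order in which find visits files is unspecified, so the records may come out in a different order than with a shell glob.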

SK-N-BE commented 6 years ago

Hi, I have the same issue, but for all of the IDs. This is what I have done:

centrifuge-download -o taxonomy taxonomy
centrifuge-download -o library -m -d "bacteria" refseq > seqid2taxid.map

after downloading:

cat library/*/*.fna > input-sequences.fna

--> when the fna file was created:

centrifuge-build -p 4 --conversion-table seqid2taxid.map \
    --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp \
    input-sequences.fna centrifuge_bacteria

Three files are generated:

  1. centrifuge_bacteria.1.cf
  2. centrifuge_bacteria.2.cf
  3. centrifuge_bacteria.3.cf

However, for all bacteria in the database I get the message:

Warning: taxomony id doesn't exists for NZ_LT969520.1!
Warning: taxomony id doesn't exists for NZ_LT985474.1!
Warning: taxomony id doesn't exists for NZ_LT985188.1!
Warning: taxomony id doesn't exists for NZ_LT990039.1!
Warning: taxomony id doesn't exists for NZ_LT991954.1!
Warning: taxomony id doesn't exists for NZ_LT991955.1!
Warning: taxomony id doesn't exists for NZ_LT991956.1!
Warning: taxomony id doesn't exists for NZ_LT991957.1!
Warning: taxomony id doesn't exists for NZ_LT991958.1!
Warning: taxomony id doesn't exists for NZ_LT991959.1!
Warning: taxomony id doesn't exists for NZ_LT991960.1!
Warning: taxomony id doesn't exists for NZ_LT992488.1!
Warning: taxomony id doesn't exists for NZ_LT992489.1!
Warning: taxomony id doesn't exists for NZ_LT992486.1!
Warning: taxomony id doesn't exists for NZ_LT992487.1!
Warning: taxomony id doesn't exists for NZ_LT992492.1!
Warning: taxomony id doesn't exists for NZ_LT992493.1!
Warning: taxomony id doesn't exists for NZ_LT992502.1!
Warning: taxomony id doesn't exists for NZ_LS398547.1!

and so on. The files names.dmp and nodes.dmp do exist.

When I use ls | grep ".fna$" | xargs cat >> sequences.fna, I only get an empty file.
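The empty file is probably because ls only lists the current directory, while the earlier `cat library/*/*.fna` command shows the FASTA files sitting in subdirectories, so the grep matches nothing. The sketch below, with a hypothetical function name, compares the two counts to confirm where the files are. Note that `-maxdepth` is a common find extension and not part of POSIX.

```shell
# Hypothetical helper: count *.fna files directly inside a directory
# versus anywhere beneath it.
count_fna() {
    dir=$1
    # tr strips the padding some wc implementations add.
    top=$(find "$dir" -maxdepth 1 -type f -name '*.fna' | wc -l | tr -d ' ')
    all=$(find "$dir" -type f -name '*.fna' | wc -l | tr -d ' ')
    echo "top-level: $top, recursive: $all"
}

# Example usage from the directory where the commands were run:
#   count_fna .
```

A top-level count of 0 alongside a non-zero recursive count would mean the pipeline has to be run inside each subdirectory or replaced by a recursive command.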

mourisl commented 6 years ago

It's normal to have some taxonomy ids missing. You can grep, for example, "NZ_LT969520" in the *.map file to make sure. If nothing is found, the corresponding genome is somehow not registered in the taxonomy tree.
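To check every sequence at once instead of grepping one accession at a time, the sketch below lists the FASTA IDs that have no entry in the conversion table. It assumes the ID is the header text up to the first whitespace and that the map is tab-separated with the sequence ID in the first column; the function name is hypothetical.

```shell
# Hypothetical helper: print each FASTA sequence ID that is absent from
# the first column of a tab-separated conversion table.
missing_seqids() {
    fasta=$1
    map=$2
    awk -F '\t' '
        # First file (the map): remember every sequence ID.
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file (the FASTA): test each header ID against the map.
        /^>/ {
            split(substr($0, 2), parts, /[ \t]/)
            if (!(parts[1] in known)) print parts[1]
        }
    ' "$map" "$fasta"
}

# Example usage (file names are assumptions; adjust to your layout):
#   missing_seqids input-sequences.fna seqid2taxid.map | wc -l
#   grep -c '^>' input-sequences.fna
```

If the first count is close to the second, the problem is more likely the map file as a whole (for example, empty or generated for a different download) than individual unregistered genomes.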

barakova commented 5 years ago

Hi,

I am having the same issue as mentioned by @SK-N-BE. I understand the warnings are not a problem in themselves, but after printing them it does nothing. I let it run for one day and it did not stop. When I tried again, it stalled at the same point. Should I stop it myself, and does that mean it has finished? I don't know how to tell whether everything went as planned.

Thank you and hope you understand my question.

Alžběta