DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Warning: Encountered reference sequence with only gaps #249

Open wittler-github opened 1 year ago

wittler-github commented 1 year ago

As you can see in the attached files, I get this warning many times, yet centrifuge completes without error. The index is built from a very large data set, about 40-70 GB I reckon.

Is this a significant issue? Should one clean up the NCBI .fna files that contain only NNNNN... and no real sequence? The input .fna files were dustmasked with the centrifuge-download -d option. Is this just a statistical issue that is negligible given the large amount of data used, or is it something one should rectify?

centrifuge_build.zip

Warning: Encountered reference sequence with only gaps
Warning: Encountered reference sequence with only gaps
.....

mourisl commented 1 year ago

I think it is fine to ignore those sequences. Many such cases come from dustmasker, which removes simple (low-complexity) sequences, among others. So even if their original sequences were kept, they would be hard to classify with.
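For anyone who wants to see which records trigger the warning, here is a minimal sketch in Python that lists FASTA records containing no real bases. It is not part of centrifuge; the helper names (`read_fasta`, `is_gap_only`, `gap_only_ids`) are hypothetical, and it assumes that "only gaps" means every character of the sequence is `N`, `n`, or `-`, which may not match centrifuge's internal check exactly.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: these characters count as "gaps" (masked or missing bases).
GAP_CHARS = set("Nn-")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_gap_only(seq: str) -> bool:
    """True if the sequence has no character outside GAP_CHARS (empty counts too)."""
    return all(ch in GAP_CHARS for ch in seq)


def gap_only_ids(lines: Iterable[str]) -> List[str]:
    """Return the first header token of every record that has no real bases."""
    return [
        (header.split() or [""])[0]
        for header, seq in read_fasta(lines)
        if is_gap_only(seq)
    ]
```

To run it on a real file, pass an open file handle, e.g. `with open("some_file.fna") as fh: print(gap_only_ids(fh))`, where the file name is a placeholder. Note that this sketch holds one whole sequence in memory at a time.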

wittler-github commented 1 year ago

I think so too. In this case only a very small fraction of the reference sequences triggered this warning, and the very large input data (about 40-60 Gb) was also dustmasked.
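To put a number on "a very small fraction", a rough sketch like the following counts gap-only records against the total while streaming line by line, so a large .fna file never has to fit in memory. The function name `count_gap_only` is hypothetical, and as above it assumes a record is gap-only when it contains nothing but `N`, `n`, or `-`.

```python
from typing import Iterable, Tuple


def count_gap_only(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (gap_only_records, total_records) for a stream of FASTA lines."""
    total = 0
    gap_only = 0
    in_record = False
    has_real_base = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if in_record:
                total += 1
                if not has_real_base:
                    gap_only += 1
            in_record = True
            has_real_base = False
        elif in_record and line:
            # Anything left after stripping N and '-' from both ends is a real base.
            if line.upper().strip("N-"):
                has_real_base = True
    # Close the final record, if any.
    if in_record:
        total += 1
        if not has_real_base:
            gap_only += 1
    return gap_only, total
```

With a file handle `fh`, `gap_only, total = count_gap_only(fh)` gives the two counts, and `gap_only / total` is the fraction (guard against `total == 0` for empty files).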