DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Database question #271

Closed SergeyBaikal closed 8 months ago

SergeyBaikal commented 8 months ago

Dear developers! When downloading the databases, I was not able to download all sequences on the first attempt, so I restarted the download. I now see more downloaded files in the directory than there are in the list. Will this affect how Centrifuge works? And will there be duplicates?

centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

centrifuge --version
/usr/local/bin/centrifuge-class version 1.0.4
64-bit
Built on debian
Mon 26 Feb 23:56:16 GMT 2024
Compiler: gcc version 8.3.0 (Debian 8.3.0-7) 
Options: -O3 -m64 -msse2 -funroll-loops -g3 -std=c++11 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
mourisl commented 8 months ago

This is the mysterious file format issue I could not reproduce. Do you still have the assembly_summary.txt file in those folders?

SergeyBaikal commented 8 months ago

Yes. This file is downloaded anew each time.

mourisl commented 8 months ago

Could you please share the file with me? I'll check which column is misformatted.

SergeyBaikal commented 8 months ago

Yes, of course. Attached. assembly_summary.txt

mourisl commented 8 months ago

Ah, I thought the GCF.., 1_ViralProj14133 were different files, but it is just one file name that got wrapped. Can you send the list of the downloaded file names? I can cross-reference those files with the assembly_summary file. Thank you.
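
A cross-reference of this kind could be sketched roughly as follows. This is only an illustration, not the script used in this thread: it assumes that each data line of assembly_summary.txt contains one tab-separated field holding an FTP/HTTPS path whose last component is the file name prefix of the assembly, and the function names `expected_prefixes` and `group_by_assembly` are hypothetical.

```python
import os


def expected_prefixes(summary_path):
    """Collect the assembly name prefixes listed in assembly_summary.txt.

    Assumption: the first field containing "://" is the download path, and
    its last path component is the prefix of the downloaded file names.
    """
    prefixes = set()
    with open(summary_path) as handle:
        for line in handle:
            if line.startswith("#"):  # skip header/comment lines
                continue
            for field in line.rstrip("\n").split("\t"):
                if "://" in field:
                    prefixes.add(field.rstrip("/").rsplit("/", 1)[-1])
                    break
    return prefixes


def group_by_assembly(file_names, prefixes):
    """Group file names by the assembly prefix they start with.

    Returns (groups, unmatched): groups maps a prefix to its files, and
    unmatched lists files that belong to no listed assembly.
    """
    groups, unmatched = {}, []
    for name in file_names:
        base = os.path.basename(name.strip())
        if not base:
            continue
        tokens = base.split("_")
        # Try the longest underscore-joined prefix first.
        for end in range(len(tokens), 0, -1):
            candidate = "_".join(tokens[:end])
            if candidate in prefixes:
                groups.setdefault(candidate, []).append(base)
                break
        else:
            unmatched.append(base)
    return groups, unmatched
```

A prefix that maps to more than one file (for example a raw and a dustmasked FASTA) would account for a directory holding more files than the list has entries.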

SergeyBaikal commented 8 months ago

I attach a list of all downloaded file names in the directory.

flist.txt

mourisl commented 8 months ago

Thank you for sharing the files. It seems the extra files are the FASTA files from before dustmasker was run. You can "cat library/*/*dustmasked.fna > input-sequences.fna" to create the concatenated file for centrifuge-build.
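
For anyone who prefers to script this step, here is a minimal Python sketch of the same concatenation. It assumes the dustmasked files end in `dustmasked.fna` and sit exactly one directory below the library folder; the function name `concatenate_dustmasked` is hypothetical.

```python
import glob
import os
import shutil


def concatenate_dustmasked(library_dir, output_path):
    """Concatenate every *dustmasked.fna file found one level below library_dir.

    Raw (non-dustmasked) FASTA files are skipped because they do not match
    the pattern. Returns the list of files that were concatenated.
    """
    pattern = os.path.join(library_dir, "*", "*dustmasked.fna")
    paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return paths
```

Calling `concatenate_dustmasked("library", "input-sequences.fna")` would then produce the input file for centrifuge-build, and the length of the returned list shows how many genomes went into it.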

This is strange, because the raw FASTA file from before dustmasking should have been removed. Could you please check whether the file GCF_902141595.1_P2_4B2_genomic and its dustmasked version, GCF_902141595.1_P2_4B2_genomic_dustmasked.fna, have the same number of nucleotides? It's possible that dustmasker somehow failed on this sample.
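
One way to run this comparison is sketched below. It simply counts every sequence character outside header lines, so masked bases (whether lower-cased or replaced by N) still count; the function name `count_nucleotides` is hypothetical.

```python
def count_nucleotides(fasta_path):
    """Return the total number of sequence characters in a FASTA file.

    Header lines (starting with '>') are ignored; masked bases such as
    lower-case letters or N are counted like any other base.
    """
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            total += len(line.strip())
    return total
```

If the two files give the same count, dustmasker most likely processed the whole genome and only the cleanup of the raw file was skipped.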

SergeyBaikal commented 8 months ago

The number of nucleotides in these two files is the same. For some reason the files from before dustmasking were not deleted, although I tried the chmod 777 command. The file input-sequences.fna now contains 105575 sequences.

Will these messages affect the result?

Warning: taxonomy id doesn't exists for NZ_CP046745.1!
Warning: taxonomy id doesn't exists for NZ_CP046746.1!
Warning: taxonomy id doesn't exists for NZ_CP094758.1!
Warning: taxonomy id doesn't exists for NZ_CP094759.1!
Warning: taxonomy id doesn't exists for NZ_CP138355.1!
Warning: Taxonomy ID 754189 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 990144 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1126411 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1195365 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1195373 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1268383 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1344113 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1398149 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1401445 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1439369 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1516075 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1516079 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1572237 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1732201 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1776868 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1968339 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1987722 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2003501 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2006690 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2041149 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
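
To see which sequences warnings like these concern, one could compare the mapping file against the taxonomy tree. The sketch below assumes that seqid2taxid.map holds a sequence ID and a taxonomy ID separated by whitespace on each line, and that the first `|`-delimited field of each line in nodes.dmp is a taxonomy ID; the function names are hypothetical.

```python
def load_tree_taxids(nodes_path):
    """Return the set of taxonomy IDs (as strings) defined in nodes.dmp."""
    taxids = set()
    with open(nodes_path) as handle:
        for line in handle:
            first = line.split("|", 1)[0].strip()
            if first:
                taxids.add(first)
    return taxids


def find_missing_taxids(map_path, nodes_path):
    """Map each taxonomy ID absent from the tree to the sequence IDs using it."""
    tree = load_tree_taxids(nodes_path)
    missing = {}
    with open(map_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            seqid, taxid = fields[0], fields[1]
            if taxid not in tree:
                missing.setdefault(taxid, []).append(seqid)
    return missing
```

Running `find_missing_taxids("seqid2taxid.map", "taxonomy/nodes.dmp")` would list the affected sequences per taxonomy ID, which helps judge how much of the library the warnings touch.
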
mourisl commented 8 months ago

As long as you don't include those files in the concatenated fasta file that is the input for centrifuge-build, I think it should be fine.

SergeyBaikal commented 8 months ago

Thank you for helping me understand. Problem solved.