Closed SergeyBaikal closed 8 months ago
This is the mysterious file format issue I could not reproduce. Do you still have the assembly_summary.txt file in those folders?
Yes. This file is downloaded anew each time.
Could you please share the file with me? I'll check which column is misformatted.
Yes, of course. Attached. assembly_summary.txt
Ah, I thought GCF.. and 1_ViralProj14133 were different files, but it is just one file name that got wrapped. Can you send the list of the downloaded file names? I can cross-reference those files with the assembly_summary file. Thank you.
I attach a list of all downloaded file names in the directory.
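A rough sketch of the kind of cross-reference discussed here, not the developers' actual procedure. It assumes that column 20 of assembly_summary.txt holds the FTP path whose last component is the file-name prefix, and that downloaded files are named `<prefix>_genomic...`. The function name `unlisted_files` and the file `downloaded_files.txt` are hypothetical.

```shell
# Print downloaded file names whose prefix does not appear in assembly_summary.txt.
# $1 = assembly_summary.txt (tab-separated; column 20 assumed to be the FTP path)
# $2 = text file with one downloaded file name per line
unlisted_files() {
    awk -F '\t' '
        NR == FNR {
            # First file: remember the last path component of each FTP path.
            if ($0 !~ /^#/) {
                n = split($20, parts, "/")
                known[parts[n]] = 1
            }
            next
        }
        {
            # Second file: strip "_genomic" and everything after it, then look it up.
            base = $0
            sub(/_genomic.*$/, "", base)
            if (!(base in known)) print $0
        }
    ' "$1" "$2"
}

# Example (hypothetical file names):
#   ls library/*/ > downloaded_files.txt
#   unlisted_files assembly_summary.txt downloaded_files.txt
```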
Thank you for sharing the files. It seems the extra files are the fasta files from before dustmasker was run. You can run "cat library/*/*dustmasked.fna > input-sequences.fna" to create the concatenated file for centrifuge-build.
This is strange, because the raw fasta file from before dustmasking should have been removed. Could you please check whether the file GCF_902141595.1_P2_4B2_genomic and its dustmasked version, GCF_902141595.1_P2_4B2_genomic_dustmasked.fna, have the same number of nucleotides? It's possible that dustmasker somehow failed on this sample.
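One way to run this check, as a minimal sketch: count every sequence character in each file, skipping header lines and line breaks. The helper name `count_nucleotides` is made up for illustration, and the raw file's extension in the example is assumed.

```shell
# Count sequence characters in a FASTA file, ignoring ">" header lines
# and line endings. Masked bases (lowercase or N) are still counted.
count_nucleotides() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Example (raw file extension assumed):
#   count_nucleotides GCF_902141595.1_P2_4B2_genomic.fna
#   count_nucleotides GCF_902141595.1_P2_4B2_genomic_dustmasked.fna
```

If the two numbers match, dustmasker at least processed the whole file.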
The two files contain the same number of nucleotides. For some reason the files from before dustmasking were not deleted, although I tried the chmod 777 command. The file input-sequences.fna now contains 105575 sequences.
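For reference, a sequence count like this can be obtained by counting header lines. This is only a sketch; the helper name `count_sequences` is hypothetical.

```shell
# Count FASTA records by counting lines that start with ">".
count_sequences() {
    grep -c '^>' "$1"
}

# Example:
#   count_sequences input-sequences.fna
```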
Will these messages affect the result?
Warning: taxonomy id doesn't exists for NZ_CP046745.1!
Warning: taxonomy id doesn't exists for NZ_CP046746.1!
Warning: taxonomy id doesn't exists for NZ_CP094758.1!
Warning: taxonomy id doesn't exists for NZ_CP094759.1!
Warning: taxonomy id doesn't exists for NZ_CP138355.1!
Warning: Taxonomy ID 754189 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 990144 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1126411 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1195365 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1195373 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1268383 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1344113 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1398149 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1401445 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1439369 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1516075 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1516079 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1572237 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1732201 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1776868 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1968339 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 1987722 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2003501 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2006690 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
Warning: Taxonomy ID 2041149 is not in the provided taxonomy tree (taxonomy/nodes.dmp)!
As long as you don't include those files in the concatenated fasta file that is the input for centrifuge-build, I think it should be fine.
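To see exactly which sequences and taxonomy IDs the warnings refer to, the messages can be pulled out of a saved log. This sketch assumes the warnings were redirected to a file (the name `build.log` is hypothetical) and that they have exactly the wording quoted above. Both function names are made up for illustration.

```shell
# Sequence IDs reported as having no taxonomy ID.
# The "." in "doesn.t" matches the apostrophe without extra quoting.
missing_seqids() {
    sed -n 's/^Warning: taxonomy id doesn.t exists for \(.*\)!$/\1/p' "$1"
}

# Unique taxonomy IDs reported as absent from the taxonomy tree, sorted numerically.
missing_taxids() {
    sed -n 's/^Warning: Taxonomy ID \([0-9][0-9]*\) is not in the provided taxonomy tree.*$/\1/p' "$1" | sort -un
}

# Example (hypothetical log file):
#   missing_seqids build.log
#   missing_taxids build.log
```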
Thank you for helping me understand. Problem solved.
Dear developers! When downloading the databases, I was not able to download all sequences the first time, so I restarted and downloaded them again. I now see more downloaded files in the directory than there are in the list. Will this affect how the tool works? And won't there be duplicates?
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
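A quick way to check the duplicate concern on the mapping file, offered as a sketch rather than an official procedure. It assumes seqid2taxid.map is tab-separated with the sequence ID in the first column. The helper name `duplicate_seqids` is hypothetical.

```shell
# Print every sequence ID that appears more than once in the map file.
# Empty output means no duplicated IDs were found.
duplicate_seqids() {
    cut -f1 "$1" | sort | uniq -d
}

# Example:
#   duplicate_seqids seqid2taxid.map
```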