FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more)
GNU General Public License v3.0

flextaxd error: "ValueError: not enough values to unpack (expected 2, got 1)" #70

Open morien opened 9 months ago

morien commented 9 months ago

I'm attempting to follow along with this part of the tutorial/wiki, to get a better understanding of how to create my own custom DB. Things are okay until I get to the database creation step:

# flextaxd -db 16S_database.db -tf GTDB_arc_bact_taxo_tree_unique.txt -tt CanSNPer --genomeid2taxid g2id.txt --dump --dbprogram kraken2 -o taxonomy --verbose --logs logs/zenodo
2024-02-07 18:08:45,291 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Warning: 16S_database.db already exists, overwrite? (y/n): y
2024-02-07 18:08:49,303 custom_taxonomy_databases [INFO ]  Loading module: ReadTaxonomyCanSNPer
2024-02-07 18:08:49,352 DatabaseConnection [INFO ]  16S_database.db opened successfully.
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  GTDB_arc_bact_taxo_tree_unique.txt
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  Fetching root name from file
2024-02-07 18:08:49,353 ReadTaxonomyCanSNPer [INFO ]  Adding, cellular organism node
2024-02-07 18:08:49,354 ReadTaxonomyCanSNPer [INFO ]  Adding root node root!
2024-02-07 18:08:49,355 custom_taxonomy_databases [INFO ]  Parse taxonomy
2024-02-07 18:08:49,355 ReadTaxonomyCanSNPer [INFO ]  Parse CanSNP tree file
2024-02-07 18:08:49,902 ReadTaxonomyCanSNPer [INFO ]  New taxonomy ids assigned 12929
Traceback (most recent call last):
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 153, in parse_genomeid2taxid
    genomeid,taxid = row.strip().split("\t")
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nnnnnn/mambaforge/bin/flextaxd", line 8, in <module>
    sys.exit(main())
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/custom_taxonomy_databases.py", line 330, in main
    read_obj.parse_genomeid2taxid(args.genomeid2taxid)
  File "/home/nnnnnn/mambaforge/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 156, in parse_genomeid2taxid
    genomeid,taxid,reference = row.strip().split("\t")
ValueError: not enough values to unpack (expected 3, got 1)

Here are the first few lines of my two input files:

# head g2id.txt 
GB_GCA_000010565.1      Pelotomaculum thermopropionicum
GB_GCA_000018565.1      Herpetosiphon aurantiacus
GB_GCA_000024525.1      Spirosoma linguale
GB_GCA_000091165.1      Methylomirabilis oxyfera_B
GB_GCA_000146855.1      Peptoanaerobacter margaretiae
GB_GCA_000147015.1      Zinderia insecticola
GB_GCA_000163995.1      Campylobacter_D jejuni_A
GB_GCA_000165065.1      Longicatena sp000165065
GB_GCA_000166295.1      Marinobacter adhaerens
GB_GCA_000168735.1      Endoriftia persephone

# head GTDB_arc_bact_taxo_tree_unique.txt
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;Aenigmatarchaeales;Aenigmatarchaeaceae;Aenigmatarchaeum;Aenigmatarchaeum_subterraneum
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;CG10238-14;CG10238-14;CG10238-14_sp002789635
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;CG10238-14;RBG-16-49-10;RBG-16-49-10_sp001784635
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;EX4484-224;EX4484-224;EX4484-224_sp002254545
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;CG10238-14;SCSR01;SCSR01;SCSR01_sp004297575
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GCA-2688965;GCA-2688965;GCA-2688965_sp002688965
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GW2011-AR5;GW2011-AR5;GW2011-AR5_sp000806115
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;GW2011-AR5;GW2011-AR5;GW2011-AR5;GW2011-AR5_sp10154u
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;QMZP01;QMZP01;QMZP01;QMZP01_sp003663225
root;Archaea;Aenigmatarchaeota;Aenigmatarchaeia;QMZP01;QMZP01;QMZY01;QMZY01_sp003663415

I'd like to use this tool, so any help is greatly appreciated.

davve2 commented 9 months ago

Hi Morien,

It looks like the headers may be the problem (if they are included in the files). If not, I think the best option is for you to supply the head of your files as text files, so that we can replicate the error locally. The error itself says that the program finds too few columns separated by . What do you use as the separator in your files? The default separator is \t.
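
One way to narrow this down before re-running flextaxd is to count the tab-separated columns on every line of the genomeid2taxid file. The sketch below is not part of FlexTaxD; `find_bad_rows` is a hypothetical helper that only mimics the `row.strip().split("\t")` call shown in the traceback, on the assumption that rows with two or three columns are the ones the parser accepts.

```python
from typing import Iterable, List, Tuple


def find_bad_rows(
    lines: Iterable[str],
    sep: str = "\t",
    allowed: Tuple[int, ...] = (2, 3),
) -> List[Tuple[int, int, str]]:
    """Return (line_number, column_count, raw_line) for every row whose
    column count is not in `allowed`.

    Hypothetical helper: it splits each row the same way the traceback
    shows (strip, then split on the separator), nothing more.
    """
    bad = []
    for lineno, row in enumerate(lines, start=1):
        columns = row.strip().split(sep)
        if len(columns) not in allowed:
            bad.append((lineno, len(columns), row.rstrip("\n")))
    return bad


# In-memory sample: line 1 is tab-separated, line 2 uses spaces,
# line 3 is blank. Lines 2 and 3 should be reported.
sample = [
    "GB_GCA_000010565.1\tPelotomaculum thermopropionicum\n",
    "GB_GCA_000018565.1      Herpetosiphon aurantiacus\n",
    "\n",
]

for lineno, ncols, raw in find_bad_rows(sample):
    print(f"line {lineno}: {ncols} column(s): {raw!r}")
```

To check a real file, pass an open file handle instead of the sample list, for example `find_bad_rows(open("g2id.txt"))`. Printing the offending line with `repr` makes it visible whether the gap between the columns is a tab (`\t`) or a run of spaces, and also exposes blank lines, which split into a single empty column.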

morien commented 9 months ago

Okay great. Yes, the default separator is \t and that's what I see reflected in my input files. Should it be . instead? Here are my input files (entire files, gzipped):

g2id.txt.gz
GTDB_arc_bact_taxo_tree_unique.txt.gz