FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more )
GNU General Public License v3.0

flextaxd strain rank? #11

Closed punnettsun closed 4 years ago

punnettsun commented 4 years ago

Hello,

I have strain names in addition to names for the species through domain levels, and I want to create NCBI-formatted nodes.dmp and names.dmp files.

Is there a way to include strain names in the taxonomy database? I see that the ranks go from species to domain, but is there a way to allow for strain level too, or is this not something that NCBI uses in its files?

Thank you.

davve2 commented 4 years ago

Hello,

If a node has a rank (for example strain) assigned in the flextaxd database, it will be printed when the database is exported. When I process the NCBI database dump files, they do contain strain taxonomic levels, which also carry over to the flextaxd export.

So I think I would need some more information about what you have in your database to be able to answer your question. If you are using custom data to fill the database, could you please provide a small subset of your source data? Otherwise, please provide some information about your source and which steps you used to create the database.
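One way to check whether strain ranks made it into an exported nodes.dmp is to tally its rank column. The sketch below is not part of flextaxd; it assumes the usual NCBI layout, where fields are separated by `|` (padded with tabs) and the third field holds the rank. The function name `count_ranks` and the sample lines are made up for illustration.

```python
from collections import Counter


def count_ranks(lines):
    """Tally the rank field (third column) of NCBI-style nodes.dmp lines."""
    ranks = Counter()
    for line in lines:
        # Split on the pipe and strip the tab padding around each field.
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        ranks[fields[2]] += 1
    return ranks


# Hypothetical lines in the assumed layout: tax_id | parent tax_id | rank |
sample = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tspecies\t|",
    "3\t|\t2\t|\tstrain\t|",
]
print(count_ranks(sample))
```

To run it on a real export, pass an open file handle instead of the sample list, for example `count_ranks(open("nodes.dmp"))`. A non-zero count for "strain" indicates that strain nodes were exported.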

punnettsun commented 4 years ago

I have attached a small arbitrary subset below of the file that has strains at the end of each line (Sample_taxonomy.txt). This is not in the correct format, and I was not sure how to keep the strain information without having to delete it. If I delete the strain information like in the Sample_taxonomy_wo_strains.txt file, I am able to use the following commands to generate the NCBI-style nodes.dmp and names.dmp:

flextaxd --taxonomy_file Sample_taxonomy_wo_strains.tsv --taxonomy_type QIIME --database .ftd
flextaxd --dump

Sample_taxonomy.txt Sample_taxonomy_wo_strains.txt

*I have changed tsv files into txt files as GitHub does not allow tsv file attachments.

How do I keep the strain information and build a database with that?

Thank you.

punnettsun commented 4 years ago

I was able to get the strain level using the modification.txt file. I used the following command and now see the strain added:

flextaxd --taxonomy_file Sample_taxonomy_2.tsv --taxonomy_type QIIME --mod_file Sample_modification.txt --genomeid2taxid Sample_accession2taxid.txt --parent "Sample_species" --dump

I suppose my question then becomes how can I do this with just flextaxd --taxonomy_file Sample_taxonomy_2.tsv --taxonomy_type QIIME --database .ftd without having to write a modification.txt file to include all strain names?

The Sample_taxonomy.txt file I attached in my previous comment includes the strain names at the end of each line, but it does not follow the QIIME format because there is a space between "s__species_name" and the strain name. I don't know how to label the strain in my original Sample_taxonomy.txt file. I know domain would be "d__" and phylum would be "p__", but what would strain be?

davve2 commented 4 years ago

OK, this explains it. When reading in QIIME mode this is in fact a limitation for your sample, as I'm using the default levels of the format. I could potentially add the possibility to define further custom levels in the file; however, they would have to follow the same format, for example "x__strain_name".

If you think the modification option is enough I will leave this issue on hold for now, please feel free to reopen the issue if you think this would be a critical feature for usability.

I will add one extra X level option to use x__ when parsing QIIME formatted files.
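For files like the one described earlier, where the strain name follows the species entry after a space, a small preprocessing step could move the strain into its own "x__" level before the file is handed to flextaxd. This is only a sketch under stated assumptions: the lineage is a semicolon-separated string ending in an `s__` entry, the species name itself contains no spaces, and the installed flextaxd version accepts the `x__` level. The function name `add_strain_level` and the example lineage are hypothetical.

```python
def add_strain_level(lineage, prefix="x__", sep="; "):
    """Move a strain name that trails the s__ entry into its own level.

    Assumes the species name has no spaces, so everything after the first
    space in the last level is treated as the strain name.
    """
    levels = [level.strip() for level in lineage.split(";") if level.strip()]
    if not levels or not levels[-1].startswith("s__"):
        return lineage  # no species level at the end; leave untouched
    species, _, strain = levels[-1].partition(" ")
    strain = strain.strip()
    if not strain:
        return lineage  # no trailing strain name; leave untouched
    levels[-1] = species
    # Replace inner spaces so the new level is a single token.
    levels.append(prefix + strain.replace(" ", "_"))
    return sep.join(levels)


# Hypothetical lineage with the strain after a space.
print(add_strain_level("d__Bacteria; p__Proteobacteria; s__Escherichia_coli K-12"))
```

Lines without a trailing strain name are returned unchanged, so the function can be applied to every lineage in a file.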