Iss33 - Githubissues

FOI-Bioinformatics / flextaxd

FlexTaxD (Flexible Taxonomy Databases) - Create, add, merge different taxonomy sources (QIIME, GTDB, NCBI and more) and create metagenomic databases (kraken2, ganon and more )

GNU General Public License v3.0

65 stars 8 forks source link

Iss33 #39

Closed davve2 closed 3 years ago

HitMonk commented 3 years ago

I wasnt sure which thread to continue the conversation in, but since the pervious thread is merged with this, i thought I would add comments and issues in this thread. Please correct me if im wrong.

Thank you for helping me with the new version of FlextaxD. I however still have issues with building the 16s Kraken2 database. I have added the commands that i used below. I built the new 0.3.6 flextaxdb with python since for some reason it wouldnt install with conda.

Version check

prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd --version
custom_taxonomy_databases: version 0.3.6
Maintaner group: FOI bioinformatics group (bioinformatics@foi.se, david.sundell@foi.se)
Github: https://github.com/FOI-Bioinformatics/flextaxd

Follow commands to build Kraken2 database:

awk -F'(' '{print $1}' GTDB_arc_bact_taxo.txt | awk '{print $2" "$3}' | sort | uniq > taxonomy_unique.txt
sed -i 's/Bacteria/root;Bacteria/g' taxonomy_unique.txt
sed -i 's/Archaea/root;Archaea/g' taxonomy_unique.txt

prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db rRNA_custom_tree -tf taxonomy_unique.txt -tt CanSNPer -o taxonomy --dump --dbprogram kraken2
Creating a new FlexTaxD database rRNA_custom_tree using taxonomy_unique.txt, press any key to continue...
Traceback (most recent call last):
  File "/exports/watson/Prateek/apps/miniconda3/lib/python3.8/site-packages/flextaxd-0.3.6-py3.8.egg/flextaxd/modules/database/CreateDatabase.py", line 96, in <module>
    CDB = CreateDatabase()
  File "/exports/watson/Prateek/apps/miniconda3/lib/python3.8/site-packages/flextaxd-0.3.6-py3.8.egg/flextaxd/modules/database/CreateDatabase.py", line 19, in __init__
    super().__init__()
TypeError: super() takes at least 1 argument (0 given)
2021-02-26 19:09:50,900 DatabaseConnection [WARNI]  Error in DatabaseConnection query
2021-02-26 19:09:50,900 DatabaseConnection [WARNI]
                        INSERT INTO nodes(name)
                        VALUES (?)
.
.
.
no such table: nodes
2021-02-26 19:06:23,562 DatabaseConnection [WARNI]  Error in DatabaseConnection query
2021-02-26 19:06:23,562 DatabaseConnection [WARNI]
                        INSERT INTO tree(child,parent,rank_i)
                        VALUES (?,?,?)

no such table: tree
Traceback (most recent call last):
  File "/exports/watson/Prateek/apps/miniconda3/bin/flextaxd", line 33, in <module>
    sys.exit(load_entry_point('flextaxd==0.3.6', 'console_scripts', 'flextaxd')())
  File "/exports/watson/Prateek/apps/miniconda3/lib/python3.8/site-packages/flextaxd-0.3.6-py3.8.egg/flextaxd/custom_taxonomy_databases.py", line 267, in main
    read_obj.parse_taxonomy()                                                           ## Parse taxonomy file
  File "/exports/watson/Prateek/apps/miniconda3/lib/python3.8/site-packages/flextaxd-0.3.6-py3.8.egg/flextaxd/modules/ReadTaxonomyCanSNPer.py", line 106, in parse_taxonomy
    self.length = self.taxid_num - self.root                ## Check number of new nodes added
TypeError: unsupported operand type(s) for -: 'int' and 'OperationalError'

Im not sure if this is a python issue or something that i am doing wrong.

Thank you!

davve2 commented 3 years ago

Hi, thanks for the report! I can unfortunately not reproduce this issue.

Is is possible for you to try installing flextaxd with conda again (It may not have been updated properly since there is some delay before the update is found by conda). It was working for me this morning with this version! I would suggest to create a new environment conda create -n flextaxd flextaxd

If you still have the same issue I will have to find a way to reproduce the issue.

Update:

I tried your solution from three different servers and can still not reproduce the error above. (Using the exact commands posted above) and the GTDB_arc_bact_taxo.txt source file.
I also tried creating a python 3.8 conda environment with flextaxd and no issues (default python with conda is currently python 3.9)

davve2 commented 3 years ago

@HitMonk Any progress?

HitMonk commented 3 years ago

Hello @davve2 , Apologies for the delayed response! I installed FlextaxD through conda and the previous error disappeared. i ran through all the commands that you provided and seem to face the empty kraken2 database issue. The kraken database seems to be created but it looks like its empty. below is the log:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd-create -db rRNA_custom_tree --create_db --db_name krakendb --genomes_path genomes/ --dbprogram kraken2 -o taxonomy --verbose --processes 8
2021-03-02 19:42:04,606 create_databases [INFO ]  FlexTaxD-create logging initiated!
2021-03-02 19:42:04,621 create_databases [INFO ]  Processing files; create kraken seq.map
2021-03-02 19:42:04,622 DatabaseConnection [INFO ]  rRNA_custom_tree opened successfully.
2021-03-02 19:42:04,686 ProcessDirectory [INFO ]  Number of genomes annotated in database 23091
2021-03-02 19:42:04,686 ProcessDirectory [INFO ]  Process genome path (genomes/)
2021-03-02 19:42:04,902 ProcessDirectory [INFO ]  Processed 23092 genomes
2021-03-02 19:42:04,914 create_databases [INFO ]  Loading module: CreateKrakenDatabase
2021-03-02 19:42:04,937 DatabaseConnection [INFO ]  rRNA_custom_tree opened successfully.
2021-03-02 19:42:05,004 CreateKrakenDatabase [INFO ]  krakendb
2021-03-02 19:42:05,013 create_databases [INFO ]  --- process finished in 0 minutes 0.41902661323547363 seconds---

2021-03-02 19:42:05,015 CreateKrakenDatabase [INFO ]  Create library directory
2021-03-02 19:42:05,017 CreateKrakenDatabase [INFO ]  Processing files; create kraken seq.map
2021-03-02 19:44:55,436 CreateKrakenDatabase [INFO ]  Number of genomes succesfully added to the kraken2 database: 23091
2021-03-02 19:44:55,437 create_databases [INFO ]  Genome folder preprocessing completed!
2021-03-02 19:44:55,437 create_databases [INFO ]  --- process finished in 2 minutes 50.843384981155396 seconds---

2021-03-02 19:44:55,437 create_databases [INFO ]  Create database
2021-03-02 19:44:55,438 CreateKrakenDatabase [INFO ]  mkdir -p krakendb/taxonomy
2021-03-02 19:44:55,447 CreateKrakenDatabase [INFO ]  cp taxonomy/*.dmp krakendb/taxonomy
2021-03-02 19:44:55,555 CreateKrakenDatabase [INFO ]  cp taxonomy/*.map krakendb
2021-03-02 19:44:55,555 CreateKrakenDatabase [INFO ]  kraken2-build --build --db krakendb  --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.078s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 10415540 bytes
Capacity estimation complete. [1.089s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk...  complete.
Database files completed. [0.368s]
Database construction complete. [Total: 1.561s]
2021-03-02 19:44:57,207 CreateKrakenDatabase [INFO ]  Create inspect file!
Database disk usage: 12K
After cleaning, database uses 5.5K
2021-03-02 19:44:57,613 CreateKrakenDatabase [INFO ]  kraken2 database created
2021-03-02 19:44:57,614 create_databases [INFO ]  --- Time summary  2 minutes 53.020384311676025 seconds---

But when i look at the database: it looks like its empty, just like before:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ kraken2-inspect --db krakendb/ --use-mpa-style
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 33067
# Table size: 0
# Table capacity: 2603885
# Min clear hash value = 0

Also, the node.dmp file still has the taxonomy encoded as no rank as before, but the seqid2taxid.map file seems like its finding all the correct files? It looks like the Step 3 in kraken building isnt working. For me it shows 0 genomes and 0 bps build, however, your log shows Completed processing of 23091 sequences, 31017889 bp

Do you have any suggestions on where im going wrong? The other commands that i used prior to building the kraken2 database above are below:

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db rRNA_custom_tree -tf taxonomy_unique.txt -tt CanSNPer -o taxonomy --dump --dbprogram kraken2
Creating a new FlexTaxD database rRNA_custom_tree using taxonomy_unique.txt, press any key to continue...
2021-03-02 20:06:17,324 custom_taxonomy_databases [WARNI]  Warning no genomeid2taxid file given!

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db rRNA_custom_tree --genomeid2taxid genomeid2taxid.txt

i have also attached the seqid2taxid.map and node.dmp files, the extensions are changed to .txt so i can upload it :)

Finally, the version of flextaxd for sanity check, (flextaxd) prateek@t620:~$ flextaxd --version custom_taxonomy_databases: version 0.3.6 Maintaner group: FOI bioinformatics group (bioinformatics@foi.se, david.sundell@foi.se) Github: https://github.com/FOI-Bioinformatics/flextaxd

seqid2taxid.txt nodes.txt

davve2 commented 3 years ago

Sorry now it was my time to be a bit delayed!

I can replicate the error you get above and I´m working on a stable solution. I hope to be done before the weekend but it depends if I can identify exactly what goes wrong.

HitMonk commented 3 years ago

Oh if there is anything that you would like from my end: files used or more versions of programs i am more than happy to provide them. Do let me know if there is anything that i can do to simply it! Thank you for your support as always!

davve2 commented 3 years ago

Thanks. I get a lot of support already and by your approach using FlexTaxD. It is difficult to try out all these things alone so I´m greatful that you are patient, thanks a lot! It would aslo make sense to include a 16S guide, which I now plan introduce after we gone through all the errors!

I think that I identified the problem that you found above. This was something that I changed in a coming version but forgot to change. It was an error introduced while loading the database in CanSNPer format where the root node did not get a link to the rest of the tree hopefully this will solve your problem.

100.00  1788792 3062    R       1       root
 90.89  1625831 91245   R1      2707      Bacteria
 17.76  317687  4598    R2      30835       Proteobacteria
 11.01  196872  10817   R3      35572         Gammaproteobacteria
  3.13  56045   2122    R4      37895           Enterobacterales
  1.35  24122   2998    R5      38282             Enterobacteriaceae
  0.24  4361    1169    R6      38330               Buchnera
  0.01  180     180     R7      38376                 Buchnera_aphidicola_V
  0.01  151     151     R7      38374                 Buchnera_aphidicola_T

For the issue with not having taxonomy levels in the data I´m looking into a couple of options, but it is not straight forward problem to try to predict levels when they are not specified in the file. It may require a "depth" to level file to be added.

I will also add an alternative guide to my 16S walkthrough on how to build the 16S database from the GTDB raw files (just adding the genomes). This will solve the level issue, but I´m not sure that I found all errors yet.

Here is the example I did builiding the database using GTDB files directly as source (And then just the genomes prepared with your awk script (adapted the source file to contain .1 extensions instead of _1)

100.00  1788957 0       R       1       root
100.00  1788957 3059    R1      2         cellular_organism
 90.89  1625996 91246   D       4           Bacteria
 17.76  317692  4597    P       1000000       Proteobacteria
 11.01  196875  10817   C       1000001         Gammaproteobacteria
  3.13  56047   2122    O       1000002           Enterobacterales
  1.35  24122   2998    F       1000003             Enterobacteriaceae
  0.24  4361    1169    G       1002166               Buchnera
  0.01  180     180     S       1018814                 Buchnera aphidicola_V
  0.01  151     151     S       1017382                 Buchnera aphidicola_T

Here are the commands that I used to build the above

Final_commands.txt

HitMonk commented 3 years ago

Hello @davve2, ITS WORKING! I dont understand the blackmagic that you used to pull it off, but its working :) I have also made a few modifications to the commands just for those who are new and would like to follow along (folder names, creation of g2id file, etc.)

Also i had 3 other questions for you: 1) Can i see the file format for GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa? I dont completely follow what the difference is between GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa and GTDB_bac120_arc122_ssu_r95_Species.fa

2) I have also created a bracken database using the kraken2 database. However, when I try to run the bracken program, i get this error. I remember you mentioned that some flag had to be provided. Could you please let me know how i can correct this?

prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/GTDB_arc_bact_taxo_krakendb/ -i 1766.krep -o 1766.bout -w 1766.brep -l G
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i 1766.krep -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/GTDB_arc_bact_taxo_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-08-2021 09:59:22
Error: no reads found. Please check your Kraken report

3) Finally, the nodes.dmp file that is created under the taxonomy folder still shows no rank. Is this acceptable?

1   |   1   |   no rank |       |       |   
2   |   1   |   no rank |       |       |   
3   |   2   |   no rank |       |       |   
4   |   3   |   no rank |       |       |   
5   |   4   |   no rank |       |       |   
6   |   5   |   no rank |       |       |   
7   |   6   |   no rank |       |       |

I personally dont think its a big issue, since kraken2-inspect shows the taxonomic ranks pretty well!

# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 33066
# Table size: 1789397
# Table capacity: 2603885
# Min clear hash value = 0
100.00  1789397 3061    R   1   root
 90.89  1626413 91289   R1  2707      Bacteria
 17.76  317765  4597    R2  30835       Proteobacteria
 11.01  196933  10820   R3  35572         Gammaproteobacteria
  3.13  56062   2121    R4  37895           Enterobacterales
  1.35  24129   2998    R5  38282             Enterobacteriaceae
  0.24  4361    1169    R6  38330               Buchnera
  0.01  180 180 R7  38376                 Buchnera_aphidicola_V
  0.01  151 151 R7  38374                 Buchnera_aphidicola_T
  0.01  145 145 R7  38333                 Buchnera_aphidicola_AB
  0.01  136 136 R7  38348                 Buchnera_aphidicola_AQ
  0.01  127 127 R7  38337                 Buchnera_aphidicola_AF
  0.01  117 117 R7  38363                 Buchnera_aphidicola_H
  0.01  112 112 R7  38378                 Buchnera_aphidicola_X
  0.01  110 110 R7  38366                 Buchnera_aphidicola_K
  0.01  108 108 R7  38369                 Buchnera_aphidicola_N
  0.01  108 108 R7  38368                 Buchnera_aphidicola_M
  0.01  92  92  R7  38362                 Buchnera_aphidicola_G
  0.00  85  85  R7  38351                 Buchnera_aphidicola_AT

Thank you for all your support! commands_flextaxd_K2.txt

davve2 commented 3 years ago

That is great news!

I think that the levels are required for bracken , so I would suggest to build the database with the second approach presented here 16S walkthough this will preserve the rank levels.

One other option is to export your tree into tab format

flextaxd -db *.db --dump_mini --dump_descriptions

Add columns parent_i child_i rank
Manually add required levels under rank.
rebuild database

I plan to implement a function that allows custom ranks to be added to a non-ranked file, but it is not in my priority list at the moment. And I think guessing the rank based on the number of levels in the a non ranked tree will be too difficult.

HitMonk commented 3 years ago

I havent yet tried the second way, but i would like to. However i dont completely understand what the difference is between GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa and GTDB_bac120_arc122_ssu_r95_Species.fa. The very first command that you use is the: sed -i 's/_1 /.1 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa Here, is the file just a renamed copy of the GTDB_bac120_arc122_ssu_r95_Species.fa file?

davve2 commented 3 years ago

Ah sorry you asked for that in your last post,

I have implemented the default process processing GTDB to rename the genomes into the standard GCF/GCA name format so the genomes only contain the GCF number and not the source that GTDB uses as well as using .1 not _1. Since I don´t have as good skills as you seem to have in awk the simplest solution for me was to prepare the name in the header before using the awk line to split the files. the _GTDB is the same file but the fasta file header (which in the end is the name of the file) will be

GTDB_bac120_arc122_ssu_r95_Species.fa

>GB_GCF_XXX_1

GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa

>GCF_XXX.1

I suppose one option is to use the --skip_annotation in the step where you build the GTDB databases, and then load the g2id.txt

flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
flextaxd -db bac120.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB --skip_annotation
flextaxd -db ar122.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB --skip_annotation

However now this file instead have to be chaged to standard GB/RS format hence GCF_XXXXXX.1 and not GB_GCF_XXXXXX_1 And then before dumping the database add

## Since the 16S database replaced the space in the name with _ we need to reverse this name change to match the node names from GTDB
sed -i -E 's/([a-z].+)_(.+[a-z])/\1 \2/g' g2id_GTDB.txt

flextaxd -db GTDB_merge_v2.db --genomeid2taxid g2id_GTDB.txt --verbose
2021-03-08 16:10:09,220 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
2021-03-08 16:10:09,222 ModifyTree [INFO ]  Modify Tree
2021-03-08 16:10:09,222 DatabaseConnection [INFO ]  GTDB_merge_v2.db opened successfully.
2021-03-08 16:10:09,263 ModifyTree [INFO ]  Taxid base updated!
2021-03-08 16:10:09,263 ModifyTree [INFO ]  Update genome to taxid annotations using g2id_GTDB.txt
2021-03-08 16:10:09,479 ModifyTree [INFO ]  18595 added and 0 genome annotations were updated!
2021-03-08 16:10:09,479 custom_taxonomy_databases [INFO ]  --- Time summary  0 minutes 0.26532745361328125 seconds---

then the files will be treated like any fasta file

davve2 commented 3 years ago

Observe that sed solutions are only suggestions, they probably don´t cover the whole complexity of name changes!

davve2 commented 3 years ago

Example

>GB_GCA_000010565_1 Pelotomaculum thermopropionicum
>GB_GCA_000018565_1 Herpetosiphon aurantiacus
>GB_GCA_000024525_1 Spirosoma linguale
>GB_GCA_000091165_1 Methylomirabilis oxyfera_B
>GB_GCA_000146855_1 Peptoanaerobacter margaretiae
>GB_GCA_000147015_1 Zinderia insecticola
>GB_GCA_000163995_1 Campylobacter_D jejuni_A
>GB_GCA_000165065_1 Longicatena sp000165065
>GB_GCA_000166295_1 Marinobacter adhaerens
>GB_GCA_000168735_1 Endoriftia persephone

File folder

ls GBRS_genomes/ | head
GB_GCA_000007185_1.fasta
GB_GCA_000007345_1.fasta
GB_GCA_000008085_1.fasta
GB_GCA_000010565_1.fasta
GB_GCA_000011125_1.fasta
GB_GCA_000014945_1.fasta
GB_GCA_000015805_1.fasta
GB_GCA_000018465_1.fasta
GB_GCA_000018565_1.fasta
GB_GCA_000019805_1.fasta

I actually removed the first letters by changing the substring in the walkthrough, so in the file I have they are still there, but the _1 is replaced by .1

>GB_GCA_000010565.1 Pelotomaculum thermopropionicum
>GB_GCA_000018565.1 Herpetosiphon aurantiacus
>GB_GCA_000024525.1 Spirosoma linguale
>GB_GCA_000091165.1 Methylomirabilis oxyfera_B
>GB_GCA_000146855.1 Peptoanaerobacter margaretiae
>GB_GCA_000147015.1 Zinderia insecticola
>GB_GCA_000163995.1 Campylobacter_D jejuni_A
>GB_GCA_000165065.1 Longicatena sp000165065
>GB_GCA_000166295.1 Marinobacter adhaerens
>GB_GCA_000168735.1 Endoriftia persephone

File folder

ls GTDB_genomes/ | head
GCA_000007185.1.fasta
GCA_000007345.1.fasta
GCA_000008085.1.fasta
GCA_000010565.1.fasta
GCA_000011125.1.fasta
GCA_000014945.1.fasta
GCA_000015805.1.fasta
GCA_000018465.1.fasta
GCA_000018565.1.fasta
GCA_000019805.1.fasta

HitMonk commented 3 years ago

ah okay, that makes sense. I started working on the version 2 way to setup the database. However, i think i have hit a formatting error with the base_tree.txt file. Here is what i get when i just copy paste the lines required in the base tree file, i have also attached the base_tree.txt file that i used base_tree.txt

(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb/v2$ flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
2021-03-08 17:53:47,110 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Creating a new FlexTaxD database base.db using base_tree.txt, press any key to continue...
2021-03-08 17:53:48,753 custom_taxonomy_databases [INFO ]  Loading module: ReadTaxonomy
2021-03-08 17:53:48,904 DatabaseConnection [INFO ]  base.db opened successfully.
2021-03-08 17:53:48,912 custom_taxonomy_databases [INFO ]  Parse taxonomy
2021-03-08 17:53:48,912 ReadTaxonomy [INFO ]  Read nodes in taxonomy file False
Traceback (most recent call last):
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/bin/flextaxd", line 10, in <module>
    sys.exit(main())
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/custom_taxonomy_databases.py", line 267, in main
    read_obj.parse_taxonomy()                                                           ## Parse taxonomy file
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 48, in parse_taxonomy
    self.read_nodes(treefile=treefile)
  File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 99, in read_nodes
    raise InputError("Your input tree file does not contain the headers to specify child and parent!")
modules.ReadTaxonomy.InputError: Your input tree file does not contain the headers to specify child and parent!

I think there are issues with tabs and spaces in the base_tree.txt file that is confusing flextaxd? would you be willing to share your base tree file as an attachment? Sorry to bother you! but i do want to get it right, seeing that the version 1 commands work so well!

davve2 commented 3 years ago

Ah yes, there is a format documentation formats where this is explained. While importing a tree in the format that flextaxd uses as its standard (two or three column) a header is required to determine which column is the parent and which is the child. NCBI uses child parent whereas in my experience the parent child order is most common at least in the network community.

Tab delimeted taxonomy tree (FlexTaxD standard format)

The tree must contain headers to determine which of the columns contains the parent and header, the level column is optional

<header>parent                          \tchild                                     \tlevel\n
<node>Francisella tularensis             \tFrancisella tularensis tularensis         \tsubspecies\n
<node>Francisella tularensis tularensis  \tB.6                                       \tsubsubspecies\n
<node>Francisella tularensis             \tFrancisella tularensis holarctica         \tsubspecies\n

HitMonk commented 3 years ago

Ah the tabs were indeed the issue.. I was confused about adding a tab or a space between Bacterial and domain. The version 2 path works much better actually, and is a lot more straightforward.

Unfortunately, i face the same "no reads found in the kraken report" error when i use the bracken database. I used to hit the same error even when i built the whole genome bracken database. However, i didnt really see this error when i built the whole genome database using Struo. Do you have any suggestions on why this might be happening?


bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.krep -o 1766.bout -w 1766.brep -l G
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i 1766.krep -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-09-2021 10:48:02
Error: no reads found. Please check your Kraken report

I can see my kraken report file has the read numbers displayed but for some reason its not being picked up

head 1766.krep 
  0.01  5   5   U   0   unclassified
 99.99  49210   0   R   1   root
 99.99  49210   220 R1  2707      Bacteria
 74.97  36897   0   R2  18158       Firmicutes_A
 74.72  36773   684 R3  18159         Clostridia
 38.86  19125   15  R4  18834           Lachnospirales
 38.71  19050   7481    R5  18897             Lachnospiraceae
  8.82  4343    4175    R6  18991               Blautia_A
  0.28  138 138 R7  19007                 Blautia_A_sp900066165
  0.03  15  15  R7  19011                 Blautia_A_sp900066505

davve2 commented 3 years ago

It still looks like the levels are not correct in the database. Did you base it on GTDB? Instead of F, G, S I can see R5,R6,R7

I have not used bracken myself actually, but perhaps choosing level "R6" instead of G could solve the problem unless bracken requires the standard levels.

HitMonk commented 3 years ago

Yes, i pretty much followed the walk through. Also, as i said i noticed this error when i built whole genome databases with GTDB genomes. It only happens with bracken too,there are no issues with running kraken2. I also dont understand why the levels say R2-R7... that is indeed odd...

The levels do look weird. here is an example where the levels jump from family to genus/ species. I also used R5-R6 but the error stays the same.

kraken2-inspect --db /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ --use-mpa-style
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales  1426
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10   356
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10|s__GW2011-AR10 sp000830275    356
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae   314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum   314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum|s__Iainarchaeum andersonii    314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GCA-2688035   311

i havent had a chance yet, but ill try to generate the same with the whole genome datasets. I deleted the last build of my whole genome datasets when i couldnt solve the no reads found error. Well, at the very least i can still use kraken2, and for that i am very grateful. Thank you!

davve2 commented 3 years ago

In my database the levels are kept, can you check how the dump would look from the indivdual databases bac120 and ar122?

Also how does the dump look are the levels in the nodes.dmp, perhaps you forgot to dump the new database before building your kraken2 database (I did that many times).

Using R6 instead of G seems to work for me (as well as using G in the GTDB_orig database that I´ve made

python src/est_abundance.py -i MOCK_kraken2_rawreads_0.0.report -o test.bout -k T_v2_krakendb/database1550mers.kmer_distrib -l R6 -t 0
PROGRAM START TIME: 03-09-2021 11:26:48
BRACKEN SUMMARY (Kraken report: kraken2/MOCK_kraken2_rawreads_0.0.report)
    >>> Threshold: 0
    >>> Number of R6 in sample: 805
          >> Number of R6 with reads > threshold: 805
          >> Number of R6 with reads < threshold: 0
    >>> Total reads in sample: 217994
          >> Total reads kept at R6 level (reads > threshold): 200133
          >> Total reads discarded (R6 reads < threshold): 0
          >> Reads distributed: 0
          >> Reads not distributed (eg. no R6 above threshold): 11930
          >> Unclassified reads: 5931
BRACKEN OUTPUT PRODUCED: test.bout
PROGRAM END TIME: 03-09-2021 11:26:48
  Bracken complete.

davve2 commented 3 years ago

I can see the same jump

kraken2-inspect --db GTDB_orig_krakendb_test2/ --use-mpa-style | grep "p__Iainarchaeota" | head
k__Archaea|p__Iainarchaeota     2597
k__Archaea|p__Iainarchaeota|c__Iainarchaeia     2597
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales   1426
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10    356
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10|s__GW2011-AR10 sp000830275 356
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae        314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum        314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum|s__Iainarchaeum andersonii     314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GCA-2688035    311

davve2 commented 3 years ago

It certainly looks like an error

GB_GCA_013332025.1      d__Archaea;p__Iainarchaeota;c__Iainarchaeia;o__Iainarchaeales;f__GW2011-AR10;g__GW2011-AR10;s__GW2011-AR10 sp000830275
GB_GCA_000830275.1      d__Archaea;p__Iainarchaeota;c__Iainarchaeia;o__Iainarchaeales;f__GW2011-AR10;g__GW2011-AR10;s__GW2011-AR10 sp000830275

But it is caused by identical names for the species and family nodes I have noticed this problem but have not yet implemented a solution #37

fGW2011-AR10;gGW2011-AR10

HitMonk commented 3 years ago

huh strange. Ill just rebuild the database again. but my taxonomy files look clean;

flextaxd -db ../bac120.db --dump -o trial --dbprogram kraken2 --verbose --logs logs/GTDB
2021-03-09 12:39:21,393 custom_taxonomy_databases [INFO ]  FlexTaxD logging initiated!
Warning: {names} and/or {nodes} already exists, overwrite? (y/n): y
2021-03-09 12:39:23,259 custom_taxonomy_databases [INFO ]  Loading module: WriteTaxonomy
2021-03-09 12:39:23,271 DatabaseConnection [INFO ]  ../bac120.db opened successfully.
2021-03-09 12:39:23,271 WriteTaxonomy [INFO ]  Write tree to: trial/nodes.dmp
2021-03-09 12:39:23,639 WriteTaxonomy [INFO ]  Write annotations to: trial/names.dmp
2021-03-09 12:39:23,968 custom_taxonomy_databases [INFO ]  --- Time summary  0 minutes 2.591289520263672 seconds---
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb$ more trial/nodes.dmp 
1   |   1   |   no rank |       |       |   
2   |   2   |   no rank |       |       |   
3   |   2   |   no rank |       |       |   
4   |   3   |   superkingdom    |       |       |   
5   |   3   |   superkingdom    |       |       |   
6   |   3   |   superkingdom    |       |       |   
7   |   2   |   superkingdom    |       |       |   
8   |   2   |   no rank |       |       |   
9   |   2   |   no rank |       |       |   
10  |   4   |   phylum  |       |       |   
11  |   10  |   class   |       |       |   
12  |   11  |   order   |       |       |   
13  |   12  |   family  |       |       |   
14  |   13  |   genus   |       |       |

HitMonk commented 3 years ago

i rebuilt the database, and now i dont see the R4-R7 in my report files anymore! its the normal initials. I however hit the same read error issue though... How did you build your bracken database btw? Maybe its because im using the default bracken-build command?

1) Build bracken database: Copy over genomes to library, and unzip seqid2taxid.map.gz

cp -r genomes/ GTDB_orig_krakendb/
cd GTDB_orig_krakendb/; mv genomes/ library
gunzip seqid2taxid.map.gz

2) Build bracken database:

bracken-build -d GTDB_orig_krakendb/  -l 250  
 >> Selected Options:
       kmer length = 35
       read length = 250
       database    = GTDB_orig_krakendb/
       threads     = 1
 >> Checking for Valid Options...
 >> Creating database.kraken [if not found]
          database.kraken exists, skipping creation....
          Finished creating database.kraken [in DB folder]
 >> Creating database250mers.kmer_distrib 
    >>STEP 0: PARSING COMMAND LINE ARGUMENTS
        Taxonomy nodes file: GTDB_orig_krakendb/taxonomy/nodes.dmp
        Seqid file:          GTDB_orig_krakendb/seqid2taxid.map
        Num Threads:         1
        Kmer Length:         35
        Read Length:         250
    >>STEP 1: READING SEQID2TAXID MAP
        23091 total sequences read
    >>STEP 2: READING NODES.DMP FILE
        43029 total nodes read
    >>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
        250mers, with a database built using 35mers
        23093 sequences converted (finished: )S_GCF_000576305.1)
    Time Elaped: 0 minutes, 37 seconds, 0.00000 microseconds
    =============================
PROGRAM START TIME: 03-09-2021 11:57:43
...1 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 03-09-2021 11:57:43
          Finished creating database250mers.kraken and database250mers.kmer_distrib [in DB folder]
          *NOTE: to create read distribution files for multiple read lengths, 
                 rerun this script specifying the same database but a different read length

Bracken build complete.

3) generate kraken and bracken results:

kraken2 --db /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ --threads 10 --output 1766.k2 --report 1766.krep 22002005521766_F_filt.fastq 
Loading database information... done.
49215 sequences (12.30 Mbp) processed in 0.295s (10013.1 Kseq/m, 2503.27 Mbp/m).
  49210 sequences classified (99.99%)
  5 sequences unclassified (0.01%)
(flextaxd) prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.k2 -o 1766.bout -w 1766.brep -l R5
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i 1766.k2 -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l R5 -t 10
PROGRAM START TIME: 03-09-2021 12:07:23
Error: no reads found. Please check your Kraken report
(flextaxd) prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.k2 -o 1766.bout -w 1766.brep -l G
 >> Checking for Valid Options...
 >> Running Bracken 
      >> python src/est_abundance.py -i 1766.k2 -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-09-2021 12:07:26
Error: no reads found. Please check your Kraken report

4) view the report file:

 0.01   5   5   U   0   unclassified
 99.99  49210   0   R   1   root
 99.99  49210   0   R1  2     cellular_organism
 99.99  49210   220 R2  4       Bacteria
 74.97  36897   0   P   1000034       Firmicutes_A
 74.72  36773   684 C   1000035         Clostridia
 38.86  19125   15  O   1000404           Lachnospirales
 38.71  19050   7487    F   1000405             Lachnospiraceae
  8.82  4343    4175    G   1000626               Blautia_A
  0.28  138 138 S   1003577                 Blautia_A sp900066165

It is good news that you can use it, that just means im messing up somewhere... :(

davve2 commented 3 years ago

I´m not sure if there is any difference on moving the genomes back into the kraken or just keeping the library?

If you add --keep when running flextaxd-create both the taxonomy directory and the library will be kept and you can run bracken-build directly.

HitMonk commented 3 years ago

Ok, i dont understand the why, but rebuilding the database with --keep did the trick! I have kraken2+bracken pipeline working flawlessly now! I cant thank you enough for all your support. Please do let me know if you think i can help in the future with any kind of testing at all!

If you were open to a suggestion, i would suggest to reach out to the GTDB people and mention that you have got a working pipeline to build kraken2 compatible 16s GTDB database. I will also spread the word :)

Hope you have a great week ahead!

until next time :)

davve2 commented 3 years ago

That is great news :D

Excellent, I will update the walkthrough with some more explanatory text and make sure the --keep option is also well displayed. It would be great if you could have a look at it later and see if it is easy to follow, since you now have some experience of details and where things can easilly be misunderstood!

That is a good suggestion thank you, I will try to form the walkthrough after the process we went through here!

Thanks a lot and good luck with the downstream analysis!

HitMonk commented 3 years ago

I can do that, Ill also ask a couple of my friends to try it out and make sure it works in with different servers.

speaking on downstream analysis, do you use kraken-biom to merge all the report files? Im not sure why, but the number of OTUs drops a lot for me once i use kraken-biom to merge the reports. I generally load the biom file into phyloseq. Do you have an alternate suggestion?

davve2 commented 3 years ago

That sounds wonderful thanks a lot!

I have not used kraken2 for 16S analysis yet. Although this is a future task, at the moment I can´t give a good answer to your OTU question at the moment.