Closed davve2 closed 3 years ago
Hi, thanks for the report! I can unfortunately not reproduce this issue.
Is is possible for you to try installing flextaxd with conda again (It may not have been updated properly since there is some delay before the update is found by conda). It was working for me this morning with this version! I would suggest to create a new environment conda create -n flextaxd flextaxd
If you still have the same issue I will have to find a way to reproduce the issue.
Update:
I tried your solution from three different servers and can still not reproduce the error above. (Using the exact commands posted above) and the GTDB_arc_bact_taxo.txt
source file.
I also tried creating a python 3.8 conda environment with flextaxd and no issues (default python with conda is currently python 3.9)
@HitMonk Any progress?
Hello @davve2 , Apologies for the delayed response! I installed FlextaxD through conda and the previous error disappeared. i ran through all the commands that you provided and seem to face the empty kraken2 database issue. The kraken database seems to be created but it looks like its empty. below is the log:
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd-create -db rRNA_custom_tree --create_db --db_name krakendb --genomes_path genomes/ --dbprogram kraken2 -o taxonomy --verbose --processes 8
2021-03-02 19:42:04,606 create_databases [INFO ] FlexTaxD-create logging initiated!
2021-03-02 19:42:04,621 create_databases [INFO ] Processing files; create kraken seq.map
2021-03-02 19:42:04,622 DatabaseConnection [INFO ] rRNA_custom_tree opened successfully.
2021-03-02 19:42:04,686 ProcessDirectory [INFO ] Number of genomes annotated in database 23091
2021-03-02 19:42:04,686 ProcessDirectory [INFO ] Process genome path (genomes/)
2021-03-02 19:42:04,902 ProcessDirectory [INFO ] Processed 23092 genomes
2021-03-02 19:42:04,914 create_databases [INFO ] Loading module: CreateKrakenDatabase
2021-03-02 19:42:04,937 DatabaseConnection [INFO ] rRNA_custom_tree opened successfully.
2021-03-02 19:42:05,004 CreateKrakenDatabase [INFO ] krakendb
2021-03-02 19:42:05,013 create_databases [INFO ] --- process finished in 0 minutes 0.41902661323547363 seconds---
2021-03-02 19:42:05,015 CreateKrakenDatabase [INFO ] Create library directory
2021-03-02 19:42:05,017 CreateKrakenDatabase [INFO ] Processing files; create kraken seq.map
2021-03-02 19:44:55,436 CreateKrakenDatabase [INFO ] Number of genomes succesfully added to the kraken2 database: 23091
2021-03-02 19:44:55,437 create_databases [INFO ] Genome folder preprocessing completed!
2021-03-02 19:44:55,437 create_databases [INFO ] --- process finished in 2 minutes 50.843384981155396 seconds---
2021-03-02 19:44:55,437 create_databases [INFO ] Create database
2021-03-02 19:44:55,438 CreateKrakenDatabase [INFO ] mkdir -p krakendb/taxonomy
2021-03-02 19:44:55,447 CreateKrakenDatabase [INFO ] cp taxonomy/*.dmp krakendb/taxonomy
2021-03-02 19:44:55,555 CreateKrakenDatabase [INFO ] cp taxonomy/*.map krakendb
2021-03-02 19:44:55,555 CreateKrakenDatabase [INFO ] kraken2-build --build --db krakendb --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map complete. [0.078s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 10415540 bytes
Capacity estimation complete. [1.089s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk... complete.
Database files completed. [0.368s]
Database construction complete. [Total: 1.561s]
2021-03-02 19:44:57,207 CreateKrakenDatabase [INFO ] Create inspect file!
Database disk usage: 12K
After cleaning, database uses 5.5K
2021-03-02 19:44:57,613 CreateKrakenDatabase [INFO ] kraken2 database created
2021-03-02 19:44:57,614 create_databases [INFO ] --- Time summary 2 minutes 53.020384311676025 seconds---
But when i look at the database: it looks like its empty, just like before:
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ kraken2-inspect --db krakendb/ --use-mpa-style
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 33067
# Table size: 0
# Table capacity: 2603885
# Min clear hash value = 0
Also, the node.dmp file still has the taxonomy encoded as no rank as before, but the seqid2taxid.map file seems like its finding all the correct files? It looks like the Step 3 in kraken building isnt working. For me it shows 0 genomes and 0 bps build, however, your log shows
Completed processing of 23091 sequences, 31017889 bp
Do you have any suggestions on where im going wrong? The other commands that i used prior to building the kraken2 database above are below:
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db rRNA_custom_tree -tf taxonomy_unique.txt -tt CanSNPer -o taxonomy --dump --dbprogram kraken2
Creating a new FlexTaxD database rRNA_custom_tree using taxonomy_unique.txt, press any key to continue...
2021-03-02 20:06:17,324 custom_taxonomy_databases [WARNI] Warning no genomeid2taxid file given!
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb$ flextaxd -db rRNA_custom_tree --genomeid2taxid genomeid2taxid.txt
i have also attached the seqid2taxid.map and node.dmp files, the extensions are changed to .txt so i can upload it :)
Finally, the version of flextaxd for sanity check, (flextaxd) prateek@t620:~$ flextaxd --version custom_taxonomy_databases: version 0.3.6 Maintaner group: FOI bioinformatics group (bioinformatics@foi.se, david.sundell@foi.se) Github: https://github.com/FOI-Bioinformatics/flextaxd
Sorry now it was my time to be a bit delayed!
I can replicate the error you get above and I´m working on a stable solution. I hope to be done before the weekend but it depends if I can identify exactly what goes wrong.
Oh if there is anything that you would like from my end: files used or more versions of programs i am more than happy to provide them. Do let me know if there is anything that i can do to simply it! Thank you for your support as always!
Thanks. I get a lot of support already and by your approach using FlexTaxD. It is difficult to try out all these things alone so I´m greatful that you are patient, thanks a lot! It would aslo make sense to include a 16S guide, which I now plan introduce after we gone through all the errors!
I think that I identified the problem that you found above. This was something that I changed in a coming version but forgot to change. It was an error introduced while loading the database in CanSNPer format where the root node did not get a link to the rest of the tree hopefully this will solve your problem.
100.00 1788792 3062 R 1 root
90.89 1625831 91245 R1 2707 Bacteria
17.76 317687 4598 R2 30835 Proteobacteria
11.01 196872 10817 R3 35572 Gammaproteobacteria
3.13 56045 2122 R4 37895 Enterobacterales
1.35 24122 2998 R5 38282 Enterobacteriaceae
0.24 4361 1169 R6 38330 Buchnera
0.01 180 180 R7 38376 Buchnera_aphidicola_V
0.01 151 151 R7 38374 Buchnera_aphidicola_T
For the issue with not having taxonomy levels in the data I´m looking into a couple of options, but it is not straight forward problem to try to predict levels when they are not specified in the file. It may require a "depth" to level file to be added.
I will also add an alternative guide to my 16S walkthrough on how to build the 16S database from the GTDB raw files (just adding the genomes). This will solve the level issue, but I´m not sure that I found all errors yet.
Here is the example I did builiding the database using GTDB files directly as source (And then just the genomes prepared with your awk script (adapted the source file to contain .1 extensions instead of _1)
100.00 1788957 0 R 1 root
100.00 1788957 3059 R1 2 cellular_organism
90.89 1625996 91246 D 4 Bacteria
17.76 317692 4597 P 1000000 Proteobacteria
11.01 196875 10817 C 1000001 Gammaproteobacteria
3.13 56047 2122 O 1000002 Enterobacterales
1.35 24122 2998 F 1000003 Enterobacteriaceae
0.24 4361 1169 G 1002166 Buchnera
0.01 180 180 S 1018814 Buchnera aphidicola_V
0.01 151 151 S 1017382 Buchnera aphidicola_T
Here are the commands that I used to build the above
Hello @davve2, ITS WORKING! I dont understand the blackmagic that you used to pull it off, but its working :) I have also made a few modifications to the commands just for those who are new and would like to follow along (folder names, creation of g2id file, etc.)
Also i had 3 other questions for you: 1) Can i see the file format for GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa? I dont completely follow what the difference is between GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa and GTDB_bac120_arc122_ssu_r95_Species.fa
2) I have also created a bracken database using the kraken2 database. However, when I try to run the bracken program, i get this error. I remember you mentioned that some flag had to be provided. Could you please let me know how i can correct this?
prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/GTDB_arc_bact_taxo_krakendb/ -i 1766.krep -o 1766.bout -w 1766.brep -l G
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i 1766.krep -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/GTDB_arc_bact_taxo_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-08-2021 09:59:22
Error: no reads found. Please check your Kraken report
3) Finally, the nodes.dmp file that is created under the taxonomy folder still shows no rank. Is this acceptable?
1 | 1 | no rank | | |
2 | 1 | no rank | | |
3 | 2 | no rank | | |
4 | 3 | no rank | | |
5 | 4 | no rank | | |
6 | 5 | no rank | | |
7 | 6 | no rank | | |
I personally dont think its a big issue, since kraken2-inspect shows the taxonomic ranks pretty well!
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 33066
# Table size: 1789397
# Table capacity: 2603885
# Min clear hash value = 0
100.00 1789397 3061 R 1 root
90.89 1626413 91289 R1 2707 Bacteria
17.76 317765 4597 R2 30835 Proteobacteria
11.01 196933 10820 R3 35572 Gammaproteobacteria
3.13 56062 2121 R4 37895 Enterobacterales
1.35 24129 2998 R5 38282 Enterobacteriaceae
0.24 4361 1169 R6 38330 Buchnera
0.01 180 180 R7 38376 Buchnera_aphidicola_V
0.01 151 151 R7 38374 Buchnera_aphidicola_T
0.01 145 145 R7 38333 Buchnera_aphidicola_AB
0.01 136 136 R7 38348 Buchnera_aphidicola_AQ
0.01 127 127 R7 38337 Buchnera_aphidicola_AF
0.01 117 117 R7 38363 Buchnera_aphidicola_H
0.01 112 112 R7 38378 Buchnera_aphidicola_X
0.01 110 110 R7 38366 Buchnera_aphidicola_K
0.01 108 108 R7 38369 Buchnera_aphidicola_N
0.01 108 108 R7 38368 Buchnera_aphidicola_M
0.01 92 92 R7 38362 Buchnera_aphidicola_G
0.00 85 85 R7 38351 Buchnera_aphidicola_AT
Thank you for all your support! commands_flextaxd_K2.txt
That is great news!
I think that the levels are required for bracken , so I would suggest to build the database with the second approach presented here 16S walkthough this will preserve the rank levels.
One other option is to export your tree into tab format
flextaxd -db *.db --dump_mini --dump_descriptions
I plan to implement a function that allows custom ranks to be added to a non-ranked file, but it is not in my priority list at the moment. And I think guessing the rank based on the number of levels in the a non ranked tree will be too difficult.
I havent yet tried the second way, but i would like to. However i dont completely understand what the difference is between GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa and GTDB_bac120_arc122_ssu_r95_Species.fa.
The very first command that you use is the:
sed -i 's/_1 /.1 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
Here, is the file just a renamed copy of the GTDB_bac120_arc122_ssu_r95_Species.fa file?
Ah sorry you asked for that in your last post,
I have implemented the default process processing GTDB to rename the genomes into the standard GCF/GCA name format so the genomes only contain the GCF number and not the source that GTDB uses as well as using .1 not _1. Since I don´t have as good skills as you seem to have in awk the simplest solution for me was to prepare the name in the header before using the awk line to split the files. the _GTDB is the same file but the fasta file header (which in the end is the name of the file) will be
GTDB_bac120_arc122_ssu_r95_Species.fa
>GB_GCF_XXX_1
GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
>GCF_XXX.1
I suppose one option is to use the --skip_annotation in the step where you build the GTDB databases, and then load the g2id.txt
flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
flextaxd -db bac120.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB --skip_annotation
flextaxd -db ar122.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB --skip_annotation
However now this file instead have to be chaged to standard GB/RS format hence GCF_XXXXXX.1 and not GB_GCF_XXXXXX_1 And then before dumping the database add
## Since the 16S database replaced the space in the name with _ we need to reverse this name change to match the node names from GTDB
sed -i -E 's/([a-z].+)_(.+[a-z])/\1 \2/g' g2id_GTDB.txt
flextaxd -db GTDB_merge_v2.db --genomeid2taxid g2id_GTDB.txt --verbose
2021-03-08 16:10:09,220 custom_taxonomy_databases [INFO ] FlexTaxD logging initiated!
2021-03-08 16:10:09,222 ModifyTree [INFO ] Modify Tree
2021-03-08 16:10:09,222 DatabaseConnection [INFO ] GTDB_merge_v2.db opened successfully.
2021-03-08 16:10:09,263 ModifyTree [INFO ] Taxid base updated!
2021-03-08 16:10:09,263 ModifyTree [INFO ] Update genome to taxid annotations using g2id_GTDB.txt
2021-03-08 16:10:09,479 ModifyTree [INFO ] 18595 added and 0 genome annotations were updated!
2021-03-08 16:10:09,479 custom_taxonomy_databases [INFO ] --- Time summary 0 minutes 0.26532745361328125 seconds---
then the files will be treated like any fasta file
Observe that sed solutions are only suggestions, they probably don´t cover the whole complexity of name changes!
>GB_GCA_000010565_1 Pelotomaculum thermopropionicum
>GB_GCA_000018565_1 Herpetosiphon aurantiacus
>GB_GCA_000024525_1 Spirosoma linguale
>GB_GCA_000091165_1 Methylomirabilis oxyfera_B
>GB_GCA_000146855_1 Peptoanaerobacter margaretiae
>GB_GCA_000147015_1 Zinderia insecticola
>GB_GCA_000163995_1 Campylobacter_D jejuni_A
>GB_GCA_000165065_1 Longicatena sp000165065
>GB_GCA_000166295_1 Marinobacter adhaerens
>GB_GCA_000168735_1 Endoriftia persephone
File folder
ls GBRS_genomes/ | head
GB_GCA_000007185_1.fasta
GB_GCA_000007345_1.fasta
GB_GCA_000008085_1.fasta
GB_GCA_000010565_1.fasta
GB_GCA_000011125_1.fasta
GB_GCA_000014945_1.fasta
GB_GCA_000015805_1.fasta
GB_GCA_000018465_1.fasta
GB_GCA_000018565_1.fasta
GB_GCA_000019805_1.fasta
I actually removed the first letters by changing the substring in the walkthrough, so in the file I have they are still there, but the _1 is replaced by .1
>GB_GCA_000010565.1 Pelotomaculum thermopropionicum
>GB_GCA_000018565.1 Herpetosiphon aurantiacus
>GB_GCA_000024525.1 Spirosoma linguale
>GB_GCA_000091165.1 Methylomirabilis oxyfera_B
>GB_GCA_000146855.1 Peptoanaerobacter margaretiae
>GB_GCA_000147015.1 Zinderia insecticola
>GB_GCA_000163995.1 Campylobacter_D jejuni_A
>GB_GCA_000165065.1 Longicatena sp000165065
>GB_GCA_000166295.1 Marinobacter adhaerens
>GB_GCA_000168735.1 Endoriftia persephone
File folder
ls GTDB_genomes/ | head
GCA_000007185.1.fasta
GCA_000007345.1.fasta
GCA_000008085.1.fasta
GCA_000010565.1.fasta
GCA_000011125.1.fasta
GCA_000014945.1.fasta
GCA_000015805.1.fasta
GCA_000018465.1.fasta
GCA_000018565.1.fasta
GCA_000019805.1.fasta
ah okay, that makes sense. I started working on the version 2 way to setup the database. However, i think i have hit a formatting error with the base_tree.txt file. Here is what i get when i just copy paste the lines required in the base tree file, i have also attached the base_tree.txt file that i used base_tree.txt
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb/v2$ flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
2021-03-08 17:53:47,110 custom_taxonomy_databases [INFO ] FlexTaxD logging initiated!
Creating a new FlexTaxD database base.db using base_tree.txt, press any key to continue...
2021-03-08 17:53:48,753 custom_taxonomy_databases [INFO ] Loading module: ReadTaxonomy
2021-03-08 17:53:48,904 DatabaseConnection [INFO ] base.db opened successfully.
2021-03-08 17:53:48,912 custom_taxonomy_databases [INFO ] Parse taxonomy
2021-03-08 17:53:48,912 ReadTaxonomy [INFO ] Read nodes in taxonomy file False
Traceback (most recent call last):
File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/bin/flextaxd", line 10, in <module>
sys.exit(main())
File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/custom_taxonomy_databases.py", line 267, in main
read_obj.parse_taxonomy() ## Parse taxonomy file
File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 48, in parse_taxonomy
self.read_nodes(treefile=treefile)
File "/exports/watson/Prateek/apps/miniconda3/envs/flextaxd/lib/python3.9/site-packages/flextaxd/modules/ReadTaxonomy.py", line 99, in read_nodes
raise InputError("Your input tree file does not contain the headers to specify child and parent!")
modules.ReadTaxonomy.InputError: Your input tree file does not contain the headers to specify child and parent!
I think there are issues with tabs and spaces in the base_tree.txt file that is confusing flextaxd? would you be willing to share your base tree file as an attachment? Sorry to bother you! but i do want to get it right, seeing that the version 1 commands work so well!
Ah yes, there is a format documentation formats where this is explained. While importing a tree in the format that flextaxd uses as its standard (two or three column) a header is required to determine which column is the parent and which is the child. NCBI uses child parent whereas in my experience the parent child order is most common at least in the network community.
The tree must contain headers to determine which of the columns contains the parent and header, the level column is optional
<header>parent \tchild \tlevel\n
<node>Francisella tularensis \tFrancisella tularensis tularensis \tsubspecies\n
<node>Francisella tularensis tularensis \tB.6 \tsubsubspecies\n
<node>Francisella tularensis \tFrancisella tularensis holarctica \tsubspecies\n
Ah the tabs were indeed the issue.. I was confused about adding a tab or a space between Bacterial and domain. The version 2 path works much better actually, and is a lot more straightforward.
Unfortunately, i face the same "no reads found in the kraken report" error when i use the bracken database. I used to hit the same error even when i built the whole genome bracken database. However, i didnt really see this error when i built the whole genome database using Struo. Do you have any suggestions on why this might be happening?
bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.krep -o 1766.bout -w 1766.brep -l G
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i 1766.krep -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-09-2021 10:48:02
Error: no reads found. Please check your Kraken report
I can see my kraken report file has the read numbers displayed but for some reason its not being picked up
head 1766.krep
0.01 5 5 U 0 unclassified
99.99 49210 0 R 1 root
99.99 49210 220 R1 2707 Bacteria
74.97 36897 0 R2 18158 Firmicutes_A
74.72 36773 684 R3 18159 Clostridia
38.86 19125 15 R4 18834 Lachnospirales
38.71 19050 7481 R5 18897 Lachnospiraceae
8.82 4343 4175 R6 18991 Blautia_A
0.28 138 138 R7 19007 Blautia_A_sp900066165
0.03 15 15 R7 19011 Blautia_A_sp900066505
It still looks like the levels are not correct in the database. Did you base it on GTDB? Instead of F, G, S I can see R5,R6,R7
I have not used bracken myself actually, but perhaps choosing level "R6" instead of G could solve the problem unless bracken requires the standard levels.
Yes, i pretty much followed the walk through. Also, as i said i noticed this error when i built whole genome databases with GTDB genomes. It only happens with bracken too,there are no issues with running kraken2. I also dont understand why the levels say R2-R7... that is indeed odd...
The levels do look weird. here is an example where the levels jump from family to genus/ species. I also used R5-R6 but the error stays the same.
kraken2-inspect --db /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ --use-mpa-style
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales 1426
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10 356
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10|s__GW2011-AR10 sp000830275 356
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae 314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum 314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum|s__Iainarchaeum andersonii 314
p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GCA-2688035 311
i havent had a chance yet, but ill try to generate the same with the whole genome datasets. I deleted the last build of my whole genome datasets when i couldnt solve the no reads found error. Well, at the very least i can still use kraken2, and for that i am very grateful. Thank you!
In my database the levels are kept, can you check how the dump would look from the indivdual databases bac120 and ar122?
Also how does the dump look are the levels in the nodes.dmp, perhaps you forgot to dump the new database before building your kraken2 database (I did that many times).
Using R6 instead of G seems to work for me (as well as using G in the GTDB_orig database that I´ve made
python src/est_abundance.py -i MOCK_kraken2_rawreads_0.0.report -o test.bout -k T_v2_krakendb/database1550mers.kmer_distrib -l R6 -t 0
PROGRAM START TIME: 03-09-2021 11:26:48
BRACKEN SUMMARY (Kraken report: kraken2/MOCK_kraken2_rawreads_0.0.report)
>>> Threshold: 0
>>> Number of R6 in sample: 805
>> Number of R6 with reads > threshold: 805
>> Number of R6 with reads < threshold: 0
>>> Total reads in sample: 217994
>> Total reads kept at R6 level (reads > threshold): 200133
>> Total reads discarded (R6 reads < threshold): 0
>> Reads distributed: 0
>> Reads not distributed (eg. no R6 above threshold): 11930
>> Unclassified reads: 5931
BRACKEN OUTPUT PRODUCED: test.bout
PROGRAM END TIME: 03-09-2021 11:26:48
Bracken complete.
I can see the same jump
kraken2-inspect --db GTDB_orig_krakendb_test2/ --use-mpa-style | grep "p__Iainarchaeota" | head
k__Archaea|p__Iainarchaeota 2597
k__Archaea|p__Iainarchaeota|c__Iainarchaeia 2597
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales 1426
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10 356
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GW2011-AR10|s__GW2011-AR10 sp000830275 356
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae 314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum 314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__Iainarchaeaceae|g__Iainarchaeum|s__Iainarchaeum andersonii 314
k__Archaea|p__Iainarchaeota|c__Iainarchaeia|o__Iainarchaeales|f__GCA-2688035 311
It certainly looks like an error
GB_GCA_013332025.1 d__Archaea;p__Iainarchaeota;c__Iainarchaeia;o__Iainarchaeales;f__GW2011-AR10;g__GW2011-AR10;s__GW2011-AR10 sp000830275
GB_GCA_000830275.1 d__Archaea;p__Iainarchaeota;c__Iainarchaeia;o__Iainarchaeales;f__GW2011-AR10;g__GW2011-AR10;s__GW2011-AR10 sp000830275
But it is caused by identical names for the species and family nodes I have noticed this problem but have not yet implemented a solution #37
fGW2011-AR10;gGW2011-AR10
huh strange. Ill just rebuild the database again. but my taxonomy files look clean;
flextaxd -db ../bac120.db --dump -o trial --dbprogram kraken2 --verbose --logs logs/GTDB
2021-03-09 12:39:21,393 custom_taxonomy_databases [INFO ] FlexTaxD logging initiated!
Warning: {names} and/or {nodes} already exists, overwrite? (y/n): y
2021-03-09 12:39:23,259 custom_taxonomy_databases [INFO ] Loading module: WriteTaxonomy
2021-03-09 12:39:23,271 DatabaseConnection [INFO ] ../bac120.db opened successfully.
2021-03-09 12:39:23,271 WriteTaxonomy [INFO ] Write tree to: trial/nodes.dmp
2021-03-09 12:39:23,639 WriteTaxonomy [INFO ] Write annotations to: trial/names.dmp
2021-03-09 12:39:23,968 custom_taxonomy_databases [INFO ] --- Time summary 0 minutes 2.591289520263672 seconds---
(flextaxd) prateek@t620:/exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb$ more trial/nodes.dmp
1 | 1 | no rank | | |
2 | 2 | no rank | | |
3 | 2 | no rank | | |
4 | 3 | superkingdom | | |
5 | 3 | superkingdom | | |
6 | 3 | superkingdom | | |
7 | 2 | superkingdom | | |
8 | 2 | no rank | | |
9 | 2 | no rank | | |
10 | 4 | phylum | | |
11 | 10 | class | | |
12 | 11 | order | | |
13 | 12 | family | | |
14 | 13 | genus | | |
i rebuilt the database, and now i dont see the R4-R7 in my report files anymore! its the normal initials. I however hit the same read error issue though... How did you build your bracken database btw? Maybe its because im using the default bracken-build command?
1) Build bracken database: Copy over genomes to library, and unzip seqid2taxid.map.gz
cp -r genomes/ GTDB_orig_krakendb/
cd GTDB_orig_krakendb/; mv genomes/ library
gunzip seqid2taxid.map.gz
2) Build bracken database:
bracken-build -d GTDB_orig_krakendb/ -l 250
>> Selected Options:
kmer length = 35
read length = 250
database = GTDB_orig_krakendb/
threads = 1
>> Checking for Valid Options...
>> Creating database.kraken [if not found]
database.kraken exists, skipping creation....
Finished creating database.kraken [in DB folder]
>> Creating database250mers.kmer_distrib
>>STEP 0: PARSING COMMAND LINE ARGUMENTS
Taxonomy nodes file: GTDB_orig_krakendb/taxonomy/nodes.dmp
Seqid file: GTDB_orig_krakendb/seqid2taxid.map
Num Threads: 1
Kmer Length: 35
Read Length: 250
>>STEP 1: READING SEQID2TAXID MAP
23091 total sequences read
>>STEP 2: READING NODES.DMP FILE
43029 total nodes read
>>STEP 3: CONVERTING KMER MAPPINGS INTO READ CLASSIFICATIONS:
250mers, with a database built using 35mers
23093 sequences converted (finished: )S_GCF_000576305.1)
Time Elaped: 0 minutes, 37 seconds, 0.00000 microseconds
=============================
PROGRAM START TIME: 03-09-2021 11:57:43
...1 total genomes read from kraken output file
...creating kmer counts file -- lists the number of kmers of each classification per genome
...creating kmer distribution file -- lists genomes and kmer counts contributing to each genome
PROGRAM END TIME: 03-09-2021 11:57:43
Finished creating database250mers.kraken and database250mers.kmer_distrib [in DB folder]
*NOTE: to create read distribution files for multiple read lengths,
rerun this script specifying the same database but a different read length
Bracken build complete.
3) generate kraken and bracken results:
kraken2 --db /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ --threads 10 --output 1766.k2 --report 1766.krep 22002005521766_F_filt.fastq
Loading database information... done.
49215 sequences (12.30 Mbp) processed in 0.295s (10013.1 Kseq/m, 2503.27 Mbp/m).
49210 sequences classified (99.99%)
5 sequences unclassified (0.01%)
(flextaxd) prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.k2 -o 1766.bout -w 1766.brep -l R5
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i 1766.k2 -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l R5 -t 10
PROGRAM START TIME: 03-09-2021 12:07:23
Error: no reads found. Please check your Kraken report
(flextaxd) prateek@t620:/exports/watson/Prateek/temp_projects/rick_16s/filtered/trial$ bracken -r 250 -d /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/ -i 1766.k2 -o 1766.bout -w 1766.brep -l G
>> Checking for Valid Options...
>> Running Bracken
>> python src/est_abundance.py -i 1766.k2 -o 1766.bout -k /exports/watson/Prateek/dbs/16s_gtdb/v2/GTDB_orig_krakendb/database250mers.kmer_distrib -l G -t 10
PROGRAM START TIME: 03-09-2021 12:07:26
Error: no reads found. Please check your Kraken report
4) view the report file:
0.01 5 5 U 0 unclassified
99.99 49210 0 R 1 root
99.99 49210 0 R1 2 cellular_organism
99.99 49210 220 R2 4 Bacteria
74.97 36897 0 P 1000034 Firmicutes_A
74.72 36773 684 C 1000035 Clostridia
38.86 19125 15 O 1000404 Lachnospirales
38.71 19050 7487 F 1000405 Lachnospiraceae
8.82 4343 4175 G 1000626 Blautia_A
0.28 138 138 S 1003577 Blautia_A sp900066165
It is good news that you can use it, that just means im messing up somewhere... :(
I´m not sure if there is any difference on moving the genomes back into the kraken or just keeping the library?
If you add --keep
when running flextaxd-create both the taxonomy directory and the library will be kept and you can run bracken-build directly.
Ok, i dont understand the why, but rebuilding the database with --keep did the trick! I have kraken2+bracken pipeline working flawlessly now! I cant thank you enough for all your support. Please do let me know if you think i can help in the future with any kind of testing at all!
If you were open to a suggestion, i would suggest to reach out to the GTDB people and mention that you have got a working pipeline to build kraken2 compatible 16s GTDB database. I will also spread the word :)
Hope you have a great week ahead!
until next time :)
That is great news :D
Excellent, I will update the walkthrough with some more explanatory text and make sure the --keep option is also well displayed. It would be great if you could have a look at it later and see if it is easy to follow, since you now have some experience of details and where things can easilly be misunderstood!
That is a good suggestion thank you, I will try to form the walkthrough after the process we went through here!
Thanks a lot and good luck with the downstream analysis!
I can do that, Ill also ask a couple of my friends to try it out and make sure it works in with different servers.
speaking on downstream analysis, do you use kraken-biom to merge all the report files? Im not sure why, but the number of OTUs drops a lot for me once i use kraken-biom to merge the reports. I generally load the biom file into phyloseq. Do you have an alternate suggestion?
That sounds wonderful thanks a lot!
I have not used kraken2 for 16S analysis yet. Although this is a future task, at the moment I can´t give a good answer to your OTU question at the moment.
I wasnt sure which thread to continue the conversation in, but since the pervious thread is merged with this, i thought I would add comments and issues in this thread. Please correct me if im wrong.
Thank you for helping me with the new version of FlextaxD. I however still have issues with building the 16s Kraken2 database. I have added the commands that i used below. I built the new 0.3.6 flextaxdb with python since for some reason it wouldnt install with conda.
Version check
Follow commands to build Kraken2 database:
Im not sure if this is a python issue or something that i am doing wrong.
Thank you!