DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Kraken2 database: Merge GTDB and viral refseq #273

Open HitMonk opened 4 years ago

HitMonk commented 4 years ago

Hello everyone, I was trying to build Kraken2 databases with GTDB. However, since GTDB consists only of bacterial and archaeal sequences, it would be ideal to build it along with the viral RefSeq. I'm not sure if this is even possible, or whether it is compatible with Bracken downstream. Please let me know if you have any suggestions.

jenniferlu717 commented 4 years ago

The only thing that I would be concerned about is the taxonomy. If the GTDB taxonomy does include the viral RefSeq taxonomy IDs, then it should work just fine.

Bracken also would not be affected by including both of these databases prior to building.

HitMonk commented 4 years ago

I'm sorry if this is a stupid question, but how can I check whether the GTDB taxonomy has the viral RefSeq taxonomy IDs? Also, is it possible for me to just merge the viral taxonomy IDs with the GTDB taxonomy? I had to merge the archaeal and bacterial taxonomy IDs, as GTDB provides them separately.

jenniferlu717 commented 4 years ago

You can check whether the taxonomy ID for one of the viral sequences is in the GTDB taxonomy files.
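A minimal sketch of that check, assuming the GTDB taxonomy has already been converted to an NCBI-style nodes.dmp (fields separated by "|", taxonomy ID in the first column). The function names are hypothetical, and 10239 is the NCBI taxonomy ID for "Viruses":

```python
def taxids_in_nodes(lines):
    """Collect the taxonomy IDs (first column) from nodes.dmp-style lines."""
    taxids = set()
    for line in lines:
        first = line.split("|", 1)[0].strip()
        if first:
            taxids.add(int(first))
    return taxids


def missing_taxids(wanted, lines):
    """Return the wanted taxonomy IDs that are absent from the nodes lines."""
    present = taxids_in_nodes(lines)
    return sorted(set(wanted) - present)


# Toy stand-in for a GTDB-derived nodes.dmp (assumed layout, not real data).
gtdb_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tdomain\t|\n",
]
print(missing_taxids([10239, 2], gtdb_nodes))
```

With a real file, the same function can be given an open file handle in place of the list, for example the nodes.dmp inside the taxonomy directory of the database being built (the exact path depends on your setup).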

If they're not, you will have to extract all of the taxonomy IDs belonging to viral sequences (and their parent taxids) from the RefSeq taxonomy and merge them with the GTDB taxonomy. (I believe just making sure that "Viruses" is connected to root would be enough.)
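One way to sketch this merge, assuming both taxonomies are available as NCBI-style nodes.dmp lines and that the GTDB-derived taxonomy IDs do not collide with the NCBI viral ones (the sketch raises an error if they do). All function names are hypothetical, and this is not how FlextaxD or Kraken2 implement it:

```python
def parse_nodes(lines):
    """Map taxid -> (parent taxid, raw line) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\n").split("|")]
        if len(fields) < 2 or not fields[0]:
            continue
        nodes[int(fields[0])] = (int(fields[1]), line.rstrip("\n"))
    return nodes


def with_ancestors(nodes, taxids):
    """Return the given taxids plus every ancestor up to the root."""
    keep = set()
    for taxid in taxids:
        while taxid not in keep:
            keep.add(taxid)
            parent = nodes[taxid][0]
            if parent == taxid:  # the root node is its own parent
                break
            taxid = parent
    return keep


def merge_nodes(gtdb_lines, ncbi_lines, viral_taxids):
    """Add the viral taxids and their lineages to the GTDB nodes."""
    gtdb = parse_nodes(gtdb_lines)
    ncbi = parse_nodes(ncbi_lines)
    merged = {taxid: raw for taxid, (_, raw) in gtdb.items()}
    for taxid in sorted(with_ancestors(ncbi, viral_taxids)):
        parent, raw = ncbi[taxid]
        if taxid in gtdb:
            # Shared nodes (such as root) must agree on their parent.
            if gtdb[taxid][0] != parent:
                raise ValueError(f"taxid {taxid} conflicts between taxonomies")
            continue
        merged[taxid] = raw
    return [merged[taxid] for taxid in sorted(merged)]


# Toy inputs (assumed layout, not real data); 10239 is NCBI "Viruses".
gtdb_nodes = ["1\t|\t1\t|\tno rank\t|\n", "2\t|\t1\t|\tdomain\t|\n"]
ncbi_nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10239\t|\t1\t|\tsuperkingdom\t|\n",
    "10240\t|\t10239\t|\tfamily\t|\n",
]
merged_nodes = merge_nodes(gtdb_nodes, ncbi_nodes, [10240])
```

The same subset of taxonomy IDs would also need its entries copied from the NCBI names.dmp so that the merged taxonomy has names for the viral nodes.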

HitMonk commented 4 years ago

That makes sense. I'll try this and report back in a week or so. Thank you so much for your help!

Rohit-Satyam commented 2 years ago

Hi! Were you able to achieve this?

HitMonk commented 2 years ago

@Rohit-Satyam Yep! I used FlextaxD to merge the databases. https://github.com/FOI-Bioinformatics/flextaxd