NorwegianVeterinaryInstitute / Talos

A shotgun metagenomic analysis pipeline using nextflow
BSD 3-Clause "New" or "Revised" License

create kraken2 database with host genomes #24

Open Thomieh73 opened 4 years ago

Thomieh73 commented 4 years ago

Because this pipeline is going to be used by multiple projects with a variety of host organisms, we need a database that contains all of these hosts. The first goal is to be able to remove the reads that match these genomes; the second is to check what fraction of the reads match the host genomes.

The host genomes are: Dog, Cow, Horse, Sheep, Pig, Chicken, and Salmon.

To do that, I need to prepare the host genomes as follows:

  1. low-complexity regions should be masked.
  2. regions that match microbial genomes (because of contamination) should also be masked.
  3. regions that match common fungal genomes should also be masked.

If these regions are not masked, we risk losing reads of microbial origin: contamination means the host assemblies may contain microbial sequence, so genuine microbial reads could be classified as host and removed.
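The masking step itself can be sketched in a few lines of Python. This is only an illustration, not what the pipeline does: it assumes the regions to mask have already been found by some external tool (a low-complexity filter, or alignments against microbial and fungal genomes) and are available as 0-based, half-open `(start, end)` coordinates. The function name `mask_regions` and the toy sequence are made up for this example.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Return `sequence` with every (start, end) region replaced by `mask_char`.

    Regions are 0-based and half-open, i.e. (2, 6) masks positions 2, 3, 4 and 5.
    """
    masked = list(sequence)
    for start, end in regions:
        if start < 0 or end > len(sequence) or start > end:
            raise ValueError(f"region ({start}, {end}) does not fit the sequence")
        # Overwrite the region with the mask character, keeping the length unchanged.
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


# Toy input: (2, 6) stands in for a low-complexity run,
# (8, 10) for a stretch that matched a microbial genome.
example = mask_regions("ACGTTTTTGGCA", [(2, 6), (8, 10)])
print(example)
```

Hard-masking with `N` keeps the coordinates of the genome intact, so the masked FASTA can still be compared position by position with the original assembly.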

After masking those genomes, I then need to add them to a standard Kraken 2 database that consists of RefSeq microbial genomes, viruses, plasmids, and fungi.
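A rough sketch of how that could look, assuming the standard `kraken2-build` workflow (`--download-taxonomy`, `--download-library`, `--add-to-library`, `--build`). Custom sequences added with `--add-to-library` need a taxonomy ID, which can be given in the FASTA header as `kraken:taxid|<id>`. The taxids below are my assumption for the seven hosts ("Salmon" is taken to mean Atlantic salmon) and should be checked against NCBI Taxonomy. The database name, the FASTA file names, and mapping "microbial" to the `bacteria` and `archaea` libraries are also assumptions. The script only builds the command lists; it does not run anything.

```python
# Assumed NCBI taxonomy IDs for the host species; verify before use.
HOST_TAXIDS = {
    "dog": 9615,      # Canis lupus familiaris
    "cow": 9913,      # Bos taurus
    "horse": 9796,    # Equus caballus
    "sheep": 9940,    # Ovis aries
    "pig": 9823,      # Sus scrofa
    "chicken": 9031,  # Gallus gallus
    "salmon": 8030,   # Salmo salar (assumed species)
}


def tag_fasta_header(header, taxid):
    """Add a `kraken:taxid|<id>` tag to the sequence ID of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    seq_id, _, description = header[1:].partition(" ")
    tagged = f">{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged


def build_commands(db, host_fastas, threads=8):
    """Return the kraken2-build invocations as argument lists (nothing is executed)."""
    commands = [["kraken2-build", "--download-taxonomy", "--db", db]]
    for library in ("bacteria", "archaea", "viral", "plasmid", "fungi"):
        commands.append(["kraken2-build", "--download-library", library, "--db", db])
    for fasta in host_fastas:
        commands.append(["kraken2-build", "--add-to-library", fasta, "--db", db])
    commands.append(["kraken2-build", "--build", "--threads", str(threads), "--db", db])
    return commands


# Hypothetical file names for the masked host genomes.
masked_hosts = [f"{name}.masked.fa" for name in HOST_TAXIDS]
for command in build_commands("hosts_plus_microbes", masked_hosts):
    print(" ".join(command))
```

Each argument list can be passed to `subprocess.run(command, check=True)` once the paths are real.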

Thomieh73 commented 4 years ago

Kraken 2 classification is added.

It would still be good to create a database to see how many reads match the host genomes.
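Once such a database exists, the host fraction could be read straight from the Kraken 2 report. A minimal sketch, assuming the standard six-column tab-separated report (percentage, clade reads, direct reads, rank code, taxid, name) without the extra minimizer columns. The function name `host_read_counts` is hypothetical, and the report lines below are invented purely to show the format; they are not real results.

```python
def host_read_counts(report_text, host_taxids):
    """Return {taxid: clade read count} for the given taxids in a Kraken 2 report."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 6:
            continue  # skip blank lines and anything not in the assumed layout
        clade_reads = int(fields[1])  # reads assigned to this taxon or below it
        taxid = int(fields[4])
        if taxid in host_taxids:
            counts[taxid] = clade_reads
    return counts


# Invented example report (format illustration only, not real data).
sample_report = (
    "50.00\t500\t500\tU\t0\tunclassified\n"
    "50.00\t500\t0\tR\t1\troot\n"
    "30.00\t300\t300\tS\t9913\t      Bos taurus\n"
)
counts = host_read_counts(sample_report, {9913, 9823})
print(counts)
```

The clade count is used rather than the direct count so that reads assigned to a subspecies or strain below the host species are still included.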