DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
706 stars 270 forks source link

How to access the reference.fa used to build the kraken2 DB? #567

Open RichardCorbett opened 2 years ago

RichardCorbett commented 2 years ago

Hi folks,

I am using Kraken2 to classify paired-end illumina reads. Most commonly I want to check for known cancer-associated organisms (EBV, HPV, etc.) in my data, but I am always interested to see what else is in there.

To evaluate the presence of a non-human organism I first search for all species listed above some minimum fraction of reads or minimum read count in the kraken_report. Then I use a block of code (example below) to pull down an associated .fa file from NCBI. Then I align the reads that kraken2 reported match the genome and check to see what fraction of the genome is covered, or how many unique alignments are present. This last step helps identify kraken-detected microbes that are similar to adapter or reagent sequence as reads in those cases usually all stack up in a tiny region of a microbe's genome.

My messy code to pull down the reference file. species_file contains the organism name I pulled from the kraken_report. For example, it might be Human gammaherpesvirus 4

spcs=`cat ${species_file}`
        search_result=`esearch -db assembly -query "\${spcs}" | efetch -format docsum | xtract -pattern DocumentSummary  -element FtpPath_GenBank | head -n 1`
        b_name=`basename \$search_result`
        download_path=\$search_result/\$b_name'_genomic.fna.gz'
        fasta_name=`cat ${species_file} | tr ' ' '_'`
        curl \${download_path} -o ${species_file.baseName}.fna.gz

You'll note that I am taking the first returned hit (head -n 1), which is certainly not the best option, and occasionally gives me back low quality references.

So my question is - what is the best way to get the reference genome for any hit detected in kraken2?

jenniferlu717 commented 2 years ago

Are you using a prebuilt kraken database or did you build your own?

I personally manually look up the reference genome for the given species/strain.

RichardCorbett commented 2 years ago

I am using the pre-built kraken DB. Do you have any tips about how to look up the references? You'll see what I've been doing above, but it is problematic in that I don't have a good way to know I'm using the same version of the reference genome that was used to build the kraken database.

jenniferlu717 commented 2 years ago

Manually, I just look up the species/strain from ncbi: https://www.ncbi.nlm.nih.gov/taxonomy and that would just give the reference genome. I don't know how to do it via command line though.

The kraken database uses only complete genomes, so if you can filter for genomes that are labelled by "Complete" in ncbi, that should help

RichardCorbett commented 2 years ago

Thanks @jenniferlu717 ,

I am trying to use ncbi-genome-download which greatly simplifies what is needed to download. However, if I filter say, for HPV using:

ncbi-genome-download -g "Human papillomavirus type 16" -F fasta   -l complete all

I get 2 hits:

./GCF_000863945.3
./GCF_000863945.3/MD5SUMS
./GCF_000863945.3/GCF_000863945.3_ViralProj15505_genomic.fna.gz
./GCF_002826985.1
./GCF_002826985.1/MD5SUMS
./GCF_002826985.1/GCF_002826985.1_ASM282698v1_genomic.fna.gz

do you have any tips about how to know which, if either, of these was used in kraken2?

jenniferlu717 commented 2 years ago

Kraken uses any sequence that is complete. It could be that it uses both.