Open RichardCorbett opened 2 years ago
Are you using a prebuilt kraken database or did you build your own?
I personally manually look up the reference genome for the given species/strain.
I am using the pre-built kraken DB. Do you have any tips about how to look up the references? You'll see what I've been doing above, but it is problematic in that I don't have a good way to know I'm using the same version of the reference genome that was used to build the kraken database.
Manually, I just look up the species/strain from ncbi: https://www.ncbi.nlm.nih.gov/taxonomy and that would just give the reference genome. I don't know how to do it via command line though.
The kraken database uses only complete genomes, so if you can filter for genomes that are labelled by "Complete" in ncbi, that should help
Thanks @jenniferlu717 ,
I am trying to use ncbi-genome-download
which greatly simplifies what is needed to download. However, if I filter say, for HPV using:
ncbi-genome-download -g "Human papillomavirus type 16" -F fasta -l complete all
I get 2 hits:
./GCF_000863945.3
./GCF_000863945.3/MD5SUMS
./GCF_000863945.3/GCF_000863945.3_ViralProj15505_genomic.fna.gz
./GCF_002826985.1
./GCF_002826985.1/MD5SUMS
./GCF_002826985.1/GCF_002826985.1_ASM282698v1_genomic.fna.gz
do you have any tips about how to know which, if either, of these was used in kraken2?
Kraken uses any sequence that is complete. It could be that it uses both.
Hi folks,
I am using Kraken2 to classify paired-end illumina reads. Most commonly I want to check for known cancer-associated organisms (EBV, HPV, etc.) in my data, but I am always interested to see what else is in there.
To evaluate the presence of a non-human organism I first search for all species listed above some minimum fraction of reads or minimum read count in the
kraken_report
. Then I use a block of code (example below) to pull down an associated .fa file from NCBI. Then I align the reads that kraken2 reported match the genome and check to see what fraction of the genome is covered, or how many unique alignments are present. This last step helps identify kraken-detected microbes that are similar to adapter or reagent sequence as reads in those cases usually all stack up in a tiny region of a microbe's genome.My messy code to pull down the reference file.
species_file
contains the organism name I pulled from thekraken_report
. For example, it might beHuman gammaherpesvirus 4
You'll note that I am taking the first returned hit (
head -n 1
), which is certainly not the best option, and occasionally gives me back low quality references.So my question is - what is the best way to get the reference genome for any hit detected in kraken2?