DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

kraken2 mapping only 10% metagenomic reads from the gut microbiome #236

Open luthientinuviel opened 4 years ago

luthientinuviel commented 4 years ago

Hi there,

I am seeing very low mapping rates (~10-15%) for my whole-genome shotgun sequencing data using Kraken 2. The samples are gut microbiome samples from patients with heart disease, sequenced on an Illumina HiSeq X (paired-end 150 bp reads, 350 bp insert). The aforementioned mapping rate is after we use Kneaddata to remove human reads, adapters, etc.

1) This is my first time doing gut microbiome metagenomics, but as best as I can tell from the literature and from speaking with people who have worked with gut metagenomics and Kraken 2, the "expected" mapping rate for the human gut microbiome is >50% -- is this indeed correct? When we use HUMAnN2 for functional profiling of the same samples, we get a 50-60% mapping rate.

2) What might be the reason that I am getting such a low mapping rate?

Thanks so much for your help, LT

jenniferlu717 commented 4 years ago

It really depends on the type of sample and the database you are using. We have had some human samples that had 90-99% human in them.

What database are you using? Have you tried running without first removing human reads etc?

luthientinuviel commented 4 years ago

Hi Jen,

The database we use is the standard Kraken 2 database (which, per the website, contains "NCBI taxonomic information, as well as the complete genomes in RefSeq for the bacterial, archaeal, and viral domains, along with the human genome and a collection of known vectors (UniVec_Core)”).

We have gut microbiome samples from patients with heart disease.

10% mapping rate is after we remove human reads using Kneaddata — at this point it looks like approx 2% of reads get removed, which then leaves us with 98% of the raw reads.

I hope this info is helpful… Please let me know.

Thanks so much, Petra

jenniferlu717 commented 4 years ago

It could be that the sample is more fungal/eukaryotic and not bacterial/viral. You could try building a database with the protozoa/fungi libraries:

kraken2-build --download-library protozoa --db $DBNAME
kraken2-build --download-library fungi --db $DBNAME
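
To put that in context, here is a minimal sketch of what a full custom build might look like; the database name and thread count are placeholders, and the eukaryotic libraries add considerable download size and build memory:

```sh
# Hypothetical custom-build sketch; $DBNAME and thread count are placeholders.
DBNAME=k2_custom

# Taxonomy is required before any library can be added.
kraken2-build --download-taxonomy --db $DBNAME

# Standard libraries plus the eukaryotic ones suggested above.
kraken2-build --download-library archaea --db $DBNAME
kraken2-build --download-library bacteria --db $DBNAME
kraken2-build --download-library viral --db $DBNAME
kraken2-build --download-library human --db $DBNAME
kraken2-build --download-library UniVec_Core --db $DBNAME
kraken2-build --download-library protozoa --db $DBNAME
kraken2-build --download-library fungi --db $DBNAME

# Build the hash.k2d / opts.k2d / taxo.k2d files.
kraken2-build --build --db $DBNAME --threads 16
```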

luthientinuviel commented 4 years ago

Thanks Jen. I will try doing this. We are having the same issue with several samples, and I have no reason to suspect that these individuals should have GMB that are heavily protozoal/fungal (as opposed to being bacterial/viral), though that’s certainly possible. I guess my question is also — for an “average” human gut microbiome, what is the “expected” percent of mapped reads (at least in your experience)?

Thanks so much, Petra

jenniferlu717 commented 4 years ago

I'm not actually sure; we worked on brain and corneal samples, not gut ones. Let me know what you find with the other database? The other option is building an nt database to try to get the broadest possible picture of what is in the sample.

luthientinuviel commented 4 years ago

Thanks Jen. If anyone else is willing to share their experience with Kraken 2 in human gut microbiome, I would really appreciate hearing about it.

Petra

alimayy commented 4 years ago

Hi Petra,

Our experience with healthy human gut (faecal) samples is that 70-80% of the reads are taxonomically classified. 10% is too low; with the default Kraken2 database, I would expect that number only from a much less studied environment like the rumen microbiome.

Could it be that you are using a very high confidence threshold? https://ccb.jhu.edu/software/kraken2/index.shtml?t=manual#confidence-scoring

Another possibility is a problem with the database during downloading or building. I would try a fresh install.
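
For reference, a fresh standard install boils down to a single command; this is just a sketch, with the directory name and thread count as placeholders (the full standard build needs substantial disk space and RAM):

```sh
# Hypothetical fresh rebuild of the standard Kraken 2 database.
kraken2-build --standard --db k2_standard_fresh --threads 16
```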

If nothing works, a nice way to troubleshoot is to assemble the unclassified reads (you can write them out) with e.g. MEGAHIT, and then check the taxonomic labels of the longest scaffolds with e.g. BLAST. You can compare those labels to what should, in principle, be in your Kraken2 database to troubleshoot further.
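
A rough sketch of that workflow, assuming paired-end FASTQs and a locally installed nt BLAST database; seqkit is used here only to pull out the longest contigs, and all file names are placeholders:

```sh
# Write out the reads Kraken 2 could not classify ('#' is replaced with the mate number).
kraken2 --db $DBNAME --paired --gzip-compressed --threads 16 \
    --unclassified-out unclassified#.fq \
    --report sample.kreport --output sample.kraken \
    sample_R1.fastq.gz sample_R2.fastq.gz

# Assemble the unclassified fraction.
megahit -1 unclassified_1.fq -2 unclassified_2.fq -o megahit_unclassified

# Take the 20 longest contigs and BLAST them against nt.
# staxids/stitle need a BLAST database that includes taxonomy information.
seqkit sort --by-length --reverse megahit_unclassified/final.contigs.fa | seqkit head -n 20 > longest_contigs.fa
blastn -query longest_contigs.fa -db nt \
    -outfmt "6 qseqid sseqid pident length evalue staxids stitle" \
    -max_target_seqs 5 -out longest_contigs.blastn.tsv
```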

luthientinuviel commented 4 years ago

Hi there,

Thanks for getting back to me. 70-80% makes more sense to me. I will check the confidence threshold. The other issue is that when we run Bracken on this data, we get ~4000 species in total (this is for ~200 patients), and the vast majority of them are viruses and phages, with only some bacteria. Additionally, a bunch of bacteria that we would expect to see taxonomically annotated (e.g. Faecalibacterium, B. theta, common Prevotella species, etc.) are not present in any of our samples. I suspect we'll have to do the fresh install, but what do you make of this?

Thanks so much, Petra

alimayy commented 4 years ago

It sounds like there was a problem during the downloading or integration of the sequences under <kraken2dbname>/library/bacteria/library.fna.

With the standard database options, this file should be about 70 GB. When the default Kraken2 db is built from this file together with the other libraries (viral, etc.), the resulting hash.k2d should be around 43 GB. If these numbers don't match, you might be missing a large section of the bacterial db, if not all of it.
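
One quick way to sanity-check those sizes on disk (the database path is a placeholder):

```sh
# Check the bacterial library and the final hash against the expected ~70 GB / ~43 GB.
ls -lh $DBNAME/library/bacteria/library.fna $DBNAME/hash.k2d

# Or get an overview of everything in the database directory.
du -sh $DBNAME/*
```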

Ali

luthientinuviel commented 4 years ago

This is probably the issue; our hash.k2d is ~8 GB. Just to clarify, is the "standard" database the same as the "default kraken2 db"? And what is hash.k2d exactly?

Thanks so much, Petra

luthientinuviel commented 4 years ago

(Please ignore the question about hash.k2d, I got it. Thanks)

alimayy commented 4 years ago

Yes, what I meant by standard and default is https://ccb.jhu.edu/software/kraken2/index.shtml?t=manual#standard-kraken-2-database

luthientinuviel commented 4 years ago

Thanks so much! We'll do the fresh library install. Petra

drish91 commented 4 years ago

Thanks to everyone for these very helpful troubleshooting suggestions!

I'm having very similar issues with some of my nasal microbiome samples. Post decontamination with kneaddata using the latest human ref GRCh38, kraken (using the standard db) still shows a mapping rate of >75-90% for human reads in the "clean" fastqs, which is strange because kneaddata already removes ~80-90% of the raw reads, leaving only ~10-15% host-removed clean data. Not quite sure why kraken still identifies human reads when kneaddata claims there are none.

To troubleshoot further, I took the assembly suggestion above by @alimayy and assembled one of these samples with metaSPAdes. The taxonomy of the genome belonged to the family Moraxellaceae (o. Pseudomonadales), which tells me a majority of those reads should be bacterial.

I also checked the library sizes for the standard kraken database downloaded, and the library.fna is 70G, while the hash.k2d is ~43G. So it isn't a database issue either.

Could kraken be mis-classifying reads? I don't want to use a "bacteria-focused" database, since that defeats the purpose of checking for host contamination.

Would love to hear thoughts on this!

pxo289 commented 4 years ago

Hi all, I've run into the same issue.

We're working with some understudied sample sets (faeces from a previously uninvestigated arid-adapted mammal), and as part of establishing which tools to use I ran the PE reads through Kraken2 (after removing host and human contamination). We got anywhere from 0.5-20% classification depending on the confidence score, which was fine given we anticipated poor classification.

However, as a check, I ran Kraken2 on a mock human gut community modelled on the HMP, containing 20 species with known relative abundances, at 5, 10, 25, 50, 75 and 100% confidence levels. These results were worrying: we never managed to classify any reads at a confidence score of 50% or above - they were 100% unclassified.

We also employed two different Kraken2 databases, both built from RefSeq, one from 2014 and one from Oct. 2019 - I can't speak to their exact contents as I didn't produce them - and I expected the number of false classifications at the species level (i.e. not one of the 20 known members) to be lower in the newer database, a pattern which held true for confidence levels of 5, 10 and 25%.

However, the complete failure to classify anything at a confidence score of 50% or higher means I am now thinking about not using Kraken. If it were a database error, I don't see why it would manage to classify reads below 50% confidence - more accurately than an older database, no less - and then stop classifying anything from 50% onwards.

Any advice about this would be appreciated.

jenniferlu717 commented 4 years ago

@drish91 I would use --minimum-hit-groups 4 to make sure it's not picking up false positives, and a higher confidence score (--confidence 0.2 maybe?) just in case. A single shared k-mer between human and bacteria can cause some of the human hits. It might even be human contamination in the bacterial genomes that is causing Kraken to misclassify the reads.
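
As a sketch, the stricter re-run might look something like this (input names and thread count are placeholders; --minimum-hit-groups needs a reasonably recent Kraken 2 release):

```sh
# Hypothetical stricter classification run to suppress spurious human hits.
kraken2 --db $DBNAME --paired --gzip-compressed --threads 16 \
    --confidence 0.2 --minimum-hit-groups 4 \
    --report sample.kreport --output sample.kraken \
    clean_R1.fastq.gz clean_R2.fastq.gz
```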

@pxo289 your issue is a bit different. I believe something to note is that the default Kraken2 downloads mask low-complexity sequences (those regions are effectively removed) before making the database. So depending on how many repetitive sequences are in a genome, a certain fraction of its k-mers will be left out of the database. For example, the Plasmodium (malaria) genomes tend to be much more repetitive, and therefore Kraken might remove 30% of the total k-mers. That means any reads coming from those parts of the genome will be unclassified.

I don't suggest a 50% confidence score; it's just too high. I think a 10% confidence score is sufficient in most cases. This means that 10% of the k-mers in a read need to match the database for the read to be called.

I also would say that the human gut community is much more likely to have matches in the kraken2 database (given that the database is primarily human/bacteria/viral/archaeal).

However, your other sample is likely to have far less matching because the default Kraken2 databases do not include any vertebrates/plants/fungi apart from human. The default database has human, bacterial, viral, archaeal, and maybe plasmids (I don't fully remember).

Hope this helps.

csmatyi commented 4 years ago

Hello all, I have a problem classifying reads from bacteria against bacterial genomes. I am downloading read data sets from SRA and running them against my Kraken2 db. Viruses work pretty well; it's just that I am not able to recover my bacterial genomes with these read sets. I have tried building the Kraken2 db with and without adding human, but in no case do I retrieve bacterial sequences.

rjsorr commented 2 years ago

A bit of a late addition to this one. From my own data, this indicates two things: 1. Your database does not reflect your sample; if you are getting 10% assigned against bacteria, it may simply be that 90% of the reads are not bacterial. To this end it is wise to do a kingdom-level classification first before going to the domain level. 2. Your samples are novel, and even though you have a database covering the relevant domain, it does not contain a good reference to classify your sample. As an example, updating my RefSeq CG database, which was a year old, increased classification by an average of 2%. A bigger database simply gives more classified reads, but to reiterate, a high degree of novelty is to be expected, especially for prokaryotes. @jenniferlu717, I am wondering if you have a good method for doing a kingdom-level classification prior to going to the domain level, one that is not limited by database biases? Diamond-->MEGAN is slow and not made for today's datasets, CG is a no-go due to sampling bias; maybe RefSeq proteins/transcripts are a better option? Or maybe a completely different method you know of? I can always open this as a separate issue if need be. Regards

CMagnoBR commented 1 year ago

Hi there, same issue here. I have used the 16 GB capped database available here: https://benlangmead.github.io/aws-indexes/k2, which includes archaea, bacteria, viral, plasmid, human, UniVec_Core, protozoa & fungi taxids. However, the maximum classified percentage is about 20%. I have used --confidence 0.0 --minimum-hit-groups 1 to improve the number of classified reads. My samples are from a metagenomic rhizosphere study. When I ran the same samples on the Galaxy server with a prebuilt database ("PlusPF" 2021-05-17), about 50% of sequences were classified. Can someone help me, please? Regards.

rjsorr commented 1 year ago

@CMagnoBR so the only difference between the databases is the 16 GB cap? I have never run a capped database, so I don't know the process. Either way, I would encourage you to make your own database and use genome_updater to do so. I imagine you are expecting a lot of fungi, and PlusPF is not suited to your needs (how many fungal genomes are in RefSeq CG?). Also, you are going to get a lot of false positives with those parameters! Yes, you get more reads classified, but you can't trust the result. Google "kraken2 --confidence" to see the discussion. Basically, it is better to use stringent parameters to reduce false positives, even at the cost of a lower total classification.
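
As a rough sketch of that custom-build route, assuming the genomes have already been downloaded (e.g. with genome_updater) and their FASTA headers either carry NCBI accessions covered by accession2taxid or explicit kraken:taxid tags; all paths here are placeholders:

```sh
# Hypothetical custom build from locally downloaded genomes; directory names are placeholders.
DBNAME=k2_rhizosphere

# Taxonomy must be in place before adding sequences.
kraken2-build --download-taxonomy --db $DBNAME

# Add each genome FASTA to the library.
for fna in downloaded_genomes/*.fna; do
    kraken2-build --add-to-library "$fna" --db $DBNAME
done

# Build the final database.
kraken2-build --build --db $DBNAME --threads 16
```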