DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License
687 stars 267 forks

Nucleotide v. protein databases. #200

Open vpricha opened 4 years ago

vpricha commented 4 years ago

I have run into an interesting situation when comparing assignments between nucleotide and protein databases. My databases contain NCBI bacterial, fungal, and human sequences, built using kraken2-build --download-library *

My samples contain bacterial and human reads (RNAseq libraries). When comparing taxonomic assignments between the databases, the bacterial assignments are very similar. However, there is a large discrepancy in the proportion of reads assigned to human. Using the nucleotide database, human is on average ~30% of the reads per sample. Using the protein database, human is ~5%, with a corresponding increase in unassigned reads. I've checked the databases and they contain the same human sequence data.

Thanks in advance.

Vince

jenniferlu717 commented 4 years ago

Are the read counts about the same or are they also far off? Did you see whether the human reads classified by the nucleotide database were classified as something else in the protein database? Or are the human reads now unclassified?

I'm not sure what your question is/how to compare these results.
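One way to answer the questions above is to compare the two runs read by read. The following is a minimal sketch, assuming both runs were made on the same reads and written in Kraken 2's standard tab-separated output (classification status, read ID, taxonomy ID, length, LCA mapping) without --use-names. The function names and the sample lines are hypothetical and only illustrate the comparison; they are not data from this thread.

```python
from collections import Counter


def read_assignments(lines):
    """Map read ID -> taxid from Kraken 2 standard output lines.

    Assumes tab-separated columns: status (C/U), read ID, taxid, ...
    Unclassified reads are recorded with taxid "0".
    """
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        assignments[read_id] = "0" if status == "U" else taxid
    return assignments


def cross_tabulate(nt, prot):
    """Count reads for each (nucleotide taxid, protein taxid) pair."""
    counts = Counter()
    for read_id, nt_taxid in nt.items():
        counts[(nt_taxid, prot.get(read_id, "missing"))] += 1
    return counts


# Hypothetical example lines; in practice, pass open file handles for the
# two output files instead of these lists.
nt_lines = [
    "C\tread1\t9606\t100\t9606:66",
    "C\tread2\t9606\t100\t9606:66",
    "C\tread3\t9606\t100\t9606:66",
    "C\tread4\t562\t100\t562:66",
    "U\tread5\t0\t100\t0:66",
]
prot_lines = [
    "C\tread1\t9606\t100\t9606:20",
    "U\tread2\t0\t100\t0:20",
    "U\tread3\t0\t100\t0:20",
    "C\tread4\t562\t100\t562:20",
    "U\tread5\t0\t100\t0:20",
]

table = cross_tabulate(read_assignments(nt_lines), read_assignments(prot_lines))
for (nt_taxid, prot_taxid), n in sorted(table.items()):
    print(f"nt={nt_taxid}\tprot={prot_taxid}\treads={n}")
```

A large count in the row with nt=9606 and prot=0 would indicate that reads called human by the nucleotide database became unclassified, rather than being reassigned to another taxon, in the protein run.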

vpricha commented 4 years ago

Hi Jennifer,

Here’s a summary. I have four samples. I classified each sample once using a nt db and once using a protein db (human, fungus, bacteria). As you can see, the bacterial assignments remained about the same, but the chordate assignments were much lower when using the protein db (those reads became unclassified). I would have assumed that we would get more assignments with the protein db, given the allowed variation at the third codon position.
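The expectation about third-codon-position variation can be illustrated with a small sketch. It translates two hypothetical DNA fragments that differ only at third codon positions and shows that they encode the same peptide under the standard genetic code. The sequences and function names are made up for illustration, and the sketch does not model how Kraken 2 actually performs translated search or k-mer matching.

```python
# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(dna):
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two hypothetical fragments that differ only at the third base of each codon.
seq_a = "GCTGAACTG"
seq_b = "GCCGAGCTC"

print(count_mismatches(seq_a, seq_b))       # nucleotide differences
print(translate(seq_a), translate(seq_b))   # encoded peptides
```

Here the two fragments differ at three of nine nucleotides yet translate to the same three residues, which is the reasoning behind expecting a protein database to tolerate more sequence divergence.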

Thanks in advance for your help!

Vince

[Attached screenshot of the summary; image not included]

Vincent P. Richards, PhD, Assistant Professor, Department of Biological Sciences, Clemson University, Clemson, SC 29634, vpricha@clemson.edu, http://www.vprichards-lab.com


solymosin commented 4 years ago

Dear Jennifer,

We have been looking into this phenomenon; please see: https://github.com/solymosin/nuc_or_prot/blob/master/Toth_AG.pdf

All the best, Norbert
