Open ghost opened 5 years ago
Could you also show the histogram of the scores assigned by isContaminant? You can do that with something like hist(foo$score, n=100) if foo was what you assigned the output of isContaminant to. That will probably be more informative about the distribution of potential contaminants.
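For readers following along, here is a self-contained sketch of what that histogram can look like, using simulated scores in place of real isContaminant output (the object foo and all values below are made up; in real output the scores are stored in the $p column rather than $score):

```r
# Simulated stand-in for isContaminant() scores (hypothetical values;
# real output keeps the score in foo$p, not foo$score).
set.seed(1)
foo <- data.frame(p = c(rbeta(1800, 5, 5),    # non-contaminants: scores near 0.5
                        rbeta(105, 1, 20)))   # contaminants: scores near 0
summary(foo$p)

# Histogram with ~100 bins; a bimodal shape with a low-score mode is the
# signature of real contaminants being present.
h <- hist(foo$p, n = 100, main = "isContaminant scores", xlab = "score")
```

A clearly bimodal histogram suggests a sensible threshold between the two modes; a unimodal pile near 0.5 suggests little detectable contamination.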
- Is it a problem that I only have four control samples, but 84 real samples?
You are underpowered to detect contaminants with only 4 control samples. That is probably why you aren't detecting any at the default threshold of 0.1.
- Is it inappropriate to use my normalised sample counts? Do I have to use abundance values?
No. By default isContaminant will normalize to proportions anyway.
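That per-sample normalization can be sketched in base R (toy count matrix with samples as rows, the orientation isContaminant expects; all numbers are made up):

```r
# Toy count matrix: 3 samples (rows) x 4 taxa (columns); counts are made up.
counts <- matrix(c(10, 0, 5, 85,
                    2, 8, 0, 90,
                    0, 1, 3, 96),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:3), paste0("taxon", 1:4)))

# Normalize each sample (row) to proportions, the same idea isContaminant
# applies internally before scoring.
props <- sweep(counts, 1, rowSums(counts), "/")
rowSums(props)  # each row now sums to 1
```

Because of this step, feeding in counts already normalized for depth changes little for the prevalence method, which only uses presence/absence.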
- Does this mean that the same species occur in negative and biological samples and appear to be evenly distributed? May this be an issue of cross-contamination? However, just for fun I manually added a quite exotic organism (not found in any of the biological samples) to my negative controls (10000 normalised read counts) and I added the same exotic organism to three patient samples with 100 normalised reads, but the outcome remained unchanged. Why?
Probably because the contaminant threshold of 0.1 is underpowered given the small number of control samples (see above). What is the $score assigned to that exotic organism? It should be below 0.5, but apparently not quite below 0.1.
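The classification step itself is just a cut on that score: a feature is called a contaminant when its score falls below the chosen threshold. A sketch with a hypothetical score illustrates why the spiked-in organism can be missed at the default threshold but caught at a higher one:

```r
# Hypothetical score for the spiked-in exotic organism (made-up value,
# somewhere between the default 0.1 threshold and the 0.5 midpoint).
exotic_score <- 0.18

# decontam flags a feature when score < threshold:
exotic_score < 0.1   # not flagged at the default threshold
exotic_score < 0.5   # flagged with a more aggressive threshold
```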
Thank you so much for the fast reply!
No. By default isContaminant will normalize to proportions anyway.
I accidentally inserted my presence/absence species table instead of the normalised counts. This explains why the effect of the manually added exotic organism was so low. I repeated the procedure, this time using my abundance table. I still get 1905 FALSE and no TRUE contaminants, but now it looks more like this:
Could you also show the histogram of the scores assigned by isContaminant? You can do that with something like hist(foo$score, n=100) if foo was what you assigned the output of isContaminant to. That will probably be more informative about the distribution of potential contaminants.
The output of isContaminant does not contain a column called "score":
But if I plot the frequency:
What is the $score assigned to that exotic organism? It should be below 0.5, but apparently not quite below 0.1.
So you are probably right and four negative controls are just not enough to identify contamination. Do you have any other suggestions on what I could do to account for the negative controls instead?
Thanks for all your efforts!
Best wishes, Marie
That new plot looks very off; you shouldn't be getting prevalences greater than the number of samples you have.
The "score" is in the $p column. What do summary(foo$p) and hist(foo$p, 100) give as results?
Can you also post the exact command you are executing, i.e. foo <- isContaminant(...)? Because your results look kind of strange.
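For reference, a typical prevalence-method call looks something like the following sketch (all data are toy values; the requireNamespace guard keeps the snippet runnable even where decontam is not installed):

```r
# Toy data: 88 samples x 5 taxa of made-up counts; the last 4 rows are blanks.
set.seed(2)
seqtab <- matrix(rpois(88 * 5, lambda = 20), nrow = 88,
                 dimnames = list(c(paste0("patient", 1:84), paste0("blank", 1:4)),
                                 paste0("taxon", 1:5)))
is.neg <- grepl("^blank", rownames(seqtab))  # TRUE for the 4 negative controls

# The prevalence-method call (sketch of the documented interface):
if (requireNamespace("decontam", quietly = TRUE)) {
  foo <- decontam::isContaminant(seqtab, method = "prevalence",
                                 neg = is.neg, threshold = 0.1)
  print(table(foo$contaminant))
}
```

Note that samples must be rows and the neg argument is a logical vector marking the negative controls; a transposed table or a mis-specified neg vector produces exactly the kind of strange output described above.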
I have a similar problem. I have just one control, and each time I use one sample and one control with the prevalence method to identify contamination. I always get a result that no species is a contaminant, but I know that almost all of them are, so the result is wrong. Maybe one control is not allowed in decontam?
Maybe one control is not allowed in decontam?
For the prevalence method, having only one control sample will result in no valid contaminant assignments.
This is as designed, and as appropriate. There is not enough information in a single control sample to confidently describe contaminant taxa.
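One way to see why: the prevalence method reasons about a 2x2 presence/absence table per taxon. A base-R sketch with hypothetical data shows how degenerate that table becomes with a single control (this illustrates the idea, not the package's exact internals):

```r
# Hypothetical presence/absence of one taxon: 1 control, 84 biological samples.
present_in_controls <- c(TRUE)                       # a single observation
present_in_samples  <- rep(c(TRUE, FALSE), c(3, 81)) # present in 3 of 84

# The 2x2 prevalence table the method reasons about (sketch):
tab <- rbind(
  controls = c(present = sum(present_in_controls),
               absent  = sum(!present_in_controls)),
  samples  = c(present = sum(present_in_samples),
               absent  = sum(!present_in_samples))
)
tab
# With one control the top row can only ever be (1, 0) or (0, 1): a single
# observation, with no way to distinguish contamination from chance.
```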
Hello,
this is a very nice package! Thank you.
I sequenced 84 biological samples (cotton swabs) obtained from a low-biomass environment and 4 negative controls (blank swabs, no DNA) using a shotgun metagenomics approach. The blank swabs accompanied the biological samples during sample storage, DNA extraction, library preparation and the sequencing run. Quality filtering was performed, and eukaryotic reads were identified and discarded. Reference-based assembly was chosen to identify bacterial reads.
1905 bacterial hits were observed. The outcome table has reads normalised to genome size and sequencing depth. I wanted to test the decontam package in R to identify contamination based on my blank-swab controls and remove contamination from patient samples with the prevalence method.
However, when I run the package, the outcome of table(contamdf.prev$contaminant) was the following: 1905 "FALSE" and 0 "TRUE" hits, no matter which threshold I used within the function isContaminant(). The graphical outcome looked like this:
My questions are:
- Does this mean that the same species occur in negative and biological samples and appear to be evenly distributed? May this be an issue of cross-contamination? However, just for fun I manually added a quite exotic organism (not found in any of the biological samples) to my negative controls (10000 normalised read counts) and I added the same exotic organism to three patient samples with 100 normalised reads, but the outcome remained unchanged. Why?
- Is it inappropriate to use my normalised sample counts? Do I have to use abundance values?
- Is it a problem that I only have four control samples, but 84 real samples?
Thank you very much in advance!
Marie