benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/
148 stars 25 forks source link

Identify contaminants by frequency - errors and warnings in r #13

Closed stangedal closed 6 years ago

stangedal commented 6 years ago

Hi there and thank you for your help with my last issue! Now I have been trying to run decontam on a subset of my samples but I am running into some trouble I cannot find a solution for my self. I hope you can bear with me as I go through it!

My data: 106 sputum samples of which 72 will be included in my actual study. I have DNA-consentration measurements for 60 of these 106 - 10 qubit, 50 picogreen. I thought I would run decontam on the whole lot before picking the participants for my subgroup analyses. My plan is to identify contaminants in the 60 samples where this is possible, and remove the ASVs from the entire dataset.

I am using qiime2 and silva with a self-trained classifier (99%). Qiime2 allows me to transform my asv.qza file to a biom file, and I get to add the taxonomic information (7 levels), and further add metadata with 1 variable containing the DNA concentration measurements (quant_reading) and 1 variable "run" allowing me to run batch option in decontam.

I import the biom table with taxa and metadata in r: biom1 <- biomformat::read_biom("ASVwithtaxaandmeta.biom") And r tells me: Warning message: In strsplit(msg, "\n") : input string 1 is invalid in this locale

I cannot figure out what this means and after a long day of googling and attempting different commands to see if I could get rid of the problem I gave up ;) So I went on with names(biom1) and all looks in order. [1] "id" "format" "format_url" "type" "generated_by"
[6] "date" "matrix_type" "matrix_element_type" "rows" "columns"
[11] "shape" "data"

Creating a phyloseq object also seems to be working:

ASV <- import_biom(biom1) ASV

phyloseq-class experiment-level object

otu_table() OTU Table: [ 3561 taxa and 106 samples ]

sample_data() Sample Data: [ 106 samples by 33 sample variables ]

tax_table() Taxonomy Table: [ 3561 taxa by 7 taxonomic ranks ]

sample_variables(ASV) gives a meaningful output, and as I know I am getting some issues with quant_reading latter on I also ran:

get_variable(ASV, "quant_reading") The first line of output being: [1] "37.7" "25.3" "25.9" "55.2" "61.7" "35.9" "37.6" "36.6" "14.5" "54"

In my data I now have 46 samples without dna measurements so I made a new phylosec object containing only those 60 with DNA concentration. ASV_pico <- subset_samples(ASV, decontambatch_1q_2p=="2") ASV_qubit <- subset_samples(ASV, decontambatch_1q_2p=="1") ASV_measure <- merge_phyloseq(ASV_qubit, ASV_pico)

contamdf.freq <- isContaminant(ASV, method="frequency", conc="quant_reading")

Error in isContaminant(ASV, method = "frequency", conc = "quant_reading") :

conc must be positive numeric.

WEIRD! So I check it again: get_variable(ASV_measure, "quant_reading") [1] "37.7" "25.3" "25.9" "55.2" "61.7" "35.9" "37.6" "36.6" "14.5" "54" "49.4"
[12] "53.7" "27.5" "29.4" "61.5" "32.9" "45.9" "36.4" "56" "30.2" "51.333" "42.1"
[23] "46" "66.2" "56.9" "72.4" "53" "66.2" "57.1" "37.9" "52.1" "7.31" "36.8"
[34] "40.9" "39.1" "66.3" "101.6" "71.9" "97.3" "91.9" "61.7" "22.7" "36" "45.7"
[45] "69.3" "107" "43.3" "90.4" "95.9" "84.8" "70.5" "79.5" "93.133" "59" "46.3"
[56] "56.9" "50.2" "47.7" "56.533" "54.5"

In the global environment in r the ASV files are all typed phyloseq so it should recognize quant_reading. To try to get around it I do as follows:

contamdf.freq <- isContaminant(ASV_measure, conc = c(37.7,25.3,25.9,55.2,61.7,35.9,37.6,36.6,14.5,54,49.4,53.7,27.5,29.4,61.5,32.9,45.9,36.4,56,30.2,51.333,42.1,46,66.2,56.9,72.4,53,66.2,57.1,37.9,52.1,7.31,36.8,40.9,39.1,66.3,101.6,71.9,97.3,91.9,61.7,22.7,36,45.7,69.3,107,43.3,90.4,95.9,84.8,70.5,79.5,93.133,59,46.3,56.9,50.2,47.7,56.533,54.5), method = "frequency", batch = "run")

freq prev p.freq p.prev p contaminant 101637aba474a6614786486ad42acb11 0.12372366 60 0.5953509 NA 0.5953509 FALSE 6e7317c14238561fe5865674e3a3206d 0.04367434 55 0.5497018 NA 0.5497018 FALSE 24522f7485582e07564aed7c426af59d 0.03783118 50 0.3527509 NA 0.3527509 FALSE 409f711b59152d57926cf444c5577087 0.02889763 53 0.4993289 NA 0.4993289 FALSE 55ae36374045b6a58fdc9c11936be5cc 0.02112490 54 0.4048830 NA 0.4048830 FALSE 0a018c977acabf99cf7be6c718ff4c66 0.00522744 12 0.4990972 NA 0.4990972 FALSE

So now I am getting my hopes up, and I run:

table(contamdf.freq$contaminant) telling me that 12 of 3549 ASV are contaminants.

which(contamdf.freq$contaminant)

[1] 78 157 327 354 370 499 578 764 882 1271 1485 1765

I move on to your plots and also when plotting I have to drop the quant_reading so the command looks like a mess: plot_frequency(ASV_measure, taxa_names(ASV_measure)[c(78,157,327,354,370,499,578,764,882,1271,1485,1765)], conc = c(37.7,25.3,25.9,55.2,61.7,35.9,37.6,36.6,14.5,54,49.4,53.7,27.5,29.4,61.5,32.9,45.9,36.4,56,30.2,51.333,42.1,46,66.2,56.9,72.4,53,66.2,57.1,37.9,52.1,7.31,36.8,40.9,39.1,66.3,101.6,71.9,97.3,91.9,61.7,22.7,36,45.7,69.3,107,43.3,90.4,95.9,84.8,70.5,79.5,93.133,59,46.3,56.9,50.2,47.7,56.533,54.5))

But I do get a plot (Plot1) where out of the 12 ASVs, 5 plots are empty, and then there is this mysterious NA coming along at the end...

It gets even stranger when I choose to increase the threshold to 0.5, where I get 323 contaminants - the first output of which(...) being: which(contamdfPrev05.freq$contaminant)

[1] 3 4 5 6 9 10 11 13 14 20 24 25 28 31 33 35 36 37 38 39

So most of the dominating ASV/OTUs are now interpreted as contaminants - but what do plotting this information give me: Plot2.

I am getting awfully confused here by these empty plots as I know that there is indeed sequences belonging to these prominent ASVs in all samples. And I cannot figure out why quant_reading is not being recognized? Is it possible for you to comment on these confusing elements? Plot1.pdf Plot2.pdf

benjjneb commented 6 years ago

First off, do not mix Qubit and Picogreen measurements! They are not the same, and very likely will result in incorrect classifications. You should probably just use the Picogreen samples in this case since there are 50 of them.

Second, the issue with using the phyloseq object directly is that your "quant_reading" variable is stored as strings, rather than numbers. That is easily fixed:

sample_data(ASV_pico)$quant_reading <- as.numeric(get_variable(ASV_pico, "quant_reading"))
# And then proceed w/ phyloseq object

Can you try fixing those issues and see where that gets you?

stangedal commented 6 years ago

OK - so using the batch option to separately identify contaminants in the 10 samples where qubit was used and the 50 where picogreen was used is not an option after all?

I got my strings into numbers thank you so much - the results are the same when it comes to plots, but that is tested on all 60, and not only on the 50 pico. I have to run that tomorrow! Solveig

So I had to test it this evening

I am glad to see quant_reading as numeric values! Running only those 50 samples measured with picogreen I found 5 contaminants with the default threshold, when raised to 0.2: 20 contaminants, at 0.5 252 contaminants.

which(contamdfPico.freq$contaminant)

[1] 764 882 1271 1485 1765

which(contamdfPico02.freq$contaminant) ##Threshold 0.2

[1] 205 326 328 399 456 533 535 556 575 764 840 882 1271 1351 1369 1388 1479 1485 1500 1765

which(contamdfPico05.freq$contaminant) ##Threshold 0.5 - only showing first line of output

[1] 6 9 11 20 24 28 33 38 40 45 46 47 49 53 56 57 60 62 67 85 92 97

plot_frequency(ASV_pico, taxa_names(ASV_pico)[c(7,764,882,1271)], conc = "quant_reading") Warning messages: 1: Transformation introduced infinite values in continuous y-axis 2: Removed 1000 rows containing missing values (geom_path). 3: Removed 1000 rows containing missing values (geom_path).

And again in the plots I get the one plot showing NA - and empty plots... I am not sure what this tells me, I just get convinced that I am doing something wrong with my files? Plot1_pico.pdf

stangedal commented 6 years ago

Hi again! We have kept on working with the plots and made the observation that when plotting some taxa neither the Black dotted line nor the red Line is being drawn. When this is the case we can view the dotplot of that single asv, but trying to run plot_frequences with 2 or more asv the "Line free" plot is called NA and the plot given the asv name is blank! Have you encountered this problem before?

benjjneb commented 6 years ago

I haven't seen that.

Can you post a minimal example, i.e. with just one ASV that is showing this behavior, and the plot output with that single asv (which works) and with another asv (when it doesn't work)?

stangedal commented 6 years ago

ASV_pico with threshold 0.1 give taxa number 1 as non-contaminant, number 764 as contaminant.

plot_frequency(ASV_pico, taxa_names(ASV_pico)[c(1)], conc = "quant_reading") plot_frequency(ASV_pico, taxa_names(ASV_pico)[c(764)], conc = "quant_reading") plot_frequency(ASV_pico, taxa_names(ASV_pico)[c(1,764)], conc = "quant_reading")

Taxa1and764.pdf Taxanumber1.pdf Taxanumber764.pdf

benjjneb commented 6 years ago

Thanks, that looks like a bug to me.

Can you share the phyloseq object causing this with me? E.g. saveRDS(ASV_pico, "ASV_pico.rds").

Feel free to send it by email to benjamin DOT j DOT callahan AT gmail DOT com if you don't want it on this forum.

benjjneb commented 6 years ago

Fixed in 2c8c561b02c9d16e853a227ebf902fce83bc4e3d

R does not like names that begin with numbers, which was the issue here, as the taxa here are named by sha-sums, many of which start with a number. Should work now though.

stangedal commented 6 years ago

It does indeed! Thank you so much for all your help :)