benjjneb / decontam

Simple statistical identification and removal of contaminants in marker-gene and metagenomics sequencing data
https://benjjneb.github.io/decontam/
148 stars 25 forks source link

Importing the data #115

Open lynneitelson opened 2 years ago

lynneitelson commented 2 years ago

Hi!

I'm pretty new to all of this. I have finished the dada2 amplicon sequencing output and have the asv ouput. I added the code of the input, is the input of the original data correct? The original file has four samples + Negative and positive control.

Thank you!

decontem_data.csv

`` library(phyloseq) library(ggplot2) install.packages("decontem") library(decontam)

setwd("D:/MichaI20/Desktop/meta/bbduk/decontem") decontem_data <- read.csv("decontem_data.csv", header = TRUE) decontem_data[1:3, 1:3] ASV X1 X2 1 1 19677 6766 2 2 29402 8794 3 3 23870 7339

decontam_matrix <- data.matrix(decontem_data, rownames.force = NA) decontam_matrix[1:3, 1:3] X X1 X2 [1,] 2770 19677 29402 [2,] 1672 6766 8794 [3,] 2640 675 1232 decontem_metadata.csv decontem_data <- read.csv("decontem_metadata.csv", header = TRUE) decontem_data[1:2, 1:2] one true_sample 1 two true_sample 2 three true_sample decontam_matrix <- data.matrix(decontem_data, rownames.force = NA) decontam_matrix[1:2, 1:2] one true_sample [1,] 4 2 [2,] 3 2 "

benjjneb commented 2 years ago

It's probably easiest to not write things out to csv files, but to either save them as RDS files (saveRDS, readRDS) that maintain the full structure of the R object, or to just do both analyses in a single R session. For example, the dada2 tutorial ends with a section showing how to move the data into the phyloseq package, which works well with decontam.

That said, if you need to use .csv files as an intermediate, then you need to pay close attention to the options of the read.table or read.csv functions when you import the data, and then modify the imported data as needed to match the correct data format.

IN short, for the meta data you want a data.frame with rownames equal to the sample names. And you want the ASV table to have rows corresponding to samples, columns corresponding to ASVs, and be in matrix format. That is the data type that you will have at the end of the DADA2 tutorial workflow, but it won't be preserved in exactly that format when saving/reading from a csv file unless one is careful with the flags in those export/import functions.

lynneitelson commented 2 years ago

Thanks!

I used the phyloseq object from the dada2. However when i attempt to continue the code i get the following error:

`` ps <- phyloseq(otu_table(seqtab, taxa_are_rows=FALSE), tax_table(taxa1)) ps <- prune_samples(sample_names(ps) != "Mock", ps) # Remove mock sample

dna <- Biostrings::DNAStringSet(taxa_names(ps)) names(dna) <- taxa_names(ps) ps <- merge_phyloseq(ps, dna) taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps))) ps

head(sample_data(ps)) df <- as.data.frame(sample_data(ps)) "

Output: image

Thank you,

Lynne

benjjneb commented 2 years ago

The phyloseq object you have created does not contain sample metadata, hence the sample_data(ps) command cannot return anything and you get this error.

Some guidance on that on the phyloseq site: https://joey711.github.io/phyloseq/merge.html#merge_phyloseq

In short, you'll need to read in your sample metadata into a data.frame with rows that are samples and columns that are different sample variables. This is often stored in the form of a spread sheet or a tsv/csv file. If in a spreadsheet, exports it as a csv and it can be read into a data.frame in R with read.csv.

lynneitelson commented 2 years ago

Thanks! sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control Sample" contamdf.prev <- isContaminant(ps, method="prevalence", neg="is.neg") table(contamdf.prev$contaminant)

How do I implement the positive control (mock community) in the package?

Lynne

benjjneb commented 2 years ago

There is no method in decontam that uses positive controls to identify contaminants. You can read about the two approaches for contaminant identification (frequency and prevalence) implemented by decontam in the paper introducing the method: https://doi.org/10.1186/s40168-018-0605-2

lynneitelson commented 2 years ago

Thank you! `` ps2 <- phyloseq(sample_data(metadata_decontem), otu_table(seqtab, taxa_are_rows=FALSE), tax_table(taxa1))

head(sample_data(ps))

Inspect library size

df_decontem <- as.data.frame(sample_data(ps2)) # Put sample_data into a ggplot-friendly data.frame df_decontem$LibrarySize <- sample_sums(ps2) df_decontem <- df_decontem[order(df_decontem$LibrarySize),] df_decontem$Index <- seq(nrow(df_decontem)) df_decontem$Sample_or_control <- metadata_decontem$Sample_or_Control

ggplot(data = df_decontem, aes(x=Index, y=LibrarySize, color = Sample_or_control)) + geom_point()

sample_data(ps2)$is.neg <- sample_data(ps2)$Sample_or_Control == "Control Sample" contamdf.prev <- isContaminant(ps2, method="prevalence", neg="is.neg") table(contamdf.prev$contaminant) head(which(contamdf.prev$contaminant)) `` I got the following outputs, what might of gone wrong along the way? image

Thank you,

Lynne

lynneitelson commented 2 years ago

image

The table also had a lot of NA's in it