Removing chloroplast ASVs & discrepancy between assignTaxonomy & IdTaxa

pdcountway commented 5 years ago

Hi - I'm sequencing cyanobacteria with a 'cyano-specific' primer set. As expected, a large fraction of the 629 ASVs were assigned as Chloroplast (n = 178) by assignTaxonomy using the silva_nr_v132_train_set database. I also ran IdTaxa with the DECIPHER package but there were many fewer taxa assigned to Chloroplast (n=75) using the SILVA_SSU_r132_March2018.RData reference database.

Firstly, how do I remove the chloroplast ASVs and update the seqtab_nochim table? I've seen some of the previous discussions about this but it's not clear how the seqtab_nochim table gets updated. Any ideas on why there is such a large discrepancy between the two taxonomic assignment methods?

In general, the sequences seem to be good quality and read length (most were 378-382 bp after primer trimming). There were initially 2011 ASVs which dropped to 629 after chimera detection, but 92% of the sequences remained.

Thanks for any suggestions! -Pete

benjjneb commented 5 years ago

Hi Pete, You would just remove the appropriate columns from the sequence table (a matrix) using R commands. For example, let's say your data was generated something like below:

seqtab <- makeSequenceTable(mergers)
tax <- assignTaxonomy(seqtab, "path/to/silva_nr_v132_train_set.fq.gz")

Then are the chlorosplasts in the genus column? If so, the following should work, if they are assigned at a different rank make the necssary shift:

is.chloro <- tax[,"Genus"] %in% "Chloroplast"
setab.nochloro <- seqtab[,!is.chloro]

That second command just removes all columns corresponding to ASVs assigned to "Chloroplast" at the genus rank.

Any ideas on why there is such a large discrepancy between the two taxonomic assignment methods?

The are different methods and will give different results. IdTaxa is a bit more conservative about assigning taxonomic names, so it isn't suprising it assigned fewer ASVs as Chloroplasts, although the size of the difference is larger than what I usually see.

pdcountway commented 5 years ago

Thanks very much for the simple explanation Ben. I will say that IdTaxa was very fast and seems to agree well with the results of assignTaxonomy for known cyanos, but I'll probably stick with assignTaxonomy since I'm relying on this process to 'weed out' the probable chloroplast sequences. Cheers, Pete

pdcountway commented 5 years ago

Hi Ben - Your R code worked to pull out the chloroplast ASVs and write the new seqtab.nochloro table, but I lost the sample names that had previously populated the left-most column of the seqtab - leaving only sequential sample numbers, e.g., 1-28. How do I keep sample names? Thanks!

benjjneb commented 5 years ago

Hi Peter, What I'd suggest is to store the samples names as the rownames of the sequence table. This keeps the matrix format, and will keep the sample names around when the matrix is manipulated, and lets you index into the matrix by the samples names. That is also how the standard dada2 workflow operates, and what downstream tools will expect, so it will save you headaches later as well. For example:

rownames(seqtab) <- c("sample1", "sample2", "sample3")
seqtab["sample1",]

will name the rows by the sample names, and will return the vector of ASV abundances for sample1.

pdcountway commented 4 years ago

Hi Peter, What I'd suggest is to store the samples names as the rownames of the sequence table. This keeps the matrix format, and will keep the sample names around when the matrix is manipulated, and lets you index into the matrix by the samples names. That is also how the standard dada2 workflow operates, and what downstream tools will expect, so it will save you headaches later as well. For example:
rownames(seqtab) <- c("sample1", "sample2", "sample3")
seqtab["sample1",]
will name the rows by the sample names, and will return the vector of ASV abundances for sample1.

Hi @benjjneb , would you mind expanding on the example of storing the sample names as row names of the sequence table? How would you "index into the matrix" to reference these stored names for use in the final seqtab.nochloro ASV file?

benjjneb commented 4 years ago

I'm confused about what is going wrong at this point. Can you clarify? Maybe with code that is reaching a point where things aren't working as expected.

pdcountway commented 4 years ago

I'm confused about what is going wrong at this point. Can you clarify? Maybe with code that is reaching a point where things aren't working as expected.

Hi @benjjneb - I get what you're saying about storing the sample names as the row names for future reference but don't know how to implement this in my batch script. I guess I'm asking how to do this in a more 'generic' way, for the dozens of unique sample names in my data set and also incorporate that process with the two lines of code to remove chloroplast sequences. I don't think that you're suggesting in the example you provided that I need to hard code all of the sample names, e.g., sample1, sample2, sample3 etc...are you? Also, once the chloroplast sequences are removed and you have the seqtab.nochloro table, would you recommend running assignTaxonomy and addSpecies again, or is there more efficient way of doing this? Thanks, Pete

benjjneb commented 4 years ago

I don't think that you're suggesting in the example you provided that I need to hard code all of the sample names, e.g., sample1, sample2, sample3 etc...are you?

Nope definitely not. The sample names should come from the filenames of the samples you are processing. Typically, that happens automatically during the dada2 workflow. My confusion is where this is going wrong in your code. It would be helpful if you could show the code you are running, and identify when it is that the sample names are being lost. This is in part because the code I posted above to remove the chloroplast sequences should not remove the samples names, so I don't understand where a change is happening in your code.

pdcountway commented 4 years ago

Hi @benjjneb, The batch .R script for submitting dada2 jobs is attached - Note: I had to add the .txt file extension to the attached scripts in order to upload them. This script is served up via PBS Pro to our supercomputer using a short .sh submission script (also attached with added .txt file extension) containing details on the number of processors and amount of memory to use as well as the location of the dada2. The batch .R script includes a block of code at the end for writing a new ASV file without chloroplast sequences per your suggestion. The problem is that the resulting chloroplast-weeded ASV file lacks reference to the original sample names, instead just giving them sequential numbers as names (e.g., 1, 2, 3...). I have not been able to determine where to paste in the 'rownames(seqtab) <- etc....' line to store these row names as samples names. I'm guessing that this is relatively simple fix - or that something in the batch script isn't quite right causing a loss of the names. On a separate topic, I suspect that it is also possible to merge the final (weeded) seqtab file with the assignTaxonomy and/or addSpecies taxa files - but for now I'm just merging the .csv output files in Excel. As a side-note, using this batch script speeds up dada2 incredibly on our supercomputer. Jobs that used to take ~60-70 hours (in interactive mode) now complete in under an hour and yield the same results. Thanks for any insights that you may have! dada2_nochloro.R.txt submit_dada2_nochloro.sh.txt

benjjneb commented 4 years ago

@pdcountway The issue is the following store/save sequence you are using:

write.csv(taxa, file = "/path_to_cutadapted_files/cutadapt/ASV_taxa_sp.csv")
taxa_addSpecies <- read.csv("/path_to_cutadapted_files/cutadapt/ASV_taxa_sp.csv")
# AND
write.csv(nchar(getSequences(seqtab.nochim)), file = "/path_to_cutadapted_files/cutadapt/seqtab.nochim_nchar.csv")
seqtab.nochim_cyano <- read.csv("/path_to_cutadapted_files/cutadapt/seqtab.nochim.csv")

This write.csv/read.csv cycle does not result in the exact same object as what was originally in R. There is loss of various things like knowledge of data types and what exactly to do about rownmaes.

If you want to store/load things in R while getting back the exact same thing you saved, you should use saveRDS/readRDS. Replace write.csv/read.csv in the above, and your issue here goes away.

Or, if this is a self-contained R script, just drop the store/load part in the final code block entirely, and make copies of the R objects still in memory (e.g. taxa_addSpecies <- taxa).

befriendabacterium commented 2 years ago

Hi Ben

Would it be possible to make a quick mention of the default 'keeping chloroplasts' behaviour and how to remove them in the DADA2 tutorial, please? Would just flag it up to anyone who wants to remove them, which I suspect might be most people (though I may be wrong there)?

Thanks! Matt

benjjneb / dada2

Removing chloroplast ASVs & discrepancy between assignTaxonomy & IdTaxa #797