caporaso-lab / student-microbiome-project

Central repository for data and analysis tools for the Student Microbiome Project.

OTU filtering process #1

Closed · floresg closed this issue 11 years ago

floresg commented 11 years ago

1). Split each OTU table into sample and control tables to determine which negative control OTUs to subtract

a). echo "split_otu_table.py -i /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_ref_only_otu_table.biom -o /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/control_sample_split/ -m /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/Forehead_MF_All_weeks.txt -f Type" | qsub -keo -N forerefsplit -q memroute -l pvmem=8gb (job id=23764)

b). repeat for the other 3 body habitats

2). Convert the .biom file of each control OTU table to a viewable text table

a). echo "convert_biom.py -i /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/control_sample_split/forehead_ref_only_otu_table_control.biom -o /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/control_sample_split/forehead_ref_only_otu_table_control.txt -b --header_key=taxonomy" | qsub -keo -N fcontxt (job id=23768)

b). repeat for the other 3 body habitats

3). View the OTU tables and determine which OTUs to remove based on abundance (the sum across the negative controls must be greater than or equal to 0.5% of total negative control sequences; see the sketch after this list)

a). for each body site, see the Excel document for which sample IDs were included and which OTU IDs were selected for removal

4). Remove the negative control OTUs from the original OTU table and generate two OTU tables: one to work with, the other containing the OTUs removed from the samples

a). echo "filter_otus_from_otu_table.py -i /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_ref_only_otu_table.biom -e /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_CR_otus_to_remove_0.5.txt -o /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_otu_neg_removed_0.5.biom" | qsub -keo -N frfilter3 (job id=23865)

b). echo "filter_otus_from_otu_table.py -i /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_ref_only_otu_table.biom -e /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_CR_otus_to_remove_0.5.txt -o /home/gifl2111/scratch/SMP/split_otu_table_by_body_site/Forehead/Ref_only/forehead_otu_neg_removed_0.5_ALL.biom --negate_ids_to_exclude" | qsub -keo -N frfilter4 (job id=23866)

c). repeat for the other 3 body habitats

5). Remove the negative controls from the OTU table

a). these are the working OTU tables

6). Remove sample IDs with fewer than 10K sequences from the mapping file
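Step 3 above is the one manual (Excel) step. For reference, a minimal sketch of the same computation in Python, assuming the biom-format >= 2.1 API (a later API than the one these QIIME jobs used) and the file names from steps 1 and 4:

# Flag OTUs whose summed count across the negative controls is >= 0.5%
# of the total negative-control sequence count (step 3 above).
from biom import load_table

controls = load_table('forehead_ref_only_otu_table_control.biom')
otu_sums = controls.sum(axis='observation')  # per-OTU count summed over all controls
total = controls.sum()                       # total negative-control sequences

with open('forehead_CR_otus_to_remove_0.5.txt', 'w') as out:
    for otu_id, count in zip(controls.ids(axis='observation'), otu_sums):
        if count / total >= 0.005:           # the 0.5% threshold
            out.write('%s\n' % otu_id)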

floresg commented 11 years ago

Link to direct PCR paper http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0044563

antgonza commented 11 years ago

I have a concern and an idea about this.

Concern: These analyses are based on relative abundance, so removing one group or several groups of bacteria will shift the overall distribution of the sample(s). I think that if we do this we should also remove part of the other groups to keep things even. Is that right, or am I missing something?

Idea: We could use this dataset to see how important the filtering is; maybe without the filtering we would get similar alpha/beta/gamma(?) results. As far as I can tell this was not done in the paper, right?

gregcaporaso commented 11 years ago

Concern: These analyses are based on relative abundance, so removing one group or several groups of bacteria will shift the overall distribution of the sample(s). I think that if we do this we should also remove part of the other groups to keep things even. Is that right, or am I missing something?

I'm not sure I'm following this - doesn't removing those groups simply renormalize the abundances of the others, as if only the non-removed groups had been observed?
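To make that renormalization point concrete, here is a toy example (hypothetical counts, not SMP data) showing that removing a contaminant OTU leaves the remaining OTUs in the same proportion to each other:

import numpy as np

# Hypothetical sample: 50 contaminant reads, 30 reads of OTU_A, 20 of OTU_B.
counts = np.array([50.0, 30.0, 20.0])

before = counts / counts.sum()          # [0.5, 0.3, 0.2]
after = counts[1:] / counts[1:].sum()   # [0.6, 0.4] once the contaminant is dropped

# OTU_A : OTU_B stays 3:2 in both cases; only the contaminant's share is redistributed.
print(before[1] / before[2], after[0] / after[1])  # 1.5 1.5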

Idea: We could use this dataset to see how important the filtering is; maybe without the filtering we would get similar alpha/beta/gamma(?) results. As far as I can tell this was not done in the paper, right?

I do like the idea of doing a comparison here to see how these groups affect our diversity estimates. It sounds like they have the potential to, based on Noah and Gilbert's description, but I think it is worth checking. Things we could check would be the correlation between beta diversity distance matrices with the Mantel test (compare_distance_matrices.py -m mantel) and correlations in alpha diversity (compare_alpha_diversity.py, although we're waiting on a bug fix in that one).
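For the beta diversity comparison, a minimal sketch of the Mantel test outside of QIIME, assuming scikit-bio (a later library; the QIIME script above performs the equivalent computation) and hypothetical distance matrix file names:

# Mantel test between the filtered and unfiltered beta diversity matrices.
from skbio import DistanceMatrix
from skbio.stats.distance import mantel

dm_filtered = DistanceMatrix.read('unweighted_unifrac_filtered.txt')
dm_unfiltered = DistanceMatrix.read('unweighted_unifrac_unfiltered.txt')

r, p, n = mantel(dm_filtered, dm_unfiltered, permutations=999)
print('Mantel r = %.3f, p = %.3f (n = %d samples)' % (r, p, n))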

antgonza commented 11 years ago

Concern: I'm not sure how this actually works, and that is what I'm trying to understand. Let's put it this way: this filtering is per sample and not on a per-study basis, right? My concern is that removing certain taxa from some samples might change the patterns in the filtered samples and change the resulting distribution, making the original (contaminated but correct?) study distribution different. Another way to see it: we have distribution_sample1 + distribution_sample2 + distribution_sample3 + ... = study_distribution, and removing something from the distributions of just sample1 and sample2 will change the study_distribution, since we are not removing similar things from all the samples. I think it would be fantastic to figure out how to do this, because currently in SourceTracker we just remove the contaminated samples from the study, as at this point we do not know how to correctly remove the taxa that make a sample from body site A look like it is from body site B.

Another thing that came to my mind on this issue is the assumption that the errors in the samples are random, which doesn't make sense to me, as this comes from a machine that is supposedly calibrated; we could estimate the error of the machine with a couple of runs and then use this info to clean the samples, right? Another way to put it: the machine is like a ruler that helps us measure microbial communities, and we know our ruler is not right, so we take a few measurements, then take the same measurement with another ruler that we trust, and then just remove that error from all our other measurements. In my perspective, having random error means that there is something else going on; not sure what, maybe processing?

Idea: That should work.

Sorry for the long text, but I want to understand the ideas behind this better.

floresg commented 11 years ago

I understand your concerns regarding the filtering of OTUs, but I don't see a way around it. As I mentioned on Friday, it is really only a problem for low-biomass samples (palm and forehead) or for samples where swabbing may have been inadequate. There are numerous skin samples that are mostly contaminant OTUs. If we did not filter the OTUs, these samples would make it through rarefaction and we would have skin samples dominated by Mycoplasma, which is not real. As for skewing the distribution of "real" OTUs, I will post some OTU tables (in Excel) later today with the counts of the negative control OTUs that were removed. As you will see, for real samples where sampling was adequate, not many sequences get removed. For others, several thousand get removed, and most of those don't make it past rarefaction. I honestly think doing the comparisons between filtered and unfiltered may be useful to satisfy your curiosity, but there is no way we can use the data without filtering.

antgonza commented 11 years ago

I have no idea how contamination happens in the laboratory or how the samples can have Mycoplasma, so I went to the source of all knowledge http://en.wikipedia.org/wiki/Mycoplasma and found it interesting that it is described as a common contaminant in the lab; any opinions on this? Obviously it would be nice to figure this out for future runs, right? Now, another solution is to drop those "contaminated" samples, as we do with SourceTracker, without having to filter out some OTUs and not others; at the end of the day those samples get dropped due to rarefaction, right?

floresg commented 11 years ago

They only get dropped through rarefaction if we subtract the OTUs.

gregcaporaso commented 11 years ago

I agree that the filtering of OTUs is essential here, but I would like to do the comparison to satisfy curiosity (I want to know which evaluations are affected and which aren't). Antonio, note that this is contamination by specific OTUs from the direct PCR kits (right, Gilbert?), not random sequencing error.

gregcaporaso commented 11 years ago

Gilbert, a couple of clarifications on how you performed this:

3). View the OTU tables and determine which OTUs to remove based on abundance (the sum across the negative controls must be greater than or equal to 0.5% of total negative control sequences)

Are you summing each OTU's count across all negative control samples (in other words, combining them all into a single "combined" sample), then dividing each OTU count in the "combined" sample by the total sequence count of the "combined" sample to compute the abundance of each OTU in the negative controls? I'd like to replicate this with a couple of QIIME or Python commands so it's easier to explain to others.

6). Remove sample IDs with fewer than 10K sequences from the mapping file.

This isn't being done any more, right (now that we have the master mapping file in Google Docs)?

floresg commented 11 years ago

Greg, in response to your questions: yes, I am summing them to generate a single "combined" sample and computing the 0.5% abundance from that. However, look at "Negative_otus_removed_all_sites.xlsx" in the "Issue_1_OTU_filtering" directory to see which negative control sample IDs were used. Some of the negative controls had duplicated sample IDs, and some were from the original week 1 extractions and were not used in the filtering. For each body habitat, there is a list of the sample IDs used.

For step 6, this has been done on the filtered OTU table that we have been using, which is available through GitHub. The OTU table I sent you last week (or the week before) does not have them or the positive controls filtered out.

gregcaporaso commented 11 years ago

I've posted some code for performing this filtering process. One modification that I made to @floresg's workflow is that the median abundance of each OTU is computed, and that value is compared to the min_abundance threshold. So, if you define your threshold as -a 0.05, an OTU will be flagged for filtering if its median abundance across the control samples is greater than or equal to 0.05.
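For anyone following along, a rough sketch of that median-based criterion (not the actual filter_control_otus.py code): compute each OTU's relative abundance within every control sample, take the median across those samples, and flag the OTU if the median meets the threshold.

import numpy as np

def flag_control_otus(counts, otu_ids, min_abundance=0.005):
    # counts: array of shape (n_otus, n_control_samples) of raw counts.
    # Convert each control sample to relative abundances (columns sum to 1).
    rel = counts / counts.sum(axis=0)
    # Median relative abundance of each OTU across the control samples.
    medians = np.median(rel, axis=1)
    return [otu_id for otu_id, m in zip(otu_ids, medians) if m >= min_abundance]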

I'm now going to run this on the unfiltered OTU tables in the repository to generate the list of OTU ids that should be excluded as determined by this script, and we can then compare that with the lists obtained previously.

gregcaporaso commented 11 years ago

I've run this code on the full OTU table now with the following command (code revision used is linked here and contains a description of how the filtering works):

python ./filter_control_otus.py -i /Users/caporaso/outbox/smp-otu-filter/closed_ref_otu_table.biom -o /Users/caporaso/outbox/smp-otu-filter/out/ -m /Users/caporaso/outbox/StudentMicrobiomeProject-map.tsv -a 0.005 -s "PersonalID:SwabBlank,NTC"

I'm not clear on why this was previously run on a per-body-site basis, but I may not be fully understanding the controls. @floresg, can you explain?

In this run, the OTUs to filter are these:

$ cat filtered_otus.txt 
#OTU ID Taxonomy (if any)
64550   [u'k__Bacteria', u'p__Proteobacteria', u'c__Betaproteobacteria', u'o__Burkholderiales', u'f__Comamonadaceae', u'g__Acidovorax', u's__']
92490   [u'k__Bacteria', u'p__Proteobacteria', u'c__Alphaproteobacteria', u'o__Sphingomonadales', u'f__Sphingomonadaceae', u'g__Novosphingobium', u's__']
209237  [u'k__Bacteria', u'p__Tenericutes', u'c__Mollicutes', u'o__Mycoplasmatales', u'f__Mycoplasmataceae', u'g__Mycoplasma', u's__CandidatusMycoplasmahaemobos']
128686  [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Alteromonadales', u'f__Idiomarinaceae', u'g__Idiomarina', u's__Idiomarinaloihiensis']
114673  [u'k__Bacteria', u'p__Tenericutes', u'c__Mollicutes', u'o__Mycoplasmatales', u'f__Mycoplasmataceae', u'g__Mycoplasma', u's__Mycoplasmawenyonii']
166908  [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Enterobacteriales', u'f__Enterobacteriaceae', u'g__', u's__']
262838  [u'k__Bacteria', u'p__Proteobacteria', u'c__Gammaproteobacteria', u'o__Enterobacteriales', u'f__Enterobacteriaceae', u'g__Serratia', u's__Serratiamarcescens']
101552  [u'k__Bacteria', u'p__Proteobacteria', u'c__Betaproteobacteria', u'o__Burkholderiales', u'f__', u'g__Aquabacterium', u's__']

floresg commented 11 years ago

@gregcaporaso The reason I ran this on a per-body-site basis is that each body habitat was run on a single Illumina lane using one batch of reagents. We were concerned that different batches of reagents might have different contaminants. With that said, if you compare the OTUs listed above with the ones I removed (in the Issue 1 folder on the home page), they are identical to the 8 removed from palms and foreheads and include the major players from gut and tongues. However, there are other OTUs from gut (n=1, ID 25543) and tongue (n=4, IDs 194243, 287789, 556492, and 593848) that are not removed.

gregcaporaso commented 11 years ago

OK, that makes sense, thanks for the explanation. Given that they're mostly the same, would it be worth doing this across body habitats, as I've done, so we're treating all of them the same? That would also make processing the full OTU table clearer.

floresg commented 11 years ago

With your filtering approach, the one OTU that is not removed from the gut may result in some samples having more than 10k sequences, which will change whether or not they are included. This OTU is a Pseudomonad. I don't think it will make a difference for tongue samples. Let me think about this and talk with Noah, and I'll get back to you later today.

floresg commented 11 years ago

OK, we can go with the filtering of the combined OTU table as @gregcaporaso has done; I looked at the gut samples and the Pseudomonas OTU is not an issue. Maybe it would be useful to add columns to the mapping file for the number of sequences pre-filtering, the number of sequences post-filtering, and the percent of sequences filtered out (a sketch of this is below)? If we do go with this OTU table, then everything will need to be redone (we would need to remove the alpha diversity values that are currently in the mapping file), including the time series filtering, as some samples may now have more than 10k sequences.
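That bookkeeping is easy to script; a rough sketch assuming pandas, the biom-format >= 2.1 API, and hypothetical file names for the pre- and post-filtering tables:

# Per-sample sequence counts before and after contaminant filtering,
# plus the percent of sequences removed, for merging into the mapping file.
import pandas as pd
from biom import load_table

pre = load_table('closed_ref_otu_table.biom')
post = load_table('closed_ref_otu_table_filtered.biom')

pre_counts = pd.Series(pre.sum(axis='sample'), index=pre.ids(axis='sample'))
post_counts = pd.Series(post.sum(axis='sample'), index=post.ids(axis='sample'))

summary = pd.DataFrame({'SeqsPreFiltering': pre_counts,
                        'SeqsPostFiltering': post_counts})
summary['PercentFiltered'] = 100 * (1 - summary['SeqsPostFiltering'] / summary['SeqsPreFiltering'])
# Join onto the mapping file by sample ID to add the three new columns.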

gregcaporaso commented 11 years ago

Yes, agreed. Given that we're re-generating the OTU tables, what are people's thoughts about re-picking OTUs against the new version of Greengenes?

antgonza commented 11 years ago

I do not see why not, other than the extra processing, but it might be worth it to get better taxonomic assignments.

gregcaporaso commented 11 years ago

That's what I'm thinking.

floresg commented 11 years ago

I am not very familiar with the new gg release, but what would it really add here? We already know that the difference between open-reference and closed-reference OTU picking is a few million sequences of varying taxonomic resolution (mostly unclassified bacteria) per body habitat, so I'm assuming the new gg release would fall somewhere in between. I am just afraid that re-picking OTUs will delay things a couple of weeks further, because you will then want to rerun the mislabeling identification process. What do you mean by "better tax assignments"? Are the taxonomic assignments of the OTUs from the previous gg release different in the new release?

floresg commented 11 years ago

After mulling it over, I think we should continue working with the existing OTU tables. I think the manual filtering I did gives the same results as Greg's automated approach, and re-picking OTUs will not add much value here, unless there is something about the new gg release that I don't know? The only difference between Greg's filtering and mine is that his was done on a combined OTU table and mine was done on a per-body-site basis.

gregcaporaso commented 11 years ago

The taxonomy was improved in the new Greengenes, and some sequences change their taxonomic assignment. It also contains nearly twice as many sequences, so it should result in fewer sequences failing to hit the reference, which should translate to more samples having greater than 10k sequences, and therefore more samples being included in the analysis.

As for the filtering, the other differences are that we now have tested, publicly accessible code to apply this process, so it's easy to re-run should we need to, and it's much easier to describe in the methods (i.e., "we ran this script, which you can find here: https://gist.github.com/4197809/bf1df4cae8f742777736ee08d3c1c4471753996a" versus a much longer description like the one in the first comment on this issue), so others can be confident that if they want to run it they're doing it the same way that we did.

It would delay things to re-do these steps, but not by a couple of weeks. If Rob's cluster is available for this, I could likely reproduce all of the analyses we've done so far this Friday.

What do others think about this?

rob-knight commented 11 years ago

I am willing to make the cluster available for this and agree that having a reproducible OTU table based on tested code is highly desirable.

Rob

floresg commented 11 years ago

OK, re-run everything against the new gg set and your automated negative control filtering. Before you do this, do you want to remove sample IDs in response to Issue #18? I can send a list ASAP.

gregcaporaso commented 11 years ago

OK, sounds good. It's rare that we actually want to do this, but I think we'll want to remove the samples for the individual with the paperwork issue and the duplicate sample IDs. Can you email me the list of those sample IDs and remove them from the mapping file on Google Docs?

Just to confirm, we're all agreed that we'll use the OTU tables that I generate in the re-run for all analyses? Just want to make sure that we're all on the same page because it's going to take me most of the day to re-run all of this.

floresg commented 11 years ago

I am working on the list now. Do you want it to include the sample IDs from ambiguous time points? The new OTU table will still be closed reference, correct?

gregcaporaso commented 11 years ago

Yes, still closed reference. Could you create a list of the ambiguous time points separately? It seems like we'd still want to include those in the non-time-series analyses.

floresg commented 11 years ago

@gregcaporaso I just noticed in the code above that for the PersonalID values you only included "SwabBlank" and "NTC" as the input values. There are also blanks labeled ntc and ExtBlank that should be included.

gregcaporaso commented 11 years ago

Good catch - thanks! So when I do the filtering on Friday, the filtering should be based on the PersonalID values ExtBlank, SwabBlank, NTC, and ntc? Is there a difference between NTC and ntc, or should we just map all of those to a single case for consistency?

floresg commented 11 years ago

There is no difference between ntc and NTC. There are some ExtBlank and SwabBlank IDs we may want to exclude; I'll get back to you on that tomorrow (Thursday).

floresg commented 11 years ago

Just remembered the original week 1 samples - do you want to include those when you re-pick OTUs?

gregcaporaso commented 11 years ago

Yes, I think we should include those in the OTU picking, but filter them to a separate OTU table and mapping file (we can make that a new tab in the Google Doc). I think that will provide some useful technical information (run-to-run variability) but is not biologically interesting, so there's no point in including them in the majority of the analyses. Agree?

gregcaporaso commented 11 years ago

OK, OTU picking is complete, so I am all set to start re-processing. The new OTU tables will be available tomorrow (GitHub still has the old data right now!).

Some notes on seq/sample summaries from the new run versus the old run (i.e., Greengenes 12_10 vs. Greengenes 4feb2011) are here. Briefly, the median on a per-body-site basis is ~300-500 sequences higher per sample, so this will likely rescue a few samples that were previously falling below our 10k seqs/sample threshold.