Open mlangill opened 9 years ago
@sowe9385 can you please take a look?
Sure. For 1.) taxonomy is purposely not included, since the output table includes negatives. I'll look more into 2.) and let you know when we find out why.
@sowe9385 Just because the table includes negatives why should the metadata not be included?
Well, it wouldn't make much sense to make taxa plots with negatives. There isn't a good solution for the negatives at the moment.
Although currently taxa plots in qiime might not work well with negative values, other tools may be used to construct coherent plots, and these would still want to access the taxonomy information.
Sorry for the late input. Should negatives be dropped? Also, @mlangill, do you have examples of those tools? Just wondering how to incorporate them into QIIME ...
I think the output should not be altered in any way, and the observation and sample data should be left as is. Leave it up to the user if they want to replace negative values with zeros or collapse by taxonomy levels. I don't think this function should be trying to guess what users might be doing with the data downstream.
Personally, I am just starting to explore these methods (CSS and deseq2) and compare them with rarefaction/subsampling to a similar number of sequences per sample.
Even though there are negative values this data can be used to generate PCA plots, bar plots, box plots, etc. in general programs like R or even Excel. I am also testing them out in STAMP.
Hope this clarifies things :)
I agree with @mlangill - the metadata that is associated with an OTU shouldn't change because of an OTU's count.
Negatives could always be filtered (e.g., with filter_otus_from_otu_table.py --min_count 0) if you wanted to use the tables with QIIME. The way this currently works seems to be making an unnecessary assumption about how the data will be used.
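For readers who do want to replace negative values with zeros before using a normalized table elsewhere, a minimal sketch in plain Python follows. It assumes the table values have already been read into a list of rows; the function name `clip_negatives` and the example values are hypothetical and not part of QIIME.

```python
def clip_negatives(rows):
    """Return a copy of a table (list of rows) with negative values set to 0.0."""
    return [[max(value, 0.0) for value in row] for row in rows]


# Hypothetical normalized values for two OTUs across three samples.
normalized = [[-1.2, 0.5, 2.0], [0.0, -0.3, 1.1]]
clipped = clip_negatives(normalized)
```

This leaves the input untouched and only changes the copy, so taxonomy and other metadata attached to the OTUs are unaffected by the choice.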
ok - we'll change this. However, please be aware that if you replace the negatives you neglect rare species, and will distort proportions in taxonomy plots/bar plots. Addition of a constant to the matrices is a worse option that is not mathematically justified - given that the values are log-like transformed.
DESeq was not necessarily developed with extremely sparse microbial data in mind, and is usually used with Euclidean distance metrics. Hence, it is not recommended to do more than PCoA/biplots/heatmaps with this data, and it would be a good idea to try more than one normalization technique and compare the results. Thanks for the feedback!
@sowe9385, would you be able to change this so taxonomy is always retained for OTUs (regardless of their count) and have a pull request submitted by Monday? We'd like to get this into QIIME 1.9.1. Please let us know ASAP if you'll be able to do this. Thank you!
I did this a while ago (add taxonomy on to DESeq output) - so issue is already fixed. Thanks.
Can you link the pull request here?
Think it is #1889 https://github.com/sowe9385/qiime/commit/559b5eea26a3f2e851c6c22a986bdde37eb29d64 'fix DESeq2 taxonomy issue'
@sowe9385, this doesn't seem to work for me with DESeq2. I start with this BIOM table, and run the following:
# confirm that taxonomy is present in the input file
$ biom summarize-table -i otu_table.biom
Num samples: 9
Num observations: 419
Total count: 1337
Table density (fraction of non-zero values): 0.168
Counts/sample summary:
Min: 146.0
Max: 150.0
Median: 149.000
Mean: 148.556
Std. dev.: 1.257
Sample Metadata Categories: None provided
Observation Metadata Categories: taxonomy
...
# normalize the table
$ normalize_table.py -i otu_table.biom -o normed-table.DESeq2.biom -a DESeq2
# no taxonomy is present in the output biom table
$ biom summarize-table -i normed-table.DESeq2.biom
Num samples: 9
Num observations: 419
Total count: 962
Table density (fraction of non-zero values): 1.000
Counts/sample summary:
Min: 85.673790999999994
Max: 128.280688
Median: 103.910
Mean: 106.891
Std. dev.: 13.465
Sample Metadata Categories: None provided
Observation Metadata Categories: None provided
Can you confirm this behavior or let me know what I'm doing wrong?
Also, using that same BIOM table with CSS, taxonomy is retained, but it is being modified to include NA in cases where the input has fewer taxonomic levels than the output. For example:
# "NA" is not present in the input
$ grep -c NA otu_table.biom
0
# normalize the table
$ normalize_table.py -i otu_table.biom -o normed-table.CSS.biom -a CSS
# "NA" is present in 42 OTUs taxonomy in the output
$ grep -c NA normed-table.CSS.biom
42
This code shouldn't be doing this. Is this coming from your code, in which case we'd like it to be updated to not do that, or from the R biom package, in which case we'll just deal with it for now?
What do you mean by 'fewer taxonomic levels than the output'? I'm pretty sure this is an R package issue unfortunately.
fewer taxonomic levels
For example, if you had two OTUs in your BIOM table with the taxonomic assignments:
OTU1: ['k__bac', 'p_cyano']
OTU2: ['k__bac']
the output BIOM table would have:
OTU1: ['k__bac', 'p_cyano']
OTU2: ['k__bac', 'NA']
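If the padded output has to be used as is, the trailing "NA" entries can be removed after the fact. The sketch below assumes the taxonomy has been loaded as a list of level strings per OTU, as in the example above; the helper name `strip_na_padding` is hypothetical.

```python
def strip_na_padding(taxonomy, pad="NA"):
    """Drop trailing pad entries from one taxonomy list."""
    trimmed = list(taxonomy)
    while trimmed and trimmed[-1] == pad:
        trimmed.pop()
    return trimmed


# Taxonomy as it appears in the padded output table.
padded = {"OTU1": ["k__bac", "p_cyano"], "OTU2": ["k__bac", "NA"]}
cleaned = {otu: strip_na_padding(levels) for otu, levels in padded.items()}
```

Only trailing entries are removed, so a genuine "NA" in the middle of a lineage (if one ever occurred) would be left in place.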
ok, thought so. Yes, definitely an R package issue.
@gregcaporaso and I investigated this more today and couldn't find a way to keep the correct formatting for taxonomy metadata in CSS- or DESeq2-normalized tables (the issue is with the underlying R packages used to perform the normalization). So for now, taxonomy metadata isn't included in DESeq2 tables, and CSS tables have taxonomy padded with "NA" as described above.
I added a description of a workaround to get the original taxonomy metadata added to a normalized table (https://github.com/biocore/qiime/pull/2030). Let's keep this issue open since the actual problem hasn't been solved.
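The general idea of copying taxonomy from the original table back onto a normalized one can be sketched as follows. This is not necessarily the workaround described in the pull request; it is an illustration that assumes both tables have been parsed from BIOM 1.0 JSON into dictionaries whose "rows" entries each carry an "id" and a "metadata" field. The function name `restore_taxonomy` and the example tables are hypothetical.

```python
import copy


def restore_taxonomy(original, normalized):
    """Copy 'taxonomy' from original rows onto normalized rows with matching ids."""
    taxonomy_by_id = {
        row["id"]: (row.get("metadata") or {}).get("taxonomy")
        for row in original["rows"]
    }
    restored = copy.deepcopy(normalized)  # leave the input table untouched
    for row in restored["rows"]:
        taxonomy = taxonomy_by_id.get(row["id"])
        if taxonomy is not None:
            metadata = dict(row.get("metadata") or {})
            metadata["taxonomy"] = list(taxonomy)
            row["metadata"] = metadata
    return restored


# Hypothetical, heavily trimmed tables: only the "rows" section is shown.
original = {"rows": [
    {"id": "OTU1", "metadata": {"taxonomy": ["k__bac", "p_cyano"]}},
    {"id": "OTU2", "metadata": {"taxonomy": ["k__bac"]}},
]}
normalized = {"rows": [
    {"id": "OTU1", "metadata": None},
    {"id": "OTU2", "metadata": None},
]}
restored = restore_taxonomy(original, normalized)
```

Matching on OTU id rather than row position means the sketch still works if normalization reorders the observations.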
@sowe9385 just pointed this out to me re: the padded 'NA's. I'll take a look at this and have a solution within the next few weeks.
This is fixed in metagenomeSeq version 1.11.11+.
When running normalize_table.py the "taxonomy" observation metadata is not properly encoded in the output file.
1) When using the -a DESeq2 option, the taxonomy data is not included at all in the output file.
2) When using the -a CSS option, the taxonomy data is broken into individual observation metadata for each taxonomy level: