Observation Metadata is lost or mangled by normalize_table.py

mlangill commented 9 years ago

When running normalize_table.py the "taxonomy" observation metadata is not properly encoded in the output file.

1) When using the -a DESeq2 option, the taxonomy data is not included at all in the output file.

2) When using the -a CSS option the taxonomy data is broken into individual observation metadata for each taxonomy level:

Observation Metadata Categories: taxonomy2; taxonomy3; taxonomy1; taxonomy6; taxonomy7; taxonomy4; taxonomy5
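For anyone stuck with a table already split this way, here is a minimal sketch of putting the levels back together. It assumes the per-OTU metadata is available as a plain Python dict and that the split keys follow the `taxonomyN` pattern shown above. The function name `merge_taxonomy_levels` is hypothetical and not part of QIIME or biom-format.

```python
import re


def merge_taxonomy_levels(metadata):
    """Collapse keys like taxonomy1..taxonomyN into one ordered 'taxonomy' list.

    `metadata` is assumed to be a plain dict of one OTU's observation metadata.
    Keys that do not match the taxonomyN pattern are copied through unchanged.
    """
    pattern = re.compile(r"^taxonomy(\d+)$")
    levels = []
    merged = {}
    for key, value in metadata.items():
        match = pattern.match(key)
        if match:
            # Remember the numeric suffix so levels can be ordered later.
            levels.append((int(match.group(1)), value))
        else:
            merged[key] = value
    if levels:
        ordered = sorted(levels, key=lambda pair: pair[0])
        merged["taxonomy"] = [value for _, value in ordered]
    return merged


if __name__ == "__main__":
    split = {"taxonomy2": "p__Cyanobacteria", "taxonomy1": "k__Bacteria"}
    print(merge_taxonomy_levels(split))
```

Sorting on the numeric suffix matters because the categories are not reported in level order.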

jairideout commented 9 years ago

@sowe9385 can you please take a look?

sowe9385 commented 9 years ago

Sure. For 1.) taxonomy is purposely not included, since the output table includes negatives. I'll look more into 2.) and let you know when we find out why.

mlangill commented 9 years ago

@sowe9385 Just because the table includes negatives, why should the metadata not be included?

sowe9385 commented 9 years ago

Well, it wouldn't make much sense to make taxa plots with negatives. There isn't a good solution for the negatives at the moment.

mlangill commented 9 years ago

Although currently taxa plots in qiime might not work well with negative values, other tools may be used to construct coherent plots, and these would still want to access the taxonomy information.

antgonza commented 9 years ago

Sorry for weighing in so late in the game. Should negatives be dropped? Also @mlangill do you have examples of those tools? Just wondering how to incorporate them into qiime ...

mlangill commented 9 years ago

I think the output should not be altered in any way, and the observation and sample metadata should be left as is. Leave it up to the user if they want to replace negative values with zeros or if they want to collapse by taxonomy levels. I don't think this function should be trying to guess what users might be doing with the data downstream.

Personally, I am just starting to explore these methods (CSS and deseq2) and compare them with rarefaction/subsampling to a similar number of sequences per sample.

Even though there are negative values this data can be used to generate PCA plots, bar plots, box plots, etc. in general programs like R or even Excel. I am also testing them out in STAMP.

Hope this clarifies things :)

gregcaporaso commented 9 years ago

I agree with @mlangill - the metadata that is associated with an OTU shouldn't change because of an OTU's count.

Negatives could always be filtered (e.g., with filter_otus_from_otu_table.py --min_count 0) if you wanted to use the tables with QIIME. The way this currently works seems to be making an unnecessary assumption about how the data will be used.
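Replacing negative values with zeros, the other user-side option raised earlier in the thread, could be sketched as follows. This assumes the normalized values have been pulled out as a list of rows of floats; `clip_negatives` is a hypothetical helper, not a QIIME function.

```python
def clip_negatives(matrix):
    """Return a copy of a list-of-rows matrix with negative values set to 0.0.

    The input is left unmodified. This is only an illustration of the
    "replace negatives with zeros" option, not something normalize_table.py does.
    """
    return [[value if value > 0 else 0.0 for value in row] for row in matrix]


if __name__ == "__main__":
    normalized = [[-1.5, 2.0], [0.0, -0.1]]
    print(clip_negatives(normalized))
```

Because this only touches the counts, the observation metadata can stay attached to every OTU regardless of its values.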

sowe9385 commented 9 years ago

ok - we'll change this. However, please be aware that if you replace the negatives you neglect rare species, and will distort proportions in taxonomy plots/bar plots. Addition of a constant to the matrices is a worse option that is not mathematically justified - given that the values are log-like transformed.

DESeq was not necessarily developed with extremely sparse microbial data in mind, and is usually used with Euclidean distance metrics. Hence, it is not recommended to do more than PCoA/biplots/heatmaps with this data, and it would be a good idea to try more than one normalization technique and compare the results. Thanks for the feedback!

gregcaporaso commented 9 years ago

@sowe9385, would you be able to change this so taxonomy is always retained for OTUs (regardless of their count) and have a pull request submitted by Monday? We'd like to get this into QIIME 1.9.1. Please let us know ASAP if you'll be able to do this. Thank you!

sowe9385 commented 9 years ago

I did this a while ago (add taxonomy on to DESeq output) - so issue is already fixed. Thanks.

gregcaporaso commented 9 years ago

Can you link that pull request here?

sowe9385 commented 9 years ago

Think it is #1889 https://github.com/sowe9385/qiime/commit/559b5eea26a3f2e851c6c22a986bdde37eb29d64 'fix DESeq2 taxonomy issue'

gregcaporaso commented 9 years ago

@sowe9385, this doesn't seem to work for me with DESeq2. I start with this BIOM table, and run the following:

# confirm that taxonomy is present in the input file
$ biom summarize-table -i otu_table.biom
Num samples: 9
Num observations: 419
Total count: 1337
Table density (fraction of non-zero values): 0.168

Counts/sample summary:
 Min: 146.0
 Max: 150.0
 Median: 149.000
 Mean: 148.556
 Std. dev.: 1.257
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy
...

# normalize the table
$ normalize_table.py -i otu_table.biom -o normed-table.DESeq2.biom -a DESeq2

# no taxonomy is present in the output biom table
$ biom summarize-table -i normed-table.DESeq2.biom
Num samples: 9
Num observations: 419
Total count: 962
Table density (fraction of non-zero values): 1.000

Counts/sample summary:
 Min: 85.673790999999994
 Max: 128.280688
 Median: 103.910
 Mean: 106.891
 Std. dev.: 13.465
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Can you confirm this behavior or let me know what I'm doing wrong?

gregcaporaso commented 9 years ago

Also, using that same BIOM table with CSS, taxonomy is retained, but it is being modified to include NA in cases where the input has fewer taxonomic levels than the output. For example:

# "NA" is not present in the input
$ grep -c NA otu_table.biom
0

# normalize the table
$ normalize_table.py -i otu_table.biom -o normed-table.CSS.biom -a CSS

# "NA" is present in 42 OTUs taxonomy in the output
$ grep -c NA normed-table.CSS.biom
42

This code shouldn't be doing this. Is this coming from your code, in which case we'd like it to be updated to not do that, or from the R biom package, in which case we'll just deal with it for now?

sowe9385 commented 9 years ago

What do you mean by 'fewer taxonomic levels than the output'? I'm pretty sure this is an R package issue unfortunately.

gregcaporaso commented 9 years ago

"fewer taxonomic levels"

For example, if you had two OTUs in your BIOM table with the taxonomic assignments:

OTU1: ['k__bac', 'p_cyano']
OTU2: ['k__bac']

the output BIOM table would have:

OTU1: ['k__bac', 'p_cyano']
OTU2: ['k__bac', 'NA']
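If the padding has to be undone on the user side, a minimal sketch is to drop trailing "NA" entries from each taxonomy list. This assumes the padding only ever appears at the end of the list, as in the example above; `strip_na_padding` is a hypothetical helper name.

```python
def strip_na_padding(taxonomy, pad="NA"):
    """Drop trailing pad entries from a taxonomy list.

    Only entries at the end of the list are removed, so a pad value that
    appears between real levels is left alone.
    """
    trimmed = list(taxonomy)
    while trimmed and trimmed[-1] == pad:
        trimmed.pop()
    return trimmed


if __name__ == "__main__":
    print(strip_na_padding(["k__bac", "NA"]))
    print(strip_na_padding(["k__bac", "p_cyano"]))
```

Note that this cannot distinguish padding from an input that legitimately ended in the string "NA", so it is only safe when the input is known not to contain it (as confirmed with grep above).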

sowe9385 commented 9 years ago

ok, thought so. Yes, definitely an R package issue.

jairideout commented 9 years ago

@gregcaporaso and I investigated this more today and couldn't find a way to keep the correct formatting for taxonomy metadata in CSS- or DESeq2-normalized tables (the issue is with the underlying R packages used to perform the normalization). So for now, taxonomy metadata isn't included in DESeq2 tables, and CSS tables have taxonomy padded with "NA" as described above.

I added a description of a workaround to get the original taxonomy metadata added to a normalized table (https://github.com/biocore/qiime/pull/2030). Let's keep this issue open since the actual problem hasn't been solved.
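The general idea of carrying taxonomy over from the original table can be sketched in plain Python. This is not the text of the workaround in the linked pull request, only an illustration under stated assumptions: the original metadata is a dict keyed by OTU ID, normalization keeps OTU IDs unchanged, and `restore_taxonomy` is a hypothetical helper name.

```python
def restore_taxonomy(original_md, normalized_ids):
    """Look up each normalized OTU's taxonomy in the original table's metadata.

    original_md: dict mapping OTU ID -> {"taxonomy": [...]} (assumed layout).
    normalized_ids: iterable of OTU IDs present in the normalized table.
    Returns a new dict {otu_id: {"taxonomy": [...]}} for the normalized OTUs.
    """
    restored = {}
    for otu_id in normalized_ids:
        if otu_id not in original_md:
            # Fail loudly rather than silently dropping metadata again.
            raise KeyError(f"OTU {otu_id!r} not found in original table")
        # Copy the list so later edits do not alias the original metadata.
        restored[otu_id] = {"taxonomy": list(original_md[otu_id]["taxonomy"])}
    return restored


if __name__ == "__main__":
    original = {
        "OTU1": {"taxonomy": ["k__bac", "p_cyano"]},
        "OTU2": {"taxonomy": ["k__bac"]},
    }
    print(restore_taxonomy(original, ["OTU2", "OTU1"]))
```

The returned dict of dicts keyed by observation ID is, as far as I know, the shape that the biom-format Python package's `Table.add_metadata(..., axis='observation')` accepts; treat that as an assumption to check against the installed version.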

jnpaulson commented 9 years ago

@sowe9385 Just pointed this out to me re: the padded 'NA's. I'll take a look at this and have a solution w/in the next few weeks.

jnpaulson commented 9 years ago

This is fixed in metagenomeSeq version 1.11.11+.
