normalize_table.py bug/doc fixes #1810

jairideout commented 9 years ago

The following changes need to be made to normalize_table.py prior to the 1.9.0 release:

[x] The script description says to test for OTU correlation, but the script output description says not to do this. The documentation should be updated to clarify which downstream analyses are appropriate and which are not.
[x] The R package and R script use DESeq2, but the QIIME script references DESeq in its documentation. Is this correct?
[x] The QIIME 454 overview tutorial OTU table (https://github.com/biocore/qiime/blob/master/qiime_test_data/beta_diversity/otu_table.biom) fails with DESeq but works with CSS (normalize_table.py -i otu_table.biom -a DESeq -o DESeq_normalized_otu_table.biom). We need a script usage example for DESeq, and we need to figure out why DESeq doesn't work with this table. Update: now that DESeq works with this table, we need a new unit test that exercises this case.
[x] When the script fails, a valid BIOM table is still created as output. If the script fails, it should not create output (especially output that looks valid).
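
The last item could be approached roughly as follows. This is only a minimal sketch, not QIIME's actual implementation: `run_normalizer` is a hypothetical helper, and it assumes the external command accepts its output path as its final argument. The command writes to a temporary location, and the file is moved to the requested output path only when the exit status is 0.

```python
import os
import shutil
import subprocess
import tempfile


def run_normalizer(command, out_path):
    """Run `command` (a list of arguments) so out_path exists only on success.

    Hypothetical helper for illustration: a temporary output path is appended
    to the command, and the result is moved into place only if the command
    exits with status 0.
    """
    tmp_dir = tempfile.mkdtemp()
    tmp_out = os.path.join(tmp_dir, "output.biom")
    try:
        status = subprocess.call(command + [tmp_out])
        if status != 0:
            raise RuntimeError("command failed with exit status %d" % status)
        shutil.move(tmp_out, out_path)
    finally:
        # Remove the temporary directory (and any partial output) in all cases.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With this pattern, a failing run leaves nothing at the final output path, so a partial table cannot be mistaken for a valid result.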

@sowe9385 can you please work on this?

@gregcaporaso and I will continue adding to this list (we're still testing out this new script).

cc @antgonza

sowe9385 commented 9 years ago

OK, please remove the bit about not testing for OTU correlations with this script.
- Yes, we are using DESeq2, but we are using the variance stabilization method, which was originally implemented in DESeq.
- Please add 'floor' and 'ceiling' to the DESeq2.r script (as in the other email); it will fix the issue.
- Yes, if the script fails we should indeed remove the temporary JSON file.

Sophie


gregcaporaso commented 9 years ago

@sowe9385, we're going to need your help to work on this as we don't feel comfortable making substantial edits as we're not familiar with this code. We're planning to release a "release candidate" today, after which no new functionality will be going into QIIME before 1.9.0, but where documentation can be updated, tests can be added, and bugs can be fixed. We'd like to have you work on making these changes there, once that goes out.

These changes would need to be in by January 6th. Does that sound doable? We can help by email in the meantime.

Note that we still need test_differential_abundance.py to pass before we can build our release candidate - that is a separate issue.

sowe9385 commented 9 years ago

yes, sounds doable. thanks.


gregcaporaso commented 9 years ago

Thank you!

alk224 commented 9 years ago

I have 1.9.0rc1 installed and tried to run this script. I'm not there yet as it wanted some more stuff installed within R:

install.packages("optparse")
source("http://bioconductor.org/biocLite.R")
biocLite("metagenomeSeq")

Will report back when all this installation completes (is taking a while).

...

Still failing. At the end of the installation from within R, metagenomeSeq and what are presumably its dependencies "had non-zero exit status." I ran R with sudo, so it shouldn't have been anything permission-related.

After reading the "Waste Not, Want Not" paper, I tried installing metagenomeSeq on my production computer (Linux). I never got it working properly. I have a Windows computer right next to it, and after two days of failures on Linux I tried it there. Everything worked without issue. No idea why. I really want this on the Linux system, as moving OTU tables out of QIIME (converted to txt and put on a USB drive to carry to the other computer) is very time-consuming.

Just to be clear, the metagenomeSeq error relates only to the CSS normalization. I get a similar error when passing -a DESeq (library DESeq2 not found).

jairideout commented 9 years ago

Thanks for these details @alk224! Note that several R packages are required by some of QIIME's scripts (normalize_table.py, differential_abundance.py, compare_categories.py, etc.). See this section of our install docs for the R commands you'll need to run to get all of these packages installed.

Regarding the metagenomeseq installation error, can you please post the full error message that you received? Also, what version of R do you have installed?

alk224 commented 9 years ago

I have R 3.1.2. Ran through the R installation section you indicated (I thought this was part of the pip install).

1) install.packages(c('ape', 'biom', 'optparse', 'RColorBrewer', 'randomForest', 'vegan'))
2) source('http://bioconductor.org/biocLite.R')
3) biocLite(c('DESeq2', 'metagenomeSeq'))
4) q()

During step 3, I got the following errors:

Cannot find xml2-config
ERROR: configuration failed for package ‘XML’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/XML’
ERROR: dependency ‘XML’ is not available for package ‘annotate’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/annotate’
ERROR: dependency ‘XML’ is not available for package ‘gridSVG’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/gridSVG’
ERROR: dependencies ‘annotate’, ‘XML’ are not available for package ‘GSEABase’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/GSEABase’
ERROR: dependency ‘annotate’ is not available for package ‘genefilter’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/genefilter’
ERROR: dependency ‘annotate’ is not available for package ‘geneplotter’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/geneplotter’
ERROR: dependencies ‘GSEABase’, ‘genefilter’, ‘annotate’ are not available for package ‘Category’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/Category’
ERROR: dependencies ‘gridSVG’, ‘XML’, ‘Category’ are not available for package ‘interactiveDisplay’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/interactiveDisplay’
ERROR: dependencies ‘genefilter’, ‘geneplotter’ are not available for package ‘DESeq2’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/DESeq2’
ERROR: dependency ‘interactiveDisplay’ is not available for package ‘metagenomeSeq’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/metagenomeSeq’

Looks like most things are tied to a failure about an xml package. Will try to manually install just that.
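
For long install logs like this one, the root cause can be separated from the cascade of dependent failures with a small parser. The sketch below is only illustrative: `find_root_failures` is a made-up helper, and the patterns assume the exact message wording and curly quotes shown in the log above.

```python
import re

# Patterns assume the message wording and curly quotes printed in the log above.
CONFIG_FAILED = re.compile(r"ERROR: configuration failed for package ‘([^’]+)’")
MISSING_DEPS = re.compile(
    r"ERROR: dependenc(?:y|ies) (.+?) (?:is|are) not available for package ‘([^’]+)’"
)


def find_root_failures(log_text):
    """Return (root_failures, blocked) parsed from R installation output.

    root_failures: packages whose own configuration step failed.
    blocked: dict mapping each package to the dependencies it was missing.
    """
    root_failures = CONFIG_FAILED.findall(log_text)
    blocked = {}
    for deps, package in MISSING_DEPS.findall(log_text):
        blocked[package] = re.findall(r"‘([^’]+)’", deps)
    return root_failures, blocked
```

Applied to the log above, the only package reported as failing on its own would be XML; every other entry is blocked, directly or indirectly, by that one failure.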

alk224 commented 9 years ago

running install.packages("XML") gives the following:

Cannot find xml2-config
ERROR: configuration failed for package ‘XML’
removing ‘/home/andy/R/x86_64-pc-linux-gnu-library/3.1/XML’

I fixed this by installing the XML package from the Ubuntu repository:

sudo apt-get install r-cran-xml

normalize_table.py now tries to run, but it errors with the same problem whether I pass CSS or DESeq:

normalize_table.py -i raw_otu_table_min1000.biom -o DESeq_normalized_table.biom -a DESeq
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2499: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
  VisibleDeprecationWarning)
Traceback (most recent call last):
  File "/usr/local/bin/normalize_table.py", line 125, in <module>
    main()
  File "/usr/local/bin/normalize_table.py", line 116, in main
    normalize_DESeq(input_path, out_path, DESeq_negatives_to_zero)
  File "/usr/local/lib/python2.7/dist-packages/qiime/normalize_table.py", line 80, in normalize_DESeq
    run_DESeq(json_infile, out_path, DESeq_negatives_to_zero)
  File "/usr/local/lib/python2.7/dist-packages/qiime/normalize_table.py", line 114, in run_DESeq
    app_result = rsl(command_args=command_args, script_name='DESeq2.r')
  File "/usr/local/lib/python2.7/dist-packages/qiime/util.py", line 2006, in __call__
    % (''.join(open(errfilepath, 'r').readlines())))
burrito.util.ApplicationError: Unacceptable application exit status: 1, command: cd "/media/sf_VM_Shared/fungi_rdp_test/open_reference_output/"; R --slave --args --source_dir /usr/local/lib/python2.7/dist-packages/qiime/support_files/R -i DESeq_normalized_table_json.biom -o DESeq_normalized_table.biom < /usr/local/lib/python2.7/dist-packages/qiime/support_files/R/DESeq2.r
Program output:

Error in `colnames<-`(`*tmp*`, value = c("r", "c", "data")) :
  'names' attribute [3] must be the same length as the vector [0]
Calls: DESeq2 ... biom_data -> biom_data -> biom_data -> biom_data -> colnames<-
Execution halted

I'm not really sure how to address this error.

alk224 commented 9 years ago

This Stack Overflow post may be relevant to debugging this error:

http://stackoverflow.com/questions/22208064/error-in-colnames-tmp-attempt-to-set-colnames-on-an-object-with-les

I'd guess there is a problem with the biom import step. I ran through this process manually in R and it works OK, though I had to manually add my taxonomy field back to my biom table, so I think there are potential format issues here.

alk224 commented 9 years ago

One other tidbit: the instructions for metagenomeSeq from the QIIME forum (https://groups.google.com/forum/#!searchin/qiime-forum/cumnorm/qiime-forum/g-qnFj5e96U/ZcSqoLUehgAJ -- see Carly, third response) work perfectly in R on Windows, but I never got them working on Ubuntu. Perhaps it is a Linux-specific problem. I'll try installing some of these things directly in R, as described on the QIIME install page, and perhaps metagenomeSeq will begin to cooperate on Linux.

alk224 commented 9 years ago

I managed to generate the same error manually using Carly's instructions:

setwd("/media/sf_VM_Shared/fungi_rdp_test/open_reference_output")
library("metagenomeSeq")
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:stats’:

xtabs

The following objects are masked from ‘package:base’:

anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
do.call, duplicated, eval, evalq, Filter, Find, get, intersect,
is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rep.int,
rownames, sapply, setdiff, sort, table, tapply, union, unique,
unlist, unsplit

Welcome to Bioconductor

Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: limma

Attaching package: ‘limma’

The following object is masked from ‘package:BiocGenerics’:

plotMA

Loading required package: interactiveDisplay
Loading required package: grid
Loading required package: RColorBrewer

library(biom)
mybiom = load_biom("raw_otu_table_min1000.biom")
Error in `colnames<-`(`*tmp*`, value = c("r", "c", "data")) :
  'names' attribute [3] must be the same length as the vector [0]

alk224 commented 9 years ago

I emailed the phyloseq developers to see if they have any insight. In the meantime, could it be a biom version issue? If qiime is using an older version of biom, perhaps the new version is being loaded by R and causing some kind of incompatibility? Just a thought. Will let you know if the phyloseq people have any useful insights.
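
One quick way to test the format hypothesis is to check whether a given .biom file is JSON (BIOM 1.0) or HDF5 (BIOM 2.x) before handing it to R. The sketch below is only a diagnostic aid, not a confirmed explanation of the error: `sniff_biom_format` is a hypothetical helper, and it assumes the HDF5 signature sits at the very start of the file.

```python
def sniff_biom_format(path):
    """Guess whether a BIOM file is HDF5 (BIOM 2.x) or JSON (BIOM 1.0).

    Assumes an HDF5 file starts with the standard 8-byte HDF5 signature and
    a JSON file starts with an opening brace (after optional whitespace).
    """
    hdf5_magic = b"\x89HDF\r\n\x1a\n"
    with open(path, "rb") as handle:
        head = handle.read(len(hdf5_magic))
    if head == hdf5_magic:
        return "hdf5"
    if head.lstrip().startswith(b"{"):
        return "json"
    return "unknown"
```

If the file that fails to load in R and the file that loads correctly turn out to be in different formats, that would support the version-mismatch idea; if both are in the same format, the cause is more likely elsewhere.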

alk224 commented 9 years ago

My configuration was identical to that of the phyloseq developer I corresponded with. One reboot later, I was able to push my OTU table through the metagenomeSeq command. Then I tried the normalize_table.py command, and it runs without error for both CSS and DESeq. However, the table summaries suggest something strange is happening, at least when DESeq is passed (all values are negative). Here are summaries for the raw table (make_otu_table result), then the table I pushed through Carly's workflow (also odd results, very inflated values), then normalize_table.py -a CSS, then normalize_table.py -a DESeq:

andy@Ubuntu14:~$ cat raw_otu_table_min100.summary
Num samples: 26
Num observations: 732
Total count: 8316
Table density (fraction of non-zero values): 0.066

Counts/sample summary:
 Min: 179.0
 Max: 483.0
 Median: 307.500
 Mean: 319.846
 Std. dev.: 69.087
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
 4N.November2013.D: 179.0
 6B.November2013.D: 182.0
 3N.November2013.D: 244.0
 3S.November2013.D: 248.0
 1Z.November2013.D: 266.0
 7N.November2013.D: 274.0
 2N.November2013.D: 277.0
 2Z.November2013.D: 289.0
 7B.November2013.D: 293.0
 6Z.November2013.D: 294.0
 2S.November2013.D: 303.0
 4S.November2013.D: 303.0
 4B.November2013.D: 305.0
 3Ba.November2013.D: 310.0
 1B.November2013.D: 327.0
 5N.November2013.D: 341.0
 7Z.November2013.D: 349.0
 1N.November2013.D: 351.0
 3Z.November2013.D: 354.0
 5Z.November2013.D: 355.0
 5S.November2013.D: 364.0
 6S.November2013.D: 385.0
 6N.November2013.D: 399.0
 7S.November2013.D: 410.0
 5B.November2013.D: 431.0
 2B.November2013.D: 483.0

andy@Ubuntu14:~$ cat norm_otu_table_min100.summary
Num samples: 26
Num observations: 732
Total count: 293432
Table density (fraction of non-zero values): 0.066

Counts/sample summary:
 Min: 7512.8205128205473
 Max: 20058.823529411959
 Median: 10617.647
 Mean: 11285.864
 Std. dev.: 2801.039
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 "7B.November2013.D": 7512.820512820547
 "5S.November2013.D": 8088.8888888889105
 "7N.November2013.D": 8562.500000000047
 "1Z.November2013.D": 8580.645161290364
 "3Ba.November2013.D": 8857.142857142891
 "4B.November2013.D": 9531.250000000053
 "3S.November2013.D": 9538.461538461586
 "1B.November2013.D": 9617.647058823575
 "6N.November2013.D": 9731.707317073207
 "2N.November2013.D": 9892.857142857198
 "4S.November2013.D": 10100.000000000051
 "3Z.November2013.D": 10114.28571428575
 "4N.November2013.D": 10529.411764705983
 "6B.November2013.D": 10705.882352941273
 "6S.November2013.D": 11000.000000000047
 "3N.November2013.D": 11090.909090909177
 "2Z.November2013.D": 11115.384615384686
 "7Z.November2013.D": 11258.064516129096
 "1N.November2013.D": 11700.000000000062
 "5Z.November2013.D": 12241.379310344888
 "6Z.November2013.D": 12782.608695652267
 "2B.November2013.D": 15093.750000000122
 "2S.November2013.D": 15150.000000000115
 "7S.November2013.D": 15185.185185185277
 "5B.November2013.D": 15392.857142857225
 "5N.November2013.D": 20058.82352941196

andy@Ubuntu14:~$ cat CSS_normalized_table.summary
Num samples: 26
Num observations: 732
Total count: 7816
Table density (fraction of non-zero values): 0.066

Counts/sample summary:
 Min: 155.01939999999999
 Max: 462.7287
 Median: 297.954
 Mean: 300.619
 Std. dev.: 68.724
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
 3S.November2013.D: 155.0194
 5N.November2013.D: 195.7801
 4N.November2013.D: 200.33089999999999
 6B.November2013.D: 203.5075
 6Z.November2013.D: 228.20059999999998
 3N.November2013.D: 252.92059999999998
 7Z.November2013.D: 275.9481
 2S.November2013.D: 279.6786
 7B.November2013.D: 280.3367
 5Z.November2013.D: 281.88890000000004
 2Z.November2013.D: 291.99989999999997
 1N.November2013.D: 292.00879999999995
 1B.November2013.D: 295.0914
 7N.November2013.D: 300.8166
 2B.November2013.D: 306.7103
 5B.November2013.D: 317.01800000000003
 3Z.November2013.D: 320.0203
 4B.November2013.D: 324.3599
 2N.November2013.D: 331.31609999999995
 4S.November2013.D: 331.93280000000004
 7S.November2013.D: 351.28549999999996
 3Ba.November2013.D: 358.975
 1Z.November2013.D: 370.8234
 6S.November2013.D: 401.3093
 5S.November2013.D: 406.08849999999995
 6N.November2013.D: 462.7287

andy@Ubuntu14:~$ cat DESeq_normalized_table.summary
Num samples: 26
Num observations: 732
Total count: -8121
Table density (fraction of non-zero values): 1.000

Counts/sample summary:
 Min: -363.60080999999991
 Max: -238.50143999999997
 Median: -313.820
 Mean: -312.370
 Std. dev.: 30.116
 Sample Metadata Categories: None provided
 Observation Metadata Categories: None provided

Counts/sample detail:
 4N.November2013.D: -363.6008099999999
 3S.November2013.D: -363.40896999999995
 6B.November2013.D: -358.38568
 5N.November2013.D: -355.87614999999994
 6Z.November2013.D: -345.02108999999996
 3N.November2013.D: -338.97718
 2S.November2013.D: -327.36753999999996
 5Z.November2013.D: -320.20921999999996
 2Z.November2013.D: -319.66729
 1N.November2013.D: -316.61658
 1B.November2013.D: -316.5354
 7Z.November2013.D: -314.8863
 7N.November2013.D: -314.59173999999996
 7B.November2013.D: -313.04855
 2N.November2013.D: -305.85103
 5B.November2013.D: -304.18433999999996
 4S.November2013.D: -303.09716
 4B.November2013.D: -303.03722999999997
 3Z.November2013.D: -302.45805999999993
 2B.November2013.D: -297.76669999999996
 7S.November2013.D: -290.74272999999994
 1Z.November2013.D: -289.53297
 3Ba.November2013.D: -289.43957
 6S.November2013.D: -267.09531999999996
 5S.November2013.D: -261.72846
 6N.November2013.D: -238.50143999999997

sowe9385 commented 9 years ago

Thanks for your work on this @alk224! I don't think your DESeq results are that unusual, given the extremely low library size and sparsity of your data set. DESeq was developed more for RNA-Seq data, which usually is much less sparse. Also keep in mind that DESeq uses a log2-like transformation, following division by a few numbers, and could be prone to sparsity artifacts.

Due to your low library size, I would be cautious in interpreting these results, and re-run them if possible. Taxonomy was not added to the DESeq-transformed tables because, given the negatives that result, it would not make sense to do taxonomy plots. DESeq-normalized tables are primarily for usage with PCoA plots and Euclidean metrics.
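
To illustrate why a log2-like transformation can produce negative values on sparse, low-count data, here is a toy sketch. It is not the DESeq2 variance-stabilizing transformation: the pseudocount and size factor are made-up values, and `toy_log2_transform` and `negatives_to_zero` are hypothetical helpers (the latter only guesses at what an option like the `DESeq_negatives_to_zero` argument seen in the traceback might do).

```python
import math


def toy_log2_transform(counts, size_factor, pseudocount=0.5):
    """Return log2((count + pseudocount) / size_factor) for each count.

    Toy stand-in for a log2-like normalization; NOT the DESeq2 method.
    """
    return [math.log2((c + pseudocount) / size_factor) for c in counts]


def negatives_to_zero(values):
    """Replace every negative value with zero."""
    return [max(v, 0.0) for v in values]


# A sparse sample: mostly zeros, one singleton, one moderately abundant OTU.
sparse_sample = [0, 0, 0, 1, 0, 12, 0, 0]
transformed = toy_log2_transform(sparse_sample, size_factor=2.0)

print(transformed)                     # zeros and the singleton become negative
print(sum(transformed))                # the sample total is negative
print(negatives_to_zero(transformed))  # only the abundant OTU stays non-zero
```

In this toy case every zero count maps to a negative number, so the sample total goes negative and no entry is exactly zero any more, which mirrors the negative totals and the table density of 1.000 in the DESeq summary above.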

jairideout commented 9 years ago

Closing this issue as the fixes were addressed in #1821.
