Closed danchurch closed 8 years ago
Just to check, when you indicate the files produced are nonsense and not human-readable, are they being output as BIOM 2.1? It's built on top of HDF5 and is binary. Are you receiving any error output?
On Sat, Jun 25, 2016 at 7:39 PM, danchurch notifications@github.com wrote:
I am doing an ITS region fungal study. I used the usearch v8.1 pipeline, without any qiime scripts, to get me to the point of a biom table with taxonomic assignments. As far as I can tell, the json biom table that was outputted by usearch is not readable by any downstream applications, including biom, either python scripts or in R environment. I believe this may be due to the structure of the taxonomic metadata. So I edited them to try to match examples on the biom website. However, the taxonomic info is still not reading correctly.
The original output from usearch 8.1:
""" { "id":"otu_95_sub.biom", "format": "Biological Observation Matrix 1.0", "format_url": "http://biom-format.org", "generated_by": "usearch8.11803", "type": "OTU table", "date": "Thu Jun 23 20:38:04 2016", "matrix_type": "sparse", "matrix_element_type": "float", "shape": [767176,98], "rows":[ {"id":"OTU1575410:M01498:244:000000000-ANT97:1:1101:18025:5338:160", "metadata":{"taxonomy":"d:Fungi(0.9991),p:Ascomycota(0.5394),c:Sordariomycetes(0.3516),o:Xylariales(0.1885),f:Xylariaceae(0.0879),g:Xylaria(0.0105)"}},
"""
After my scripting, rows look like this:
""" "shape": [767176,98], "rows":[ {"id":"OTU1575410", "metadata":{"taxonomy": ["kFungi", "pAscomycota", "cSordariomycetes", "oXylariales", "fXylariaceae", "gXylaria"]}},
"""
I believe all brackets have been properly closed. Biom will happily convert to a classical OTU table from this for me when I do not include taxonomic info, but every other biom command generates a nonsense file, not human-readable, and not readable by phyloseq. No error message is returned.
Let me know if this row format seems erroneous, please. If it does not, I will look elsewhere for the source of the error. If you are curious, I am keeping a log of my pipeline on my own github account.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/biocore/biom-format/issues/701, or mute the thread https://github.com/notifications/unsubscribe/AAc8skXcampemtMvPRqnOJxAQpLSOQ9_ks5qPeZugaJpZM4I-eO- .
Hmm. Not sure. That would explain a lot, but I still can't seem to get anything to read in phyloseq or R-environment biom. When I use python biom scripts to convert or add metadata, I do not get error messages:
biom convert -i otu_95_sub_relab2.biom -o otu_95_sub_relab2.HDF5.biom --to-hdf5
As mentioned above, output is not human readable, but no error messages. When this file is then passed to phyloseq or biom in R, R crashes with a long error message.
Both
aa <- read_biom('otu_95_sub_relab2.HDF5.biom')
and
bb <- import_biom('otu_95_sub_relab2.HDF5.biom')
Produce the following (apologies for large mass of text):
* caught segfault * address 0x7ffdd18c00f8, cause 'memory not mapped'
Traceback: 1: .Call("_H5Dread", h5dataset@ID, sidFile, sidMem, buf, compoundAsDataFrame, bit64conv, PACKAGE = "rhdf5") 2: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame, ...) 3: doTryCatch(return(expr), name, parentenv, handler) 4: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 5: tryCatchList(expr, classes, parentenv, handlers) 6: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))}) 7: try({ obj <- H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame, ...)}) 8: h5readDataset(h5dataset, index = index, start = start, stride = stride, block = block, count = count, compoundAsDataFrame = compoundAsDataFrame, ...) 9: h5read(h5loc, L[[i]]$name, ...) 10: h5loadData(group, L[[i]], all = all, ...) 11: h5loadData(group, L[[i]], all = all, ...) 12: h5loadData(loc$H5Identifier, L, all = all, ...) 13: h5dump(gid, start = start, stride = stride, block = block, count = count, compoundAsDataFrame = compoundAsDataFrame, callGeneric = callGeneric, ...) 14: h5read(biom_file, "/", read.attributes = TRUE) 15: read_hdf5_biom(biom_file) 16: read_biom("otu_95_sub_relab2.HDF5.biom")
Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace Selection: 3 Warning message: In strsplit(msg, "\n") : input string 1 is invalid in this locale daniel@daniel-Satellite-S855:~/Documents/Taiwan_data/wood/woodbiom$
Perhaps this is just a memory problem. I am doing this on my personal machine while I wait for our computing cluster to update phyloseq and biom modules in R. Do you know if phyloseq will work with these new compiled hdf5 tables? From which version of phyloseq?
Does biom summarize-table
work on it?
cc @joey711 in the event this is related to phyloseq parsing of BIOM 2.1
See the FAQ post here:
I don't see versions of files or packages in your post, so hard for me to diagnose. Note that phyloseq has migrated to a new Bioconductor package called "biomformat". The "biom" package on CRAN should be considered deprecated, and I should have it taken down and point to biomformat.
If you google around there are also phyloseq issues tracker posts related to this as well. If this is an obscure incarnation of the biom-format definition, then I can't promise that biom-format will work right away. The typical QIIME output of biom-format-HDF5 definitely does.
Ideally you post a MRE that includes data to reproduce the error, and (assuming it still doesn't work) then we can diagnose and consider supporting/fixing this special case.
@ Joey711. Thank you for speaking up. Not asking for a diagnosis from you, and don't believe this is a special case, it is a usearch 8.1 json output, modified as above. Originally asked the biom folks to look at my taxonomic metadata formatting in a JSON table to see if they noticed any problems. Haven't received an answer on that, though folks have been amazingly helpful and the issue seems to have evolved a bit. For you specifically, I hadn't gotten to posting on your github yet, but here we are: Does phyloseq read biom HDF5 2.1 binary tables? If so, starting with which version?
Your FAQ and related issues examples, and other google search results do not address this, only that some form of HDF5 table will be supported starting april 2016.
It isn't clear if there is or is not an issue with the modifications made to the taxonomy. Custom editing json files without parser support is generally a risky idea. Please provide the output from biom summarize-table so we can determine if the file parses as expected. It may also be useful to run biom validate-table. On Jun 28, 2016 6:56 PM, "danchurch" notifications@github.com wrote:
@ Joey711. Thank you for speaking up. Not asking for a diagnosis from you, and don't believe this is a special case, it is a usearch 8.1 json output, modified as above. Originally asked the biom folks to look at my taxonomic metadata formatting in a JSON table to see if they noticed any problems. Haven't received an answer on that, though folks have been amazingly helpful and the issue seems to have evolved a bit. For you specifically, I hadn't gotten to posting on your github yet, but here we are: Does phyloseq read biom HDF5 2.1 binary tables? If so, starting with which version?
Your FAQ and related issues examples, and other google search results do not address this, only that some form of HDF5 table will be supported starting april 2016.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/biocore/biom-format/issues/701#issuecomment-229235830, or mute the thread https://github.com/notifications/unsubscribe/AAc8siDOHrkU33wQrE8pnKgbDFi2j5kxks5qQdC0gaJpZM4I-eO- .
@danchurch that's fine. I was trying to be proactive in case there is a new parsing support needed. Your issue is still not that clear, as I don't know what version of phyloseq you used to attempt file parsing. The latest release of phyloseq does support the v2.0 HDF5 of the biom-format, through a package I released separately on BioC called "biomformat".
I haven't tested on biom V2.1 yet. That version is news to me. @wasade if you want have any tips on testing files or where to look for changes, I'd be happy to look into it.
Cheers
btw, JSON format should be fine unless there's something really weird going on. If it is simply an unusual taxonomy convention, that should be fine. phyloseq exposes functional arguments to define custom taxonomy parsing.
Thanks, @joey711. Format spec is here:
http://biom-format.org/documentation/format_versions/biom-2.1.html
2.0 was out in the wild for only a very brief time, and I'm not aware of any tools which produce it; we released it shortly after 2.0. Best place to monitor changes is this repo. The examples in the repo and test data include valid 2.1 files On Jun 28, 2016 9:06 PM, "Paul J. McMurdie" notifications@github.com wrote:
@danchurch https://github.com/danchurch that's fine. I was trying to be proactive in case there is a new parsing support needed. Your issue is still not that clear, as I don't know what version of phyloseq you used to attempt file parsing. The latest release of phyloseq does support the v2.0 HDF5 of the biom-format, through a package I released separately on BioC called "biomformat".
I haven't tested on biom V2.1 yet. That version is news to me. @wasade https://github.com/wasade if you want have any tips on testing files or where to look for changes, I'd be happy to look into it.
Cheers
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/biom-format/issues/701#issuecomment-229251198, or mute the thread https://github.com/notifications/unsubscribe/AAc8sh1JSyJeYS4O1ZqDZIYpGn8DQ1Q5ks5qQe9RgaJpZM4I-eO- .
Thanks @wasade
In that case, since we posted and released biomformat on BioC in the last 9 months or so, V2.1 is probably the version we're already supporting.
As for the OP issue, are there any test/example files (or a definition) for the JSON file output from usearch?
Can't speak to what usearch does but format spec is here:
http://biom-format.org/documentation/format_versions/biom-1.0.html
biom validate-table
will check the spec.
On Jun 28, 2016 9:33 PM, "Paul J. McMurdie" notifications@github.com
wrote:
Thanks @wasade https://github.com/wasade
In that case, since we posted and released biomformat on BioC in the last 9 months or so, V2.1 is probably the version we're already supporting.
As for the OP issue, are there any test/example files (or a definition) for the JSON file output from usearch?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/biom-format/issues/701#issuecomment-229253928, or mute the thread https://github.com/notifications/unsubscribe/AAc8ssL9eDlPXRjZNivBk-NZZ8lSNMezks5qQfWzgaJpZM4I-eO- .
Right, thanks. biomformat supports that as well (when the package got started as biom on CRAN). I was more curious what fiddly detail usearch is tripping on in its output. I'm familiar with its unique (but well documented) taxonomy convention... I guess I should just support that as an option somewhere...
Apologies for being unclear. I guess I meant in my question, "Does anything stick out to you as problematic in the example row metadata I have shown you that may cause problems with parsing a json-formatting to create a HDF5 biom table or phyloseq data object?"
But thank you for your interest in what may be much deeper issues behind my problems, some of which are probably due to my modifying the json tables on my own. I greatly appreciate the help. So in answer to your questions... prepare yourself for a long post....
On my personal computer:
biom --version
biom, version 2.1.5
R environment:
sessionInfo()
R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.4 LTS
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] biom_0.4.0 phyloseq_1.16.0
loaded via a namespace (and not attached): [1] Rcpp_0.12.5 plyr_1.8.4 XVector_0.12.0
[4] iterators_1.0.8 tools_3.3.1 zlibbioc_1.18.0
[7] jsonlite_0.9.20 nlme_3.1-128 rhdf5_2.16.0
[10] gtable_0.2.0 lattice_0.20-33 mgcv_1.8-12
The outputs from biom summarize-table:
biom summarize-table -i otu_95_sub.biom
Num samples: 98 Num observations: 767176 Total count: 986055 Table density (fraction of non-zero values): 0.010
Counts/sample summary: Min: 409.0 Max: 16808.0 Median: 10451.000 Mean: 10061.786 Std. dev.: 3098.718 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy
Counts/sample detail: 255: 409.0 211: 2695.0 233: 3141.0 180: 3951.0
Etc., etc. All samples are accounted for. No error messages are returned.
The outputs from biom validate-table:
biom validate-table -i otu_95_sub.biom > subbiomvalidate.txt
Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0' 'id' in {'id': '', 'metadata': {'taxonomy': 'd:Fungi(0.9178),p:Basidiomycota(0.2264),c:Agaricomycetes(0.0978),o:Agaricales(0.0707),f:Pleurotaceae(0.0597),g:Pleurotus(0.0086),s:Pleurotus_cystidiosus_SH191285.07FU'}} appears empty Bad value at idx 0: [0, 0, 1] Timestamp does not appear to be ISO 8601 The input file is not a valid BIOM-formatted file.
Note that the above two commands are on the direct outputs of usearch 8.1 as I have used it. Now if we apply to the version that is the result of my scripted modifications:
biom summarize-table -i otu_95_sub_relab2.biom
Num samples: 98 Num observations: 767176 Total count: 986055 Table density (fraction of non-zero values): 0.010
Counts/sample summary: Min: 409.0 Max: 16808.0 Median: 10451.000 Mean: 10061.786 Std. dev.: 3098.718 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy
Counts/sample detail: 255: 409.0 211: 2695.0 233: 3141.0 180: 3951.0
Etc., etc. No error messages.
biom validate-table -i otu_95_sub_relab2.biom
Invalid format 'Biological Observation Matrix 1.0', must be '1.0.0' 'id' in {'metadata': {'taxonomy': ['kFungi', 'pBasidiomycota', 'cAgaricomycetes', 'oAgaricales', 'fPleurotaceae', 'gPleurotus', 's__Pleurotus_cystidiosus_SH191285.07FU']}, 'id': ''} appears empty Bad value at idx 0: [0, 0, 1] Timestamp does not appear to be ISO 8601 The input file is not a valid BIOM-formatted file.
I feel silly for not having found biom validate-table
before, thanks for bringing that up. In both files, there is a lost otu ID. I do not know why this otu ID was lost, I will look into it. The other errors appear cosmetic, seems like I can fix those. I will write to Rob at drive5 and ask him for input on this also.
Part of the reason I have been asking questions about versions rather than supplying my own info is that I prefer to work on our computing cluster, and they are requesting from me which versions of biom, R, phyloseq, etc, that I require. So I am attempting to construct a working environment with proper versions to handle these files before I go forward. It also turns out that the computing cluster cannot install R 3.3 at this point, so I am trying to figure out my options.
It would appear that the json-formatted file is messed up at least partially as there is an unnamed OTU. I'm not concerned about the format version string or the time stamp. It is possible the "Bad value..." is due to how the table is reporting its data type, but that isn't available in the output. Basically, what could trigger the "Bad value..." is if the table has a dtype=float specified as that value is integer. So, likely minor as well.
Can you try running the HDF5 version of the file through summarize-table
, or was that included above? My guess is that the empty ID is resulting in condition that the subsequent R parser isn't handling correctly, so am curious to see if the Python one is. However, the traceback from above suggests there may be an issue with the HDF5 library on the cluster you're on given the error -- are you able to load other BIOM 2.1 files? You can find valid ones here.
biom convert -i otu_95_sub_relab2.biom -o otu_95_sub_relab2.HDF5.biom --to-hdf5
biom summarize-table -i otu_95_sub_relab2.HDF5.biom
Num samples: 98 Num observations: 767176 Total count: 986055 Table density (fraction of non-zero values): 0.010
Counts/sample summary: Min: 409.0 Max: 16808.0 Median: 10451.000 Mean: 10061.786 Std. dev.: 3098.718 Sample Metadata Categories: None provided Observation Metadata Categories: taxonomy
Counts/sample detail: 255: 409.0 211: 2695.0 233: 3141.0
Etc., etc. No errors.
biom validate-table -i otu_95_sub.biom
The input file is a valid BIOM-formatted file.
With phyloseq, in R environment, the above file ('otu_95_sub_relab2.HDF5.biom') returns the above error in R, begins like this:
> aa <- import_biom('otu_95_sub_relab2.HDF5.biom')
* caught segfault * address 0x7ffc1797d6d8, cause 'memory not mapped'
Traceback: 1: .Call("_H5Dread", h5dataset@ID, sidFile, sidMem, buf, compoundAsDataFrame, bit64conv, PACKAGE = "rhdf5") 2: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame, ...) 3: doTryCatch(return(expr), name, parentenv, handler) 4: tryCatchOne(expr, names, parentenv, handlers[[1L]]) 5: tryCatchList(expr, classes, parentenv, handlers) 6: tryCatch(expr, error = function(e) { call <- conditionCall(e)
Etc., etc. Full error shown a few posts back.
With one of your suggested sample hdf5 tables:
> aa <- import_biom('min_sparse_otu_table_hdf5.biom')
Error in
colnames<-
(*tmp*
, value = c("ta1", "ta0")) : length of 'dimnames' [2] not equal to array extent In addition: Warning messages: 1: In strsplit(msg, "\n") : input string 1 is invalid in this locale 2: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered. 3: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered. 4: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered. 5: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered. 6: In parseFunction(i$metadata$taxonomy) : Empty taxonomy vector encountered.
> aa
Error: object 'aa' not found
And another. This one works, with a warning:
> aa <- import_biom('rich_sparse_otu_table_hdf5.biom')
Warning message: In strsplit(msg, "\n") : input string 1 is invalid in this locale
> aa
phyloseq-class experiment-level object otu_table() OTU Table: [ 5 taxa and 6 samples ] sample_data() Sample Data: [ 6 samples by 4 sample variables ] tax_table() Taxonomy Table: [ 5 taxa by 7 taxonomic ranks ]
First, uninstall/remove biom R package. It is deprecated in place of biomformat on BioC. This is what your version of phyloseq should use, and biom should not be loaded in your session. The namespace should be clear enough that this isn't an issue, but let's make sure.
Rather than phyloseq::import_biom()
, try biomformat::read_biom()
on those same files and see what happens. The difference being that read_biom
is more general and not trying to parse very much. There might be some insane index non-matching that is going un-checked, but we need to disambiguate data-derived error from file-system/file/HDF5-format derived error.
Thanks, @danchurch. @joey711, agree with your assessment although I'm not familiar with the internals on that end. Please let me know if there is anything I can do
Okay, I think I've figured out my situation. Two factors were muddying the waters, ones we pretty much identified above, made especially confusing by the fact that I am working on both my own computer and a university computing cluster. The first was that the computing cluster I'm using can't upgrade to R 3.3, so no current version of BioClite/Biomformat/phyloseq. So no ability to handle the HDF5 formats.
The second is that my personal machine, which is up-to-date (see above) with the necessary packages and R, can handle the new formats, theoretically, but any attempt to import these biom tables as phyloseq objects requires too much memory and fails.
So if I ensure that all biom tables remain in json format, all analyses seem to be running smoothly on our computing cluster with R. 3.2.5, phyloseq_1.14.0, and (sorry @joey711) biom_0.3.12.
My scripting seems to make the file more readable to phyloseq. The unmodified output from usearch v8.1 seems to create an unreadable taxonomic data format, at least for me (to see what these biom files outputted from usearch and after my scripts look like, see first message above).
With the raw json biom outputs from usearch, in R 3.2.5/phyloseq_1.14.0/biom_0.3.12:
> bb <- import_biom('otu_95_sub.biom')
Error in i$metadata$taxonomy : $ operator is invalid for atomic vectors
> bb
Error: object 'bb' not found
After my scripting modifications:
>cc <- import_biom('otu_95_sub_relab3.biom')
> cc
phyloseq-class experiment-level object otu_table() OTU Table: [ 767176 taxa and 98 samples ] tax_table() Taxonomy Table: [ 767176 taxa by 7 taxonomic ranks ]
So I have to be careful to keep all my outputs as json files, until the computing cluster updates, or you gentleman can provide backward compatibility for the HDF5 format and bioconductor packages to R versions < 3.3. Not requesting this, just noting.
Anyway, thanks for all of your help, @joey711 @wasade. I'll wait for any responses you might want to give, but for me I think this issue is closed...
Sounds good. This repository is only concerned with defining the format specification, and providing a Python-based in-memory object, so my ability to offer insight or resolve requests in the R realm is non-existent. That, and I've used R only a handful of times.
For what it's worth, I'd be shocked if you weren't able to install R locally on your compute cluster. It is a rare scenario where you need super user privileges. In fact, it looks like quite a bit of the bioconductor suite is available via conda ( https://bioconda.github.io/search.html?q=bioconductor&check_keywords=yes&area=default# ).
On Mon, Jul 11, 2016 at 12:41 PM, danchurch notifications@github.com wrote:
Closed #701 https://github.com/biocore/biom-format/issues/701.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biocore/biom-format/issues/701#event-719136172, or mute the thread https://github.com/notifications/unsubscribe/AAc8sk_jcTN24Oj-rLxR1vVTZkKV5U8aks5qUo5MgaJpZM4I-eO- .
I am doing an ITS region fungal study. I used the usearch v8.1 pipeline, without any qiime scripts, to get me to the point of a biom table with taxonomic assignments. As far as I can tell, the json biom table that was outputted by usearch is not readable by any downstream applications, including biom, either python scripts or in R environment. I believe this may be due to the structure of the taxonomic metadata. So I edited them to try to match examples on the biom website. However, the taxonomic info is still not reading correctly.
The original output from usearch 8.1:
After my scripting, rows look like this:
I believe all brackets have been properly closed. Biom will happily convert to a classical OTU table from this for me when I do not include taxonomic info, but every other biom command generates a nonsense file, not human-readable, and not readable by phyloseq. No error message is returned.
Let me know if this row format seems erroneous, please. If it does not, I will look elsewhere for the source of the error. If you are curious, I am keeping a log of my pipeline on my own github account.