emhart opened this issue 10 years ago
This sounds fun, I'd be up for helping if I can.
@emhart Automatically extracting coverage metadata would be so useful. I wonder if you could also capture/document the algorithm so that it can be implemented in other languages for tools not written in R? The R implementation would be a great guide for how to do it well. One challenge that you'll need to overcome is differentiating the often multiple coverage fields in a data set. We often find data sets that have multiple temporal fields, representing e.g., date of observation, date of sample processing, date of QA processing, etc. The most useful of these from a discovery perspective is date of observation, which corresponds to EML's temporal coverage field, and we wouldn't want to conflate these different timestamps in a data set. An analogous set of multiple fields would need to be dealt with for space and taxon coverage too -- it's not as simple as just taking any lat/lon pair you find. I'll be interested in what heuristics would help in differentiating these fields and figuring out which ones represent 'coverage'.
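To make the heuristics question concrete, here is a minimal sketch of one possible approach: find date-like columns, then score their names so that observation/collection dates win out over processing or QA dates. The column names and scoring weights below are purely illustrative assumptions, not a worked-out algorithm.

```r
# Hypothetical sketch: among several date-like fields, guess which one is the
# "date of observation" using simple name-based heuristics. Names and weights
# here are illustrative assumptions only.
find_coverage_date <- function(df) {
  is_datelike <- vapply(df, function(x) {
    if (inherits(x, c("Date", "POSIXt"))) return(TRUE)
    if (!is.character(x)) return(FALSE)
    # does the character column parse as dates?
    tryCatch(!all(is.na(as.Date(x))), error = function(e) FALSE)
  }, logical(1))
  candidates <- names(df)[is_datelike]
  if (length(candidates) == 0) return(NA_character_)
  # prefer names suggesting observation/collection; penalize processing/QA
  score <- function(nm) {
    nm <- tolower(nm)
    s <- 0
    if (grepl("obs|collect|sampl", nm)) s <- s + 2
    if (grepl("process|qa|entry", nm)) s <- s - 2
    s
  }
  candidates[which.max(vapply(candidates, score, numeric(1)))]
}

df <- data.frame(
  obs_date = as.Date(c("2014-06-01", "2014-06-02")),
  qa_date  = as.Date(c("2014-07-01", "2014-07-02")),
  sp       = c("a", "b")
)
find_coverage_date(df)  # "obs_date"
```

Real data would of course need richer heuristics (and probably a way for the user to confirm the guess), but something in this spirit could be documented in a language-neutral way.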
This would be great. Consider writing an R module that wraps Apache Tika, http://tika.apache.org/. Tika is the "digital babel fish"; even though it's implemented in Java, it has been bound to several other languages including Julia, Python (emerging support), and .NET, and it also provides a JAX-RS web service that allows RESTful calls to be made to it:
https://wiki.apache.org/tika/TikaJAXRS
Here are the downstream APIs from Tika:
https://wiki.apache.org/tika/API%20Bindings%20for%20Tika
Would love to get an R module wrapping Tika, and I think it could provide a lot of help and synergy with our existing NSF funding for the Polar Cyber Infrastructure program:
http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348450&HistoricalAwards=false
There could potentially be some good synergy here. One immediate step would be to download and try Tika per the website http://tika.apache.org/, and then if you have questions, you can join our mailing list (dev@tika.apache.org) by sending a blank email to dev-subscribe@tika.apache.org.
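For anyone curious what wrapping the TikaJAXRS service from R might look like, here is a rough sketch using the httr package. It assumes a Tika server is already running locally (by default it listens on port 9998); the function names are my own, not part of any existing package.

```r
# Hypothetical sketch of calling Tika's JAX-RS web service from R.
# Assumes a locally running Tika server (default port 9998); the /meta
# endpoint returns metadata for an uploaded file.
library(httr)

tika_meta_url <- function(host = "localhost", port = 9998) {
  paste0("http://", host, ":", port, "/meta")
}

extract_metadata <- function(path, url = tika_meta_url()) {
  resp <- PUT(url, body = upload_file(path))  # upload the file to /meta
  stop_for_status(resp)
  content(resp, as = "text")  # Tika returns metadata as CSV text
}
```

A thin wrapper like this avoids any Java dependency in the R package itself, at the cost of requiring the user to run the Tika server separately.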
@emhart I would like to help too, but am also interested in this development resulting in a package that goes the other way too. @lawinslow has pointed out the need to be able to attach metadata to time series data as part of the processing chain. I don't know of an easy way to do this, but it seems like it could overlap with what you are proposing. For example, the rLakeAnalyzer package (github.com/GLEON/rLakeAnalyzer) uses wind speed, water temperature, etc., but there is really no propagation of units once they become data.frames.
@jread-usgs The EML R package that @cboettig developed has extended data frames (as 'data.set') to allow directly attaching metadata to the components of a data frame. The idea is to keep the metadata close to its related data, without getting in the way. See the discussion of this feature in ropensci/EML#47. Maybe this would be useful in metadata propagation?
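As a minimal base-R illustration of the "keep the metadata close to the data" idea (this is not the EML package's data.set class, just a sketch of the general pattern), units can be attached to data.frame columns as attributes:

```r
# Sketch: attach units to data.frame columns via attributes.
# Not the EML package's data.set class -- just the general idea.
set_units <- function(df, units) {
  for (nm in names(units)) attr(df[[nm]], "units") <- units[[nm]]
  df
}

get_units <- function(df) {
  vapply(df, function(x) {
    u <- attr(x, "units")
    if (is.null(u)) NA_character_ else u
  }, character(1))
}

wtr <- data.frame(temp = c(18.2, 17.9), wind = c(3.1, 4.0))
wtr <- set_units(wtr, list(temp = "celsius", wind = "metersPerSecond"))
get_units(wtr)
```

The catch, and exactly the propagation problem @jread-usgs raises, is that many data.frame operations silently drop attributes, which is presumably why a dedicated class like data.set exists.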
Thanks @mbjones, I hadn't heard of it. Taro.jl does the same thing in the Julia programming language using Apache Tika.
Taro.jl looks interesting. Am I reading it right that it only includes column headers? The module I was referencing includes full attribute metadata, including attribute names, descriptions, units, precision, enumerated code values, and other details. It also ingests this metadata from associated EML XML files which are common in our discipline. It would be super cool to have a standard way to attach metadata to data frames that is language neutral -- i.e., the syntax would be the same whether you were in R, python, Julia, etc. They all have the idea of a data frame. @cboettig -- do you think we could generalize your method for parsing and attaching rich metadata across languages?
Good question. I'm not quite sure how it would work. My intuition is that it would be more useful to have programming languages map between their native types and the full implementation of a data standard like EML. As you know, the data.set class in our EML package provides just a subset of the EML spec (attributeList-level elements).
In working across languages, I would imagine each language having its own library to serialize its native and commonly used classes, along with whatever metadata is essential and/or can be determined automatically, into a data standard like EML. The full standard, rather than merely a subset of EML, would then serve as the interface between languages.
Using the full EML standard rather than the data.set subset should also make conversion between other standards and tools easier, though I suppose the RDF route is the other way to go.
Spoke to Hadley about data.set strategies at the ropensci hackathon, who was concerned that it may be difficult to guarantee that a data.set object always behaves like a data.frame, so in the R package I've been moving to a model where a user works mostly with either the full EML R object or a data.frame (and other common R data structures that we can describe in EML).
@mbjones thanks for pointing out that functionality in the EML package. That will go a long way for us. I will try to keep up with @cboettig's ideas here and hope to bump into you at codefest.
One concrete thing we can do is implement an algorithm that generates a g-polygon for a set of points. I'm not sure if such a thing exists already, but I'm sure we could hammer this function out over codefest if there's not already an implementation in R.
@emhart here is one example of 'footprinting' a set of points: http://grokbase.com/t/r/r-sig-geo/1261q5j3mn/concave-hull-of-lat-lon-points
Alternatively, a convex hull is simpler but is often not representative of the sampling area: http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/chull.html
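For the simpler convex-hull case, base R already has everything needed: grDevices::chull returns the indices of the hull vertices, and closing the ring gives an EML-style gPolygon. A rough sketch (the function name and data frame layout are my own choices):

```r
# Sketch: derive a gPolygon-style ordered vertex ring from a set of sampling
# points using the convex hull in base R. As noted above, a convex hull can
# overstate the true sampling area.
points_to_gpolygon <- function(lon, lat) {
  idx <- grDevices::chull(lon, lat)  # indices of hull vertices, in order
  idx <- c(idx, idx[1])              # repeat the first vertex to close the ring
  data.frame(longitude = lon[idx], latitude = lat[idx])
}

lon <- c(-122.3, -122.1, -121.9, -122.0, -122.2)
lat <- c(37.8, 37.9, 37.7, 37.6, 37.7)
points_to_gpolygon(lon, lat)
```

Swapping in a concave-hull ("footprinting") routine like the one in the r-sig-geo thread above would be a drop-in change, since only the index-finding step differs.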
Cool, thanks @jread-usgs. I figured there was something out there. Another thing I'd like to consider with this idea is where these sets of functions should live. Should it be its own package? Should it be part of @cboettig's EML package? I think once we hammer out some of the details on what kinds of metadata we can extract, we should find a good home for them. I'll let Carl weigh in on whether or not it would be a good fit with the EML package.
This sounds like something that could be done in connection with the Brown Dog Data Tilling Service, which is an open source tool funded by the NSF DataNet program (like DataONE). Based on their use cases, much of their focus will be ecologically relevant data. Thoughts @robkooper?
Organizational Page: AutoMeta
Category: Coding
Title: Automatically extract metadata of R dataframes
Proposed by: Ted Hart
Participants:
Summary: Many datasets that people work with in R already contain the basics of EML's coverage modules in the data itself. Can we build a lightweight set of tools that can extract spatial, temporal, and taxonomic coverage? Are there other ways to get more metadata out? This should be a short discussion followed by the creation of actual tools, resulting in a lightweight R package.
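As a starting point for the proposed tools, extracting spatial and temporal coverage from a well-behaved data frame is only a few lines of base R. The column names below are assumptions for illustration; in practice they would come from heuristics like those discussed earlier in the thread.

```r
# Sketch of lightweight coverage extraction: given a data frame with known
# date / latitude / longitude columns (names assumed here), return EML-style
# temporal and spatial (bounding box) coverage.
extract_coverage <- function(df, date_col = "date",
                             lat_col = "latitude", lon_col = "longitude") {
  list(
    temporal = range(df[[date_col]], na.rm = TRUE),
    spatial = c(
      westBoundingCoordinate  = min(df[[lon_col]], na.rm = TRUE),
      eastBoundingCoordinate  = max(df[[lon_col]], na.rm = TRUE),
      southBoundingCoordinate = min(df[[lat_col]], na.rm = TRUE),
      northBoundingCoordinate = max(df[[lat_col]], na.rm = TRUE)
    )
  )
}

d <- data.frame(
  date = as.Date(c("2014-06-01", "2014-08-15")),
  latitude = c(37.6, 37.9), longitude = c(-122.3, -121.9)
)
extract_coverage(d)
```

Taxonomic coverage is the harder case, since it requires resolving name columns against a taxonomy service rather than taking a simple range.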