NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

Automated metadata extraction #21

Open emhart opened 10 years ago

emhart commented 10 years ago

Organizational Page: AutoMeta
Category: Coding
Title: Automatically extract metadata of R dataframes
Proposed by: Ted Hart
Participants:
Summary: Many datasets that people work with in R already contain the basics of EML's coverage modules in the data itself. Can we build a lightweight set of tools that extract spatial, temporal, and taxonomic coverage? Are there other ways to get more metadata out? This should be a short discussion followed by the creation of actual tools, resulting in a lightweight R package.
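
For a sense of what "lightweight" could mean, here is a hypothetical sketch (not an existing function) that guesses temporal and spatial coverage from a data frame:

```r
## Hypothetical sketch: guess temporal and spatial coverage from a
## data frame's columns. Column-name patterns are illustrative only.
guess_coverage <- function(df) {
  is_date <- vapply(df, function(x) inherits(x, c("Date", "POSIXct")), logical(1))
  temporal <- if (any(is_date)) range(df[[which(is_date)[1]]], na.rm = TRUE) else NULL

  lat <- grep("^lat", names(df), ignore.case = TRUE, value = TRUE)
  lon <- grep("^lon", names(df), ignore.case = TRUE, value = TRUE)
  spatial <- if (length(lat) && length(lon)) {
    c(west  = min(df[[lon[1]]], na.rm = TRUE),
      east  = max(df[[lon[1]]], na.rm = TRUE),
      south = min(df[[lat[1]]], na.rm = TRUE),
      north = max(df[[lat[1]]], na.rm = TRUE))
  } else NULL

  list(temporal = temporal, spatial = spatial)
}
```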

karawoo commented 10 years ago

This sounds fun, I'd be up for helping if I can.

mbjones commented 10 years ago

@emhart Automatically extracting coverage metadata would be so useful. I wonder if you could also capture/document the algorithm so that it can be implemented in other languages for tools not written in R? The R implementation would be a great guide for how to do it well.

One challenge you'll need to overcome is differentiating the often multiple coverage fields in a data set. We often find data sets with multiple temporal fields, representing, e.g., date of observation, date of sample processing, date of QA processing, etc. The most useful of these from a discovery perspective is date of observation, which corresponds to EML's temporal coverage field, and we wouldn't want to conflate these different timestamps in a data set. An analogous set of multiple fields would need to be dealt with for spatial and taxonomic coverage too; it's not as simple as just taking any lat/lon pair you find. I'll be interested in what heuristics would help in differentiating these fields and figuring out which ones represent 'coverage'.
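
For example, one naive name-based heuristic might look like this (purely illustrative, and certainly not robust):

```r
## Hypothetical heuristic: prefer columns whose names suggest
## observation dates over processing/QA timestamps.
score_date_field <- function(nm) {
  nm <- tolower(nm)
  score <- 0
  if (grepl("obs|collect|event|sampled", nm)) score <- score + 2
  if (grepl("process|qa|qc|entered|modified", nm)) score <- score - 2
  score
}

pick_observation_date <- function(df) {
  is_date <- vapply(df, function(x) inherits(x, c("Date", "POSIXct")), logical(1))
  candidates <- names(df)[is_date]
  if (!length(candidates)) return(NULL)
  candidates[which.max(vapply(candidates, score_date_field, numeric(1)))]
}
```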

chrismattmann commented 10 years ago

This would be great. Consider writing an R module that wraps Apache Tika, http://tika.apache.org/. Tika is the "digital babel fish"; even though it's implemented in Java, it has bindings for several other languages, including Julia, Python (emerging support), and .NET, and it also provides a JAX-RS web service that accepts RESTful calls:

https://wiki.apache.org/tika/TikaJAXRS

Here are the downstream APIs from Tika:

https://wiki.apache.org/tika/API%20Bindings%20for%20Tika

I would love to see an R module wrapping Tika; I think it would have a lot of synergy with our existing NSF funding for the Polar Cyber Infrastructure program:

http://www.nsf.gov/awardsearch/showAward?AWD_ID=1348450&HistoricalAwards=false

One immediate step would be to download and try Tika from http://tika.apache.org/; then, if you have questions, you can join our mailing list (dev@tika.apache.org) by sending a blank email to dev-subscribe@tika.apache.org.
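
As a sketch of what the R side could look like against a locally running tika-server (the endpoint and port here are assumptions based on the JAX-RS docs above):

```r
## Sketch only: ask an assumed local tika-server instance
## (http://localhost:9998) for a file's metadata via its REST API.
library(httr)

tika_meta <- function(path, server = "http://localhost:9998") {
  resp <- PUT(paste0(server, "/meta"),
              body = upload_file(path),
              accept("application/json"))
  stop_for_status(resp)
  content(resp, as = "parsed")
}

# meta <- tika_meta("dataset.csv")
```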

jordansread commented 10 years ago

@emhart I would like to help too, but I'm also interested in this development resulting in a package that goes the other way. @lawinslow has pointed out the need to be able to attach metadata to timeseries data as part of the processing chain. I don't know of an easy way to do this, but it seems like it could overlap with what you are proposing. For example, the rLakeAnalyzer package (github.com/GLEON/rLakeAnalyzer) uses wind speed, water temperatures, etc., but there is really no propagation of units once they become data.frames.
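
For instance, a minimal (and fragile) way to carry units today is column attributes, which also illustrates the propagation problem:

```r
## Units stored as column attributes survive some operations but are
## silently dropped by many others -- the propagation problem.
wtr <- data.frame(datetime = as.POSIXct("2014-08-09 00:00", tz = "UTC"),
                  temp = 21.3)
attr(wtr$temp, "units") <- "degC"
attr(wtr$temp, "units")          # "degC"
attr(c(wtr$temp, 22.1), "units") # NULL -- c() dropped the attribute
```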

mbjones commented 10 years ago

@jread-usgs The EML R package that @cboettig developed has extended data frames (as 'data.set') to allow directly attaching metadata to the components of a data frame. The idea is to keep the metadata close to its related data, without getting in the way. See the discussion of this feature in ropensci/EML#47. Maybe this would be useful in metadata propagation?
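
From memory, usage looked roughly like this in the pre-1.0 EML package (the argument names here are illustrative, not a guaranteed signature; see ropensci/EML#47 for the actual interface):

```r
## Rough sketch from memory of the pre-1.0 EML package's data.set;
## check ropensci/EML#47 for the real signature.
library(EML)
dat <- data.set(data.frame(site = c("A", "B"), temp = c(21.3, 19.8)),
                col.defs  = c("Site code", "Water temperature"),
                unit.defs = c("dimensionless", "celsius"))
```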

chrismattmann commented 10 years ago

Thanks @mbjones, I hadn't heard of it. Taro.jl does the same thing in the Julia programming language using Apache Tika.

mbjones commented 10 years ago

Taro.jl looks interesting. Am I reading it right that it only includes column headers? The module I was referencing includes full attribute metadata, including attribute names, descriptions, units, precision, enumerated code values, and other details. It also ingests this metadata from associated EML XML files which are common in our discipline. It would be super cool to have a standard way to attach metadata to data frames that is language neutral -- i.e., the syntax would be the same whether you were in R, python, Julia, etc. They all have the idea of a data frame. @cboettig -- do you think we could generalize your method for parsing and attaching rich metadata across languages?
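
For instance, even something as simple as an agreed JSON sidecar convention would be language neutral (a sketch of the idea, not a proposal for an actual standard):

```r
## Sketch: keep attribute metadata in a JSON sidecar next to the CSV,
## so R, Python, Julia, etc. can all read the same description.
library(jsonlite)

meta <- list(attributes = list(
  list(name = "datetime", definition = "Observation time", unit = "ISO 8601"),
  list(name = "temp", definition = "Water temperature", unit = "celsius")
))
write_json(meta, "lake_data.meta.json", auto_unbox = TRUE, pretty = TRUE)
```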

cboettig commented 10 years ago

Good question. I'm not quite sure how it would work. My intuition is that it would be more useful to have programming languages map between their native types and full implementations of a data standard like EML. As you know, the data.set class in our EML package provides just a subset of the EML spec (attributeList-level elements).

In working across languages, I would imagine each language having its own library to serialize its native, commonly used classes, along with whatever metadata is essential and/or can be determined automatically, into a data standard format like EML, which then serves as the interface between languages rather than merely a subset of EML.

Using the full EML standard rather than the data.set subset should make conversion to other standards and tools easier, though I suppose the RDF route is the other way to go.

I spoke to Hadley about data.set strategies at the rOpenSci hackathon; he was concerned that it may be difficult to guarantee that a data.set object always behaves like a data.frame. So in the R package I've been moving to a model where a user works mostly with either the full EML R object or a plain data.frame (and other common R data structures that we can describe in EML).
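
As a toy example of what "serializing native classes into EML" could mean (illustrative only, not the EML package's API):

```r
## Illustrative only: derive a minimal EML-ish attributeList from a
## data frame's native column types.
infer_attributes <- function(df) {
  lapply(names(df), function(nm) {
    list(attributeName = nm,
         domain = switch(class(df[[nm]])[1],
                         numeric   = "numericDomain",
                         integer   = "numericDomain",
                         factor    = "enumeratedDomain",
                         character = "textDomain",
                         Date      = "dateTimeDomain",
                         "textDomain"))  # fallback
  })
}
```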


jordansread commented 10 years ago

@mbjones thanks for pointing out that functionality in the EML package. That will go a long way for us. I will try to keep up with @cboettig's ideas here and hope to bump into you at codefest.

emhart commented 10 years ago

One concrete thing we can do is implement an algorithm that generates a g-polygon (EML's gPolygon) for a set of points. If there isn't already an implementation in R, I'm sure we could hammer this function out over codefest.

jordansread commented 10 years ago

@emhart here is one example of 'footprinting' a set of points: http://grokbase.com/t/r/r-sig-geo/1261q5j3mn/concave-hull-of-lat-lon-points

Alternatively, a convex hull is simpler but is often not representative of the sampling area: http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/chull.html
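
A minimal convex-hull footprint with base R, just to illustrate:

```r
## Convex hull of sampling points via grDevices::chull(); a concave
## hull (see the link above) better represents irregular areas.
set.seed(1)
pts <- data.frame(lon = runif(50, -122, -121), lat = runif(50, 37, 38))
hull <- pts[chull(pts$lon, pts$lat), ]  # hull vertices
plot(pts$lon, pts$lat)
polygon(hull$lon, hull$lat, border = "blue")
```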

emhart commented 10 years ago

Cool, thanks @jread-usgs. I figured there was something out there. Another thing I'd like to consider is where these sets of functions should live. Should it be its own package? Should it be part of @cboettig's EML package? Once we hammer out some of the details on what kinds of metadata we can extract, we should find a good home for them. I'll let Carl weigh in on whether or not it would be a good fit with the EML package.

dlebauer commented 10 years ago

This sounds like something that could be done in connection with the Brown Dog Data Tilling Service, an open source tool funded by NSF DataNet (like DataONE). Based on their use cases, much of their focus will be on ecologically relevant data. Thoughts @robkooper?