NCEAS / open-science-codefest

Web site and planning materials for open science conference.
http://nceas.github.io/open-science-codefest

CRAN for Data #28

Open lawinslow opened 10 years ago

lawinslow commented 10 years ago

Organizational Page: CRAN4Data
Category: Data and Code
Title: CRAN-like package management for data
Proposed by: Luke Winslow
Participants: Anyone who finds CRAN useful!
Summary: CRAN is one of the most successful platforms for scientific code sharing. Could data sharing borrow a page from the CRAN book and create a package format for data?

This package format could leverage pieces of the existing package system to allow for named data packages that could be installed, updated, sourced, and documented. Documentation could be a modified form of the R package documentation system, using machine-readable formats to generate human-readable documentation. A key part of the data package could be a way to include parsing and import code so that rich, varied inputs could be used (GeoTIFF, shapefile, NetCDF, etc.).
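
As a rough sketch of how this might feel to a user, assuming a hypothetical data package named lakeTempData published to a CRAN-like data repository (all names here are invented):

```r
# Hypothetical workflow; "lakeTempData" and "lake_temps" are invented names.
install.packages("lakeTempData")  # fetch the versioned data package
library(lakeTempData)             # attach it like any other R package
?lake_temps                       # human-readable docs built from package documentation
data(lake_temps)                  # load the documented dataset
head(lake_temps)
```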

We could discuss a way to increase or remove the size cap on contributed packages so that large datasets could be included. Packages could also consist only of code that imports data from an external web-based, persistent repository (DataONE, Data.gov, figshare). The packages could be versioned and easily updated. Like CRAN packages, data packages could have both CRAN packages and other data packages as dependencies that are automatically installed and sourced.
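
One way a thin, code-only data package could work is to download from the persistent repository on first use and cache the result locally; a minimal sketch, with the URL, file name, and function name all invented:

```r
# Hypothetical import function for a "thin" data package: the package ships
# only code and documentation, while the data live in an external repository.
get_lake_temps <- function(cache_dir = file.path(tempdir(), "lakeTempData")) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  local_file <- file.path(cache_dir, "lake_temps_v1.csv")
  if (!file.exists(local_file)) {
    # Download once from the persistent repository, then reuse the cached copy.
    download.file("https://example-data-repository.org/lake_temps_v1.csv",
                  destfile = local_file, mode = "wb")
  }
  read.csv(local_file, stringsAsFactors = FALSE)
}
```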

sckott commented 10 years ago

I like the idea. You may want to consider the ongoing work by @maxogden and company on dat - it will have versioning just like git, aims to handle larger data sets than git can deal with, and will do a lot of data format conversions, etc.

Perhaps building on top of dat would be one way to go. We (ropensci) are going to build an R client/package around dat, so that should make integration in R easy.

There are also the OKF data packages (http://dataprotocols.org/data-packages/), though I think dat is a better approach.
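
For comparison, the OKF format centers on a small JSON descriptor (datapackage.json). A minimal descriptor could be written and read from R with jsonlite, roughly like this (all field values are made up):

```r
library(jsonlite)

# Minimal OKF-style datapackage.json descriptor; the values are placeholders.
descriptor <- list(
  name = "lake-temperatures",
  title = "Hypothetical lake temperature observations",
  version = "0.1.0",
  resources = list(list(
    path   = "data/lake_temps.csv",
    format = "csv"
  ))
)
write_json(descriptor, "datapackage.json", auto_unbox = TRUE, pretty = TRUE)

# Reading it back gives ordinary R lists/data frames to work with.
pkg <- read_json("datapackage.json", simplifyVector = TRUE)
pkg$resources$path
```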

jordansread commented 10 years ago

dat looks pretty cool (pun intended). There are a couple of interesting things to talk about here.

1) Is the issue the need to increment/version at high frequencies, such as the sensor examples Max is talking about? When I read this, I was thinking more about curated data that would be a bit slower to evolve than the per-measurement commits he references.
2) Is the idea to have simply data packages, or to also include a pseudo-standardized set of operating tools that are part of the package?
3) Going back to dat, or a packaging system which could handle high-frequency sensor data: that would be a pretty big advance for the way we operate on near-real-time sensor feeds, because at present there is either a lot of manual appending, download duplication, or stopping analyses arbitrarily at a certain point in time to keep a static dataset.
4) Would attribution be a part of this? I think we should also try to push the new scientific attribution units, which can be research paper citations but should also include tool/package downloads, user stats, etc. There should be a few carrots designed to make data sharing a bit more common for high-value datasets that are curated over the course of a grad career or as part of project work.

Either way, this is something I am extremely interested in. 5 stars.

lawinslow commented 10 years ago

@sckott, I hadn't heard of dat. That's especially relevant: back when I began thinking about data packages, I had pondered whether git could be used as a sort of version management system for the data, so updates would involve only incremental changes rather than full downloads. The OKF data packages are pretty close to what I was thinking, only I'd want something a little more baked into the R world.

@jread-usgs, great questions.

1) Is the issue the need to increment/version at high frequencies, such as the sensor examples Max is talking about? When I read this, I was thinking more about curated data that would be a bit slower to evolve than the per-measurement commits he references.

I had originally thought more on the curated side as well, but one might argue that these don't necessarily have to be slowly evolving packages; they could really be described as curated, maintained, structured connections to data. That data might be static files, or it might be a dat connection to a data firehose.

2) Is the idea to have simply data packages, or to also include a pseudo-standardized set of operating tools that are part of the package?

I think the beauty of the standard CRAN package is that documentation, examples, tests, and code all come as a single unit. I think data could benefit from this model as well: data coupled with working import code and built-in documentation. So I guess I'm saying there would be some minimum level of operating tools that are part of the package (just like a proper R package has code, documentation, and running examples).
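
A sketch of what that minimum unit could look like in practice: a roxygen-documented import function shipped alongside the raw file in inst/extdata/, so the parsing logic and the documentation travel with the data (the package, file, and column names are hypothetical):

```r
#' Lake temperature observations (hypothetical example dataset)
#'
#' Reads the raw CSV shipped in inst/extdata/ of the package and returns a
#' tidy data.frame, so import code and documentation come with the data.
#'
#' @return A data.frame with columns `lake`, `date`, and `temp_c`.
#' @examples
#' temps <- load_lake_temps()
#' head(temps)
#' @export
load_lake_temps <- function() {
  raw <- system.file("extdata", "lake_temps.csv", package = "lakeTempData")
  read.csv(raw, stringsAsFactors = FALSE)
}
```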

3) Going back to dat, or a packaging system which could handle high-frequency sensor data...

I totally agree.

4) Would attribution be a part of this? I think we should also try to push the new scientific attribution units, which can be research paper citations but should also include tool/package downloads, user stats, etc. There should be a few carrots designed to make data sharing a bit more common for high-value datasets that are curated over the course of a grad career or as part of project work.

I would like to see a CRAN-plus model. CRAN is simple and cheap to host, but the lack of stats does make it hard to make a case for the impact of your own work. RStudio releases the download logs for its mirror, but I agree it would be nice to surface more metrics.
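
For what it's worth, the logs RStudio publishes at cran-logs.rstudio.com can already be summarised in a few lines of R; a rough sketch (the date is arbitrary, and the URL pattern is the one that site uses for its daily files):

```r
# Sketch: count downloads per package for one day of the public CRAN
# mirror logs published at cran-logs.rstudio.com.
log_url  <- "http://cran-logs.rstudio.com/2014/2014-08-01.csv.gz"
log_file <- file.path(tempdir(), "cran-log-2014-08-01.csv.gz")
download.file(log_url, log_file, mode = "wb")

logs <- read.csv(gzfile(log_file), stringsAsFactors = FALSE)
downloads <- sort(table(logs$package), decreasing = TRUE)
head(downloads, 10)  # the ten most-downloaded packages that day
```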

mbjones commented 10 years ago

Great topic, and definitely worth exploring. dat is a great (alpha) idea, but is mainly focused on tabular data. DataONE has also focused on a more loosely bound idea of a Data Package, which uses the OAI-ORE standard to describe the resources that make up a Data Package and allows mixing data objects of a variety of types (e.g., tabular data with shapefiles and images). In addition, there are a number of very mature data containers in the sciences, including NetCDF and HDF5, that go far beyond what dat does. We give an overview of these in our documentation of DataONE Data Packages.
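
To make the contrast concrete, reading from one of those mature containers is already just a few lines in R, e.g. with the ncdf4 package (the file and variable names below are placeholders):

```r
library(ncdf4)

# Open an existing NetCDF file and pull out one variable plus its metadata.
# "temperature.nc" and "temp" are placeholder names for illustration.
nc <- nc_open("temperature.nc")
print(nc)                          # self-describing: dimensions, units, attributes
temp  <- ncvar_get(nc, "temp")     # read the variable into an R array
units <- ncatt_get(nc, "temp", "units")$value
nc_close(nc)
```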

max-mapper commented 10 years ago

Hi all, dat has been under heavy development over the last few months in order to add support for storing blobs (large files) in addition to the existing tabular data store.

We are still working on shipping a stable version, and as part of that I've been working on guides that try to explain the goals of the project more concretely. One is our data importing guide (work in progress). At the bottom it outlines the different types of modules that can be written on top of dat/using dat.

The high level goal is to present a streaming data replication API that can be easily extended to interface with new file formats, databases or storage backends. There are lots of tradeoffs to consider when choosing how you want to version data, e.g. do you try to denormalize an HDF5 file into a graph of tables in order to get more granular versioning, or do you just store the whole file in the blob store and get coarse (file-level) versioning? We are trying to leave the options open and just focus on making something that is fast + easy to use.

Once we have our small core in place (soon!) we can start focusing on writing modules on top of dat (we have a wishlist going, feel free to add to it). Our success depends on bootstrapping an ecosystem of really useful modules and data pipeline management tools, with dat at the core moving data/metadata around.

mbjones commented 10 years ago

@maxogden Thanks for the clarification. That is awesome that dat will support modules for additional data types. Existing systems like NetCDF support fully parallel data access with complex data models using their access libraries. Is your streaming API for dat similar to the NetCDF libraries? I think it would be really interesting to compare and contrast how HDF5 and NetCDF work with where you are going with dat. I'd certainly learn a lot from that conversation.

chrismattmann commented 10 years ago

dat sounds interesting. How does it compare to Spark Streaming, or the work going on with Storm and Kafka?