evamaxfield commented 5 years ago

Authors

Name: Jackson Brown
Affiliation: Allen Institute for Cell Science
ORCID:

Keywords

Data Packaging
Large Object Datasets
Data Versioning

Homepage

https://github.com/AllenCellModeling/quilt3distribute

Abstract

A core principle of research is that results should be affordable and easy to reproduce. To minimize the challenge of experimental reproducibility, researchers commonly share the dataset used to produce the results of an experiment. Methods for managing and distributing these datasets, however, are ill-suited for imaging datasets, or more generally large object datasets, because such datasets commonly resemble a manifest and require more packaging and organization than their feature set counterparts.

Quilt3Distribute (Q3D) is a software application that enables the distribution of manifest style datasets, which can be made up of thousands of individual files.

Full Abstract Available at:

(shown below)

Title: Managing Manifests and Distributing Datasets
Date: 09.01.2019
Author: Jackson Brown, Allen Institute for Cell Science
Corresponding Author Email: jacksonb@alleninstitute.org

Managing Manifests and Distributing Datasets

Abstract

A core principle of research is that results should be affordable and easy to reproduce. To minimize the challenge of experimental reproducibility, researchers commonly share the dataset used to produce the results of an experiment. Methods for managing and distributing these datasets, however, are ill-suited for imaging datasets, or more generally large object datasets, because such datasets commonly resemble a manifest and require more packaging and organization than their feature set counterparts.

Quilt3Distribute (Q3D) is a software application that enables the distribution of manifest style datasets, which can be made up of thousands of individual files.

Manifests

There are many ways to store a dataset made up of thousands of files, but these options often carry significant costs. One method of packaging and distributing these datasets that has become popular in recent years is the containerization of data [2, 4]. Because of the nature of geospatial data, scientists and engineers in that field have spent many years creating file formats, data packaging tools, and distribution systems to solve these exact problems, and this work has ultimately established standards for the packaging and distribution of their field's datasets [4]. Following suit, our proposal for the storage and distribution of manifest style datasets uses a software package that falls under the containerization umbrella. "Quilt" is a software package that adds a number of features to current state-of-the-art storage systems (e.g. Amazon's S3) [3]. Of note, Quilt versions dataset containers (what Quilt refers to as data packages), a desirable feature for the scientific resource stack because versioning is critical for reproducibility.

Implementation

This project expands on state of the art methods by specifically targeting the distribution of manifest style datasets. For most scientists at the Allen Institute for Cell Science, datasets are commonly tabular, with a column, or multiple columns, filled with file paths to larger resources (images). The rest of the tabular dataset is largely metadata about the files referenced in each row. To expand on the state of the art, our Python package attaches the metadata from columns found in a tabular manifest style dataset to each file found in the manifest before distributing them with Quilt.
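As a rough illustration of the idea described above, the following sketch maps each file in a small manifest to the metadata found in its row. It is not taken from the Q3D source code: the manifest contents, the column names, and the `metadata_by_file` helper are all hypothetical, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical manifest: one file-path column plus metadata columns.
MANIFEST_CSV = """\
image_path,cell_line,plate
images/a.tiff,AICS-10,1
images/b.tiff,AICS-12,2
"""


def metadata_by_file(manifest_text, path_column):
    """Map each file path in the manifest to the metadata from its row."""
    per_file = {}
    for row in csv.DictReader(io.StringIO(manifest_text)):
        path = row[path_column]
        # Every column except the path column is treated as metadata for that file.
        per_file[path] = {k: v for k, v in row.items() if k != path_column}
    return per_file


file_metadata = metadata_by_file(MANIFEST_CSV, "image_path")
```

Each value in `file_metadata` is the kind of per-file dictionary that could then be attached to the corresponding file when the package is built.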

To help the end user of the dataset, our package additionally cleans and standardizes the metadata it finds. A few examples of these cleaning and validation behaviors are:

1) Asserting that all values in a metadata column are of that column's most common data type, or casting them to it. (Ex: if the values "4" (integer four), "1" (integer one), and "'4'" (string four) exist in the same column, the application attempts to cast the string four to the column's most common data type: an integer.)

2) Enforcing that every file path found exists prior to attempting distribution. (Commonly, when collaboratively working on shared datasets, files will move to a new location during the process of development.)
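The two behaviors listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Q3D implementation: the function names `cast_to_most_common_type` and `missing_files` are hypothetical, and this sketch simply lets the cast raise an exception when a value cannot be converted, which may differ from what the package actually does.

```python
from collections import Counter
from pathlib import Path


def cast_to_most_common_type(values):
    """Cast every value to the column's most common type; a failed cast raises."""
    # Count how often each Python type appears and pick the most frequent one.
    target_type = Counter(type(v) for v in values).most_common(1)[0][0]
    return [v if isinstance(v, target_type) else target_type(v) for v in values]


def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


cleaned = cast_to_most_common_type([4, 1, "4"])  # cleaned is [4, 1, 4]
```

A distribution step could refuse to proceed whenever `missing_files` returns a non-empty list, so that a moved or deleted file is caught before anything is uploaded.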

The extensive work on linked data shows that retaining file associations is valuable both for navigating between objects and for describing the relationships between those objects [1]. To apply these linked data concepts in our own application, if a manifest has multiple columns that contain file paths to resources, our application additionally constructs a lookup table for those shared file associations and places the relevant portion of the lookup table in each file's metadata. An illustrative example of this behavior is an imaging manifest that has a column for full size images and a column for thumbnail images. For each row of the manifest, the application makes an entry in the Quilt package that pairs the full size image with the smaller thumbnail image by storing, in the metadata of each member of the pair, the read path to the other.
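The full size image and thumbnail example can be sketched as follows. This is an illustrative approximation only: the `associate_files` helper, the `"associates"` metadata key, and the column names are assumptions made for this example and are not confirmed by the Q3D documentation.

```python
def associate_files(rows, path_columns):
    """For each file, record the paths of the other files in the same manifest row."""
    associations = {}
    for row in rows:
        for column in path_columns:
            # The other path columns in this row are the file's associates.
            others = {c: row[c] for c in path_columns if c != column}
            associations[row[column]] = {"associates": others}
    return associations


# Hypothetical manifest row with two file-path columns.
rows = [{"full_image": "full/a.tiff", "thumbnail": "thumbs/a.png"}]
linked = associate_files(rows, ["full_image", "thumbnail"])
```

Here `linked["full/a.tiff"]` points at the thumbnail and `linked["thumbs/a.png"]` points back at the full size image, so either file can be reached from the other through its metadata alone.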

Quilt3Distribute (Q3D) is an open-source Python application and additional details and documentation can be found at: https://github.com/AllenCellModeling/quilt3distribute. Q3D is currently used as a primary production system to distribute terabytes of imaging datasets from the Allen Institute for Cell Science.

Acknowledgements

We wish to thank the Allen Institute for Cell Science founder, Paul G. Allen, for his vision, encouragement, and support. This work could not have been completed without the additional support and input from all members of the Allen Institute for Cell Science modeling team.

References

[1] Auer, S. (2011). The emerging web of linked data. Proceedings of the 2011 International Conference on Intelligent Semantic Web-Services and Applications - ISWSA 11. https://doi.org/10.1145/1980822.1980823
[2] Bigdely-Shamlo, N., Makeig, S., & Robbins, K. A. (2016). Preparing Laboratory and Real-World EEG Data for Large-Scale Analysis: A Containerized Approach. Frontiers in Neuroinformatics, 10. https://doi.org/10.3389/fninf.2016.00007
[3] Karve, A., Moore, K., Ryazanov, D., & Mochalov, A. (2018). https://open.quiltdata.com/
[4] Pons, X., & Masó, J. (2016). A comprehensive open package format for preservation and distribution of geospatial data and metadata. Computers & Geosciences, 97, 89–97. https://doi.org/10.1016/j.cageo.2016.09.001

stain commented 5 years ago

Thank you for submitting to RO2019's open peer review process. We will shortly be assigning members from the Programme Committee to review.

Feel free to respond to reviewers' comments and to update the submission if needed.

Tip: Anyone is welcome to add an informal review below using GitHub comments; as an author perhaps you would volunteer to review one of the other open submissions?

Reviewers, please copy this review form and add as a comment. You don't need to use this form if you are not assigned from the PC.

## Quality of Writing
_Is the text easy to follow? Are core concepts defined or referenced? 
Is it clear what is the author's contribution?_

(delete as appropriate)
* excellent / good / fair / poor

## Research Object / Zenodo

_URL for a Research Object or Zenodo record provided?
   Guidelines <http://researchobject.org/ro2019/submitting> followed?
   Open format (e.g. HTML)?
   Sufficient metadata, e.g. links to software?
   Some form of Data Package provided?
   Add text below if you need to clarify your score._

(delete as appropriate)
* none (e.g. only abstract in easychair/github)
* basic (e.g. Zenodo with PDF and minimal metadata)
* sufficient (e.g. HTML, detailed Zenodo metadata)
* good (followed guidelines, demonstrating own format, related resources included, but some details missing)
* excellent (e.g. followed all guidelines, complete metadata or RO-like research data package, linked data, provenance)

## Overall evaluation
_Please provide a brief review, including a justification for your scores. 
Both score and  review text are required._

(delete as appropriate)
* strong reject
* reject
* weak reject
* borderline 
* weak accept
* accept
* strong accept

For confidential remarks or questions about the peer-review process, contact ro2019@easychair.org

dakoop commented 5 years ago

I will review

dakoop commented 5 years ago

Quality of Writing

Is the text easy to follow? Are core concepts defined or referenced? Is it clear what is the author's contribution?

excellent

Research Object / Zenodo

URL for a Research Object or Zenodo record provided? Guidelines http://researchobject.org/ro2019/submitting followed? Open format (e.g. HTML)? Sufficient metadata, e.g. links to software? Some form of Data Package provided? Add text below if you need to clarify your score.

good (followed guidelines, demonstrating own format, related resources included, but some details missing)

Comments

The demonstration is located on the GitHub project page.
It seems like the example on the GitHub page would live on s3, but the live version is not publicly accessible?
No preview available on the Zenodo page

Overall evaluation

Please provide a brief review, including a justification for your scores. Both score and review text are required.

accept

Comments

The abstract describes an interesting real-world solution for the distribution of datasets that are managed via manifests. I think this poster will provide some interesting discussion topics, especially given that its design has been influenced by scientists in cell science. I think good figures and a running example will be useful for helping others understand the solution.

I like the idea of dataset versioning that Quilt provides and agree with the reproducibility concerns without such support
Some more comparison discussing the choice of Quilt versus other existing formats (e.g. BagIt) would be useful to understand differences
I imagine there will be some images in the poster, and I would suggest the example CSV manifest and some diagram of the links between files and metadata entries (with respect to the "illustrative example" in the Implementation section)
Given the work in the geospatial community, is Quilt3Distribute also applicable there?
It would be useful to understand what metadata.csv (from the example results on the project page) looks like
While the package checks for the existence of files at creation time, it is unclear what happens if the s3 storage goes away or filenames are changed. I understand that if something is "distributed", it is likely meant to be static, but this doesn't always happen in practice, especially years later.
With respect to the implementation, what happens if all values in a metadata column cannot be cast to a common type?

stain commented 5 years ago

I will review

stain commented 5 years ago

Quality of Writing

good

Research Object / Zenodo

basic (e.g. Zenodo with PDF and minimal metadata)

[x] URL for a Research Object or Zenodo record provided? [x] Guidelines http://researchobject.org/ro2019/submitting followed? [ ] Open format (e.g. HTML)? [ ] Sufficient metadata, e.g. links to software? [ ] Some form of Data Package provided?

The abstract is technically not Open Access, as it is uploaded to Zenodo with a custom BSD-like license that "prohibits redistribution and use for commercial purposes without further permission". This is the same license as the described Quilt3Distribute source code, which I do not find on the list of OSI-approved open source licenses.

The authors should re-submit the abstract only under an Open Access license like CC-BY-4.0 that does not restrict commercial use of the text, and clarify within the text that the software is not Open Source (although the source code is available).

Overall evaluation

accept

This abstract describes how datasets that are typically tabular can also be described in a tabular form, using local file name references to the "actual" datasets. (This kind of manifest is also found in ISATab and in SEEK data registration using RightField.)

I am not sure if tabular here means CSV files or spreadsheets in formats like .xlsx; presumably the latter, given the mention of dealing with the distinction between integer and string values.

There are not many details provided about the Quilt Package manifest format produced, how packaging and data transfer is done practically, which metadata fields are commonly used or if any external vocabularies are re-used etc. This should be shown on the poster.

The Quilt software and its existing practical use does sound like an interesting piece to demo in Research Object workshop as it deals with producing data packaging in a "researcher friendly" way.

The abstract does not describe how the manifest/metadata is subsequently consumed/queried: are they mainly for humans, or are they also consumed programmatically across all datasets?

stain commented 5 years ago

Apologies for the delay, @JacksonMaxfield

We are happy to announce that your poster has been accepted for RO2019.

The informal poster session is in the workshop room during lunch 12:00-13:00, as well as in the "unconference" session - see https://researchobject.github.io/ro2019/schedule for the programme.

We welcome poster presenters to do a lightning talk at the 11:55 slot (~ 2 minutes) - one slide.

We would like to invite you to also do a demo during the unconference session. This can be either a short plenary demo on the shared screen (~5 mins) or a demo on your laptop in break-out groups.
