Thank you for submitting to RO2019's open peer review process. We will shortly be assigning members from the Programme Committee to review.
Feel free to respond to reviewers' comments and to update the submission if needed.
Tip: Anyone is welcome to add an informal review below using GitHub comments; as an author perhaps you would volunteer to review one of the other open submissions?
Reviewers, please copy this review form and add as a comment. You don't need to use this form if you are not assigned from the PC.
## Quality of Writing
_Is the text easy to follow? Are core concepts defined or referenced?
Is it clear what is the author's contribution?_
(delete as appropriate)
* excellent / good / fair / poor
## Research Object / Zenodo
_URL for a Research Object or Zenodo record provided?
Guidelines <http://researchobject.org/ro2019/submitting> followed?
Open format (e.g. HTML)?
Sufficient metadata, e.g. links to software?
Some form of Data Package provided?
Add text below if you need to clarify your score._
(delete as appropriate)
* none (e.g. only abstract in easychair/github)
* basic (e.g. Zenodo with PDF and minimal metadata)
* sufficient (e.g. HTML, detailed Zenodo metadata)
* good (followed guidelines, demonstrating own format, related resources included, but some details missing)
* excellent (e.g. followed all guidelines, complete metadata or RO-like research data package, linked data, provenance)
## Overall evaluation
_Please provide a brief review, including a justification for your scores.
Both score and review text are required._
(delete as appropriate)
* strong reject
* reject
* weak reject
* borderline
* weak accept
* accept
* strong accept
For confidential remarks or questions about the peer-review process, contact ro2019@easychair.org
I will review
The abstract describes an interesting real-world solution for the distribution of datasets that are managed via manifests. I think this poster will provide some interesting discussion topics, especially given that its design has been influenced by scientists in cell science. I think good figures and a running example will be useful for helping others understand the solution.
I will review
* [x] URL for a Research Object or Zenodo record provided?
* [x] Guidelines http://researchobject.org/ro2019/submitting followed?
* [ ] Open format (e.g. HTML)?
* [ ] Sufficient metadata, e.g. links to software?
* [ ] Some form of Data Package provided?
The abstract is technically not Open Access, as it is uploaded to Zenodo with a custom BSD-like license that "prohibits redistribution and use for commercial purposes without further permission". This is the same license as the described Quilt3Distribute source code, which I do not find on the list of OSI-approved open source licenses.
The authors should re-submit the abstract alone under an Open Access license like CC-BY-4.0 that does not restrict commercial use of the text, and clarify within the text that the software is not Open Source (although the source code is available).
This abstract describes how datasets that are typically tabular can also be described in a tabular form, using local file name references to the "actual" datasets. (This kind of manifest is also found in ISATab and in SEEK data registration using RightField.)
I am not sure if tabular here means CSV files or spreadsheets in formats like .xlsx, presumably the latter due to the mention of dealing with the distinction between integer and string values.
There are not many details provided about the Quilt Package manifest format produced, how packaging and data transfer is done practically, which metadata fields are commonly used or if any external vocabularies are re-used etc. This should be shown on the poster.
The Quilt software and its existing practical use do sound like an interesting piece to demo at the Research Object workshop, as it deals with producing data packaging in a "researcher friendly" way.
The abstract does not describe how the manifest/metadata is subsequently consumed or queried: is it mainly for humans, or is it also consumed programmatically across all datasets?
Apologies for the delay, @JacksonMaxfield
We are happy to announce that your poster has been accepted for RO2019.
The informal poster session is in the workshop room during lunch 12:00-13:00, as well as in the "unconference" session - see https://researchobject.github.io/ro2019/schedule for the programme.
We welcome poster presenters to do a lightning talk at the 11:55 slot (~ 2 minutes) - one slide.
We would also like to invite you to do a demo during the unconference session. This can either be a short plenary demo on the shared screen (~5 mins), or on your laptop in break-out groups.
Authors
Keywords
Homepage
https://github.com/AllenCellModeling/quilt3distribute
Abstract
A core principle of research is the affordability and ease of reproducing the results found by an experiment. To minimize the challenge of experimental reproducibility, it is common for researchers to share the dataset used to produce the results of an experiment. Methods for managing and distributing these datasets, however, are ill-suited for imaging datasets, or more generally large object datasets, because such datasets commonly resemble a manifest and require more packaging and organization than their feature set counterparts.
Quilt3Distribute (Q3D) is a software application that enables the distribution of manifest-style datasets, which can be made up of thousands of individual files.
Full Abstract Available at:
(shown below)
Title: Managing Manifests and Distributing Datasets
Date: 09.01.2019
Author: Jackson Brown, Allen Institute for Cell Science
Corresponding Author Email: jacksonb@alleninstitute.org
Managing Manifests and Distributing Datasets
Abstract
A core principle of research is the affordability and ease of reproducing the results found by an experiment. To minimize the challenge of experimental reproducibility, it is common for researchers to share the dataset used to produce the results of an experiment. Methods for managing and distributing these datasets, however, are ill-suited for imaging datasets, or more generally large object datasets, because such datasets commonly resemble a manifest and require more packaging and organization than their feature set counterparts.
Quilt3Distribute (Q3D) is a software application that enables the distribution of manifest-style datasets, which can be made up of thousands of individual files.
Manifests
There are many ways to store a dataset made up of thousands of files, but often these options have significant associated costs. One such method of packaging and distributing these datasets, made popular in recent years, is the containerization of data [2, 4]. Due to the nature of geospatial data, scientists and engineers have been creating file formats, data packaging tools, and distribution systems for many years to solve these exact problems, which has ultimately established standards for the packaging and distribution of their field's datasets [4]. Following suit, our proposal for storage and distribution of manifest-style datasets uses a software package that falls under the containerization umbrella. "Quilt" is a software package that adds a number of features to current state-of-the-art storage systems (e.g. Amazon's S3) [3]. Of note: Quilt performs versioning of dataset containers (what Quilt refers to as data packages), which is a desirable feature for the scientific resource stack, as versioning is critical for reproducibility.
Implementation
This project expands on state of the art methods by specifically targeting the distribution of manifest style datasets. For most scientists at the Allen Institute for Cell Science, datasets are commonly tabular, with a column, or multiple columns, filled with file paths to larger resources (images). The rest of the tabular dataset is largely metadata about the files referenced in each row. To expand on the state of the art, our Python package attaches the metadata from columns found in a tabular manifest style dataset to each file found in the manifest before distributing them with Quilt.
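As a rough illustration of the idea described above, the following minimal sketch maps each file path in a tabular manifest to the metadata found in the rest of its row. It is not the Q3D implementation and does not call Quilt; the function name `attach_metadata`, the column names, and the sample manifest are all hypothetical.

```python
import csv
import io


def attach_metadata(manifest_csv, path_columns):
    """Map every file path in the manifest to the metadata of its row.

    Hypothetical helper for illustration only; not part of Q3D or Quilt.
    """
    reader = csv.DictReader(io.StringIO(manifest_csv))
    file_metadata = {}
    for row in reader:
        # Every column that does not hold file paths is treated as metadata.
        meta = {key: value for key, value in row.items() if key not in path_columns}
        for column in path_columns:
            # Each referenced file receives its own copy of the row metadata.
            file_metadata[row[column]] = dict(meta)
    return file_metadata


# Hypothetical two-row manifest with a single file path column.
MANIFEST = """cell_id,structure,image_path
1,nucleus,images/cell_1.tiff
2,membrane,images/cell_2.tiff
"""

result = attach_metadata(MANIFEST, ["image_path"])
```

In a real pipeline the resulting per-file metadata would presumably be handed to the packaging tool together with each file; how Q3D does this is documented in its repository.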
To help the end user of the dataset, our package additionally cleans and standardizes the metadata it finds. A few examples of these cleaning and validation behaviors are:
1) Asserting that all values in a metadata column are of the most common data type for that column, or casting them to it. (Ex: if the values "4" (integer four), "1" (integer one), and "'4'" (string four) exist in the same column, the application attempts to cast the instance of the string four to the column's most common data type: an integer.)
2) Enforcing that every file path found exists prior to attempting distribution. (Commonly, when collaboratively working on shared datasets, files will move to a new location during the process of development.)
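The two behaviors listed above could be sketched roughly as follows. This is an assumption about how such checks might look, not the code Q3D actually runs; the helper names `cast_to_most_common_type` and `missing_paths` are hypothetical.

```python
from collections import Counter
from pathlib import Path


def cast_to_most_common_type(values):
    """Cast every value in a column to the column's most common type.

    Hypothetical helper: raises ValueError or TypeError if a value
    cannot be cast.
    """
    # Find the type that occurs most often in the column.
    target = Counter(type(value) for value in values).most_common(1)[0][0]
    # Leave matching values untouched and cast the rest.
    return [value if type(value) is target else target(value) for value in values]


def missing_paths(paths):
    """Return the file paths that do not point to an existing file."""
    return [path for path in paths if not Path(path).is_file()]


# Two integers and one string: the string is cast to an integer.
cleaned = cast_to_most_common_type([4, 1, "4"])

# A path that should not exist is reported, so distribution can be refused.
not_found = missing_paths(["definitely/not/here.tiff"])
```

Reporting the missing paths rather than failing on the first one is a design choice of this sketch; it lets a user fix every moved file in one pass.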
Learning from the extensive work on linked data, retention of file associations is an incredibly valuable tool, not only for navigating between objects but also for understanding the relationships between those objects [1]. To apply these linked data concepts in our own application, if a manifest has multiple columns that contain file paths to resources, our application additionally constructs a lookup table for those shared file associations and places the relevant portion of the lookup table in each file's metadata. An illustrative example of this behavior is when an imaging manifest has a column for full size images and a column for thumbnail images. For each row of the manifest, the application makes an entry in the Quilt package that pairs the full size image with the smaller thumbnail image by storing the read path to the other in the metadata for each member of the pair.
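A minimal sketch of such an association lookup, assuming a manifest with two file path columns, might look like this. The function name `build_associations`, the column names, and the paths are hypothetical, and the structure Q3D actually stores in the Quilt package may differ.

```python
def build_associations(rows, path_columns):
    """Return {file_path: {other_column: other_path}} for every manifest row.

    Hypothetical helper for illustration only; not part of Q3D or Quilt.
    """
    associations = {}
    for row in rows:
        for column in path_columns:
            # Store the paths of the other files in the same row,
            # keyed by the column they came from.
            associations[row[column]] = {
                other: row[other] for other in path_columns if other != column
            }
    return associations


# Hypothetical manifest rows pairing full size images with thumbnails.
rows = [
    {"full_size": "images/1.tiff", "thumbnail": "thumbs/1.png"},
    {"full_size": "images/2.tiff", "thumbnail": "thumbs/2.png"},
]

associations = build_associations(rows, ["full_size", "thumbnail"])
```

Each file then carries only its own portion of the lookup table, so a consumer holding the thumbnail can find the full size image without loading the whole manifest.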
Quilt3Distribute (Q3D) is an open-source Python application and additional details and documentation can be found at: https://github.com/AllenCellModeling/quilt3distribute. Q3D is currently used as a primary production system to distribute terabytes of imaging datasets from the Allen Institute for Cell Science.
Acknowledgements
We wish to thank the Allen Institute for Cell Science founder, Paul G. Allen, for his vision, encouragement, and support. This work could not have been completed without the additional support and input from all members of the Allen Institute for Cell Science modeling team.
References