BONSAMURAIS / reproducibility

Work space for the reproducibility working group for BONSAI hackathon2019

Formats for exchange of scientific data #4

Open bkuczenski opened 5 years ago

bkuczenski commented 5 years ago

We need a stable, consistent format for sharing input data required and deliverable results developed by the other working groups. This includes a couple of dimensions.

Review and evaluate the suitability and usefulness of candidate formats. Identify alternatives, and any objectives not met by these tools.

tmillross commented 5 years ago

Chris has stated somewhere that he's 100% confident we should use the Python Data Package library, which implements the Frictionless Data specifications. Bo gave his support to JSON-LD as the project's interchange format. JSON-LD is a subset of JSON, and we know the Data Package format is JSON-based.

I have seen no objections or alternative suggestions made in the mailing list or elsewhere, and also support the JSON direction myself. So a key question is - do the Frictionless Data Specs (and/or software) enable Bonsai's required linked-data / RDF (serialized as json-ld) requirements?

A closed issue and a related open issue on the datapackage repos give cause for doubt, as does this more recent question on a different repository, and this Frictionless Data case study, which states:

We would also like to see graph data packages developed as part of the Frictionless Data specifications...

I'm not yet sure the extent to which this is an issue. Any thoughts on this @cmutel, @bkuczenski or others?
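To make the linked-data requirement concrete, here is a minimal sketch of what a JSON-LD document looks like; the IRIs, property names, and values are invented for illustration and are not part of any BONSAI spec:

```python
import json

# Illustrative JSON-LD document (all names/IRIs are made up for this sketch).
# The "@context" is what maps plain JSON keys to RDF terms, which is why
# JSON-LD can be both ordinary JSON and a graph serialization at once.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "inputOf": {"@id": "http://example.org/inputOf", "@type": "@id"},
    },
    "@id": "http://example.org/flow/steel",
    "name": "steel",
    "inputOf": "http://example.org/activity/car-manufacture",
}

# Because JSON-LD is plain JSON, any JSON tooling can round-trip it losslessly.
serialized = json.dumps(doc)
assert json.loads(serialized) == doc
```

The open question above is whether a graph structure like this can be carried inside a Data Package, whose resources are fundamentally tabular.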

cmutel commented 5 years ago

For correspondence tables, I think the base format should be RDF (see my follow-up post), as it gives much more granularity. In general, though, I think it makes sense to provide data packages whenever possible: given the historical rate of change in the LCA software community, it is doubtful that LCA tools will be able to consume RDF anytime soon.

I don't think that there is any debate on using JSON-LD as the RDF standard, though we will need a spec. Having a versioned spec would also be super helpful for application writers.

So, once we have the RDF model locked down, I think it is conceptually pretty simple to get two-way conversion from RDF to CSVs (data package). As the professionals are struggling to embed one in the other, it doesn't make sense for us to attempt this as well. package.jsonld sounds cool, but the website is down (not a good sign...).
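The two-way conversion mentioned above can be sketched with the standard library alone; the records and field names here are invented for illustration, and a real converter would be driven by the agreed RDF model and JSON-LD context rather than a hard-coded field list:

```python
import csv
import io

# Invented JSON-LD-style records (hypothetical identifiers and fields).
records = [
    {"@id": "ex:flow/1", "name": "electricity", "unit": "kWh"},
    {"@id": "ex:flow/2", "name": "steel", "unit": "kg"},
]
fields = ["@id", "name", "unit"]

# Records -> CSV: each JSON object becomes one row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(records)

# CSV -> records: read the rows back and confirm the round trip is lossless.
buf.seek(0)
round_tripped = list(csv.DictReader(buf))
assert round_tripped == records
```

This only works cleanly for flat records; nested graph structure is exactly where embedding one format in the other gets hard, as noted above.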

tmillross commented 5 years ago

Thanks for the quick reply Chris. A few clarifications please:

the base format should be RDF

As I understand it, RDF is a framework rather than a format. And for our communications I think this distinction is quite important, to help people grasp the technologies we're using. This also relates to:

two way conversion from RDF to CSVs (data package).

...Should this say "from JSON-LD"?

it makes sense to provide data packages whenever possible, as given the historical rate of change in the LCA software community, it is doubtful that they will be able to consume RDF anytime soon.

If we provide "Data Packages" (capitals added, implying this spec), then we are not just providing CSVs, but also a JSON schema describing the structure of those CSVs (if my understanding is correct). Does existing LCA software actually support that?
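For reference, a Data Package in the Frictionless sense is a `datapackage.json` descriptor alongside the data files. Below is a minimal sketch of such a descriptor, built as a plain Python dict; the package name, resource path, and field names are invented for illustration:

```python
import json

# Minimal illustrative Tabular Data Package descriptor: the JSON schema
# describes the columns and types of the accompanying CSV file.
# All names and paths here are hypothetical, not part of any BONSAI spec.
descriptor = {
    "name": "correspondence-tables",
    "resources": [
        {
            "name": "flows",
            "path": "flows.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "flow_id", "type": "string"},
                    {"name": "flow_name", "type": "string"},
                    {"name": "unit", "type": "string"},
                ],
                "primaryKey": "flow_id",
            },
        }
    ],
}

# The descriptor serializes directly to datapackage.json.
datapackage_json = json.dumps(descriptor, indent=2)
assert json.loads(datapackage_json) == descriptor
```

So "supporting Data Packages" means reading the descriptor and validating the CSVs against its schema, not just parsing CSV files.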

If you mean the SimaPro CSV format (as mentioned on the video call yesterday), I think this does not qualify as a FrictionlessData.io Data Package. Could you clarify, please?

You also mentioned in the video call that you'd tasked someone with understanding the SimaPro CSV format... they could check out this repo, which may already contain much of the knowledge we need :)

cmutel commented 5 years ago

As I understand it, RDF is a framework rather than a format. And for our communications I think this distinction is quite important, to help people grasp the technologies we're using.

Sorry, thanks for correcting me - I am still appreciating the subtleties here.

Should this say "from JSON-LD"?

Yes.

If we provide "Data Packages" (capitals added, implying this spec), then we are not just providing CSVs, but also a JSON schema describing the structure of those CSVs (if my understanding is correct). Does existing LCA software actually support that?

Existing software doesn't support much, so I don't take this argument all that seriously. Data Packages are proposed as the standard for regionalized LCIA, and this standard can actually be used for all LCIA methods. They are awesome to work with, at least in my experience - just the right amount of flexibility. But this conversion is definitely lower down the priority list than getting the JSON-LD stuff settled.

You also mentioned in the video-call that you'd tasked someone to understand the SimaPro CSV format.

Yes, this is one of the reverse-engineering approaches. But Rutger is a SimaPro developer, so while it would be great for him to build on what GreenDelta has done, I guess he can also be productive starting from scratch. We will see!

bkuczenski commented 5 years ago

One of the virtues of the hackathon approach is that we will have concrete problems to shape our work. It's easy to get paralyzed wondering whether a given approach will work for a massive set of hypothetical use cases, but we will hopefully have a small number of concrete use cases that really matter.

It sounds like Data Packages are a good thing to have competency in, but I suspect it will be up to this working group (rather than the others) to attain compliance with those specs.

I will add a task to gain that competency; then, as soon as we are generating outputs, we will have a task to convert them into Data Packages for archival and testing.

bkuczenski commented 5 years ago

See issue #6 in this repo.