frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Create a reference dataset collection with representative examples of currently supported & planned data types #876

Closed · khughitt closed this 2 months ago

khughitt commented 5 months ago

In order to help us think clearly about exactly what types of data Data Package is intended to support, either presently or in the future, it could be helpful to create a repo with example datasets of different types.

This will help when thinking about what the spec should look like to properly represent all of the intended types of data, and it will also provide a useful resource for writing test code.

For new users coming to Frictionless and wondering whether it supports their data type, this could also be a good way to get started.

Repo structure

My first thought was to organize the repo by data type/modality (e.g. table, image, spectral, etc.), but actually, it might be better to do it by domain?

This way things are grouped together logically, in a way one is more likely to encounter them in the wild, and it would allow people working in the various domains to see, at a glance, which of the data types they tend to work with are represented.

```
astro/
bio/
  biodiversity/
  omics/
    rna/
      rnaseq/
        fastq/
          sample1.fq
          datapackage.yml
          README.md
        counts/
          counts.tsv
          datapackage.yml
          README.md
    multiomics/
econ/
earth/
finance/
etc/
```

There are obviously a bunch of different ways one could organize things. There may also be existing taxonomies of data types / domains that we could work off of.

I would be careful not to worry too much about getting "the" right structure, because I don't think there is going to be a single one that works well for everything. Instead, let's just get something started, and then iterate on it and improve it as our understanding of the intended scope of the project evolves.

Dataset directory contents

How to go about creating the repo?

Possible approach:

  1. each working group member could come up with a list of data types they think are relevant, and a possible directory structure/naming convention for how they can be organized
  2. we can combine these into a single "fake" directory listing to show what the combined structure would look like
  3. meet & discuss / decide if changes need to be made
  4. once we are happy with the planned structure, a repo can be created, and users can start adding representative datasets with the appropriate licensing.
    • it's probably also worth adding a script/hook to automatically generate an index/ToC for all of the datasets (a sketch follows below).
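
A minimal sketch of what such an index generator could look like, assuming PyYAML and the directory layout proposed above (the file names and descriptor keys here are just assumptions, not a settled convention):

```python
# generate_index.py -- hypothetical index/ToC generator for the reference collection (sketch only)
from pathlib import Path

import yaml  # PyYAML, assumed available


def build_index(root: Path) -> str:
    """Walk the collection, read each datapackage.yml, and emit a Markdown index."""
    lines = ["# Reference datasets", ""]
    for descriptor in sorted(root.rglob("datapackage.yml")):
        package = yaml.safe_load(descriptor.read_text())
        name = package.get("name", descriptor.parent.name)
        title = package.get("title", "")
        rel = descriptor.parent.relative_to(root)
        lines.append(f"- `{rel}`: {name} {title}".rstrip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    Path("INDEX.md").write_text(build_index(Path(".")))
```

Something like this could be run from a pre-commit hook or a CI job so the index never drifts out of sync with the datasets.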

Other considerations

@khusmann I couldn't pull up your comments from our Slack discussion about this a few months back, but I know you had some different ideas on this so please feel free to comment/share how you were thinking about this.

Anyone else is welcome to chime in, too, obviously. This is really just intended to get the ball rolling.

khughitt commented 5 months ago

A couple of follow-up thoughts:

Another benefit of having a reference dataset collection like this is that, for all of the "supported" datasets for which the repo contains a working data package, one could easily write a script to traverse the directory structure, parse the datapackage.yml files, and create various "views" into the collection. For example, we could find all image datasets and create a page listing them.
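
As a hedged sketch of one such "view" (again assuming PyYAML; the `type` resource property used for filtering is purely illustrative, since how datasets would be tagged is still an open question):

```python
# list_image_datasets.py -- sketch of a "view" over the collection (property names hypothetical)
from pathlib import Path

import yaml  # PyYAML, assumed available


def datasets_with_resource_type(root: Path, wanted: str):
    """Yield dataset directories whose datapackage.yml declares a resource of the given type."""
    for descriptor in sorted(root.rglob("datapackage.yml")):
        package = yaml.safe_load(descriptor.read_text())
        # "type" is a stand-in for whatever tagging convention the working group settles on
        if any(r.get("type") == wanted for r in package.get("resources", [])):
            yield descriptor.parent


if __name__ == "__main__":
    for directory in datasets_with_resource_type(Path("."), "image"):
        print(directory)
```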

If enough of the reference data packages encoded information about the variables represented by the different axes of the data, then one could also group together datasets indexed by shared variables, or construct a network depicting the relationships between the reference datasets.

This could also fit in nicely with efforts to create useful React, etc. components for rendering data packages.

khusmann commented 4 months ago

Thanks for spearheading this issue @khughitt!

I don't have much to add, except I like the idea of making it possible to create "views" into the collection! We might also facilitate this with some custom field props that would allow us to tag / label fields for aggregation in different ways.
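
As a rough illustration (not part of the spec, and the property names are entirely made up), a field descriptor carrying such a tag might be sketched as:

```python
# Hypothetical field descriptor with a custom tagging property (sketch only)
field = {
    "name": "age",
    "type": "integer",
    # a made-up custom property for aggregation tags; not defined by the Data Package spec
    "tags": ["demographics", "survey"],
}

# a "view" could then group fields or datasets by these tags
print(field.get("tags", []))  # ['demographics', 'survey']
```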

peterdesmet commented 4 months ago

While I like the idea of showing what data/domains Data Package can support, I'm worried about maintenance. In my experience, if the owner of an uploaded example dataset leaves the project, it is typically orphaned, with the remaining maintainers not having enough familiarity to know why it was added and how to maintain it.

  1. I would prefer linking out to example datasets per domain, which are maintained elsewhere (e.g. on Zenodo). It's a lot easier to maintain and indicates to users that we're not responsible for them.
  2. I really think we should have a number of semi-artificial test datasets. These should include a Frictionless v1 dataset (for backward compatibility) and a dataset with all of Data Package's (newest) features. Any PR for a spec update should trigger an automated test to see if the current datasets still validate (cf. these rules). We use this approach for Camtrap DP and it has proven very useful: https://github.com/tdwg/camtrap-dp/blob/main/.github/workflows/validate-example-current-schema.yml
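
A minimal sketch of such an automated check, assuming the frictionless Python library is installed in CI and the test datasets live under a hypothetical examples/ directory:

```python
# validate_examples.py -- sketch of a CI check that the example datasets still validate
import sys
from pathlib import Path

from frictionless import validate  # frictionless-py, assumed installed in the CI environment


def main() -> int:
    failed = False
    for descriptor in sorted(Path("examples").rglob("datapackage.json")):
        report = validate(str(descriptor))
        if not report.valid:
            failed = True
            print(f"INVALID: {descriptor}")
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main())
```

Run on every pull request (as in the Camtrap DP workflow linked above), this would catch spec changes that break existing examples.
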
khughitt commented 4 months ago

The main motivation is not so much to show what kinds of data Data Package supports (that's sort of a bonus side effect); rather, the goal is to help us think clearly about the intended scope of the project and, as much as is possible, to figure out ahead of time what types of structures are going to give us the greatest representative power down the road.

My worry is that, while the original aim of the project (and one that I think is achievable) is a truly abstract container for data of all types, we might end up, due simply to a biased set of viewpoints among the early drivers of the spec / an emphasis on one particular type of data (tabular data), with something that is really great for tables, and perhaps more cumbersome / less suitable for some other data types.

I think your points are valid though and I think the second suggestion (creating a set of test datasets) is a much more reasonable goal.

khughitt commented 4 months ago

Also, in the original issue description I mixed together two ideas that I think it would be helpful to explicitly separate out:

  1. data "structure" (e.g. table, image, audio, spec, etc.) vs.
  2. data domain (econ, bio, climate, survey, etc.)

(A third consideration might be file format, e.g. CSV, Parquet, FITS, HDF5, etc., but that is related to data structure and is also easier to modify support for down the road.)

I think both are useful to think about and try and plan for, but the first one will probably have a larger impact on the frictionless codebase and specs, and likewise, be harder to change once we have gone too far down the road with some particular set of assumptions about what data looks like.

khusmann commented 4 months ago

I really like the idea of test / synthetic datasets to exercise Frictionless features as well as provide examples. Regarding data generation, I have some bits of code for that here that could be adapted. I also have an (as yet unpublished) TypeScript version.
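
Not a substitute for that code, but as a minimal sketch of the idea, generating one synthetic table plus a descriptor could look roughly like this (all names are hypothetical):

```python
# make_synthetic.py -- sketch of generating a tiny synthetic dataset with a descriptor
import csv
import json
import random
from pathlib import Path

out = Path("synthetic/basic-table")
out.mkdir(parents=True, exist_ok=True)

# write a small random table
with (out / "data.csv").open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(10):
        writer.writerow([i, round(random.random(), 3)])

# write a minimal descriptor pointing at it
descriptor = {
    "name": "basic-table",
    "resources": [{"name": "data", "path": "data.csv", "format": "csv"}],
}
(out / "datapackage.json").write_text(json.dumps(descriptor, indent=2))
```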