frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Schema for (multidimensional) array #767

tamaracha closed 7 months ago

tamaracha commented 2 years ago

Hello,

I am working in the field of research data management at the University of Marburg (Germany), and the architecture of Frictionless Data looks very reasonable to me. We are facing the challenge of having to migrate lots of diverse clinical, neuroimaging, and behavioral data into a unified representation. I think this might be a desirable long-term goal, but it sounds more reasonable to first describe the data adequately via specs, and then implement spec-based transforms and pipelines to access and evaluate the data. Existing data could be migrated step by step without too much time pressure. So I'm very interested in contributing to Frictionless because I appreciate the idea behind it.

What I am missing is a schema for n-dimensional arrays to describe dense data. By dense data, I mean data in which most values are nonzero. A matrix would be a 2D array as a special case. Physically, an array is a sequence of values decorated with a list of dimension descriptors. Each dimension extends to a given length, so the product of these lengths has to equal the length of the values sequence. Logically, values are accessed via a combination of indices, one per dimension. Although the same information could also be represented by tables with primary keys, arrays are physically more space-efficient for dense data, and in their logical/abstract form they are more appropriate for certain algorithms, e.g. linear algebra. The array is the abstraction for concrete formats like images, brain scans, game boards etc.
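
To make the structure above concrete, here is a minimal Python sketch. It is not part of any existing Frictionless spec: the names `Dimension`, `DenseArray`, `flat_index` and `get` are hypothetical, and row-major (last index varies fastest) ordering is an assumption, since a real spec would have to state the storage order explicitly.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass(frozen=True)
class Dimension:
    """Hypothetical dimension descriptor: a name and a length."""
    name: str
    length: int


@dataclass(frozen=True)
class DenseArray:
    """A flat sequence of values decorated with dimension descriptors."""
    dimensions: Sequence[Dimension]
    values: Sequence[float]

    def __post_init__(self) -> None:
        # The product of the dimension lengths must equal the number of values.
        expected = prod(d.length for d in self.dimensions)
        if expected != len(self.values):
            raise ValueError(
                f"expected {expected} values, got {len(self.values)}"
            )

    def flat_index(self, *indices: int) -> int:
        """Map one index per dimension to a position in the flat sequence.

        Assumes row-major order (an assumption of this sketch).
        """
        if len(indices) != len(self.dimensions):
            raise IndexError("need exactly one index per dimension")
        offset = 0
        for index, dim in zip(indices, self.dimensions):
            if not 0 <= index < dim.length:
                raise IndexError(f"index {index} out of range for {dim.name}")
            offset = offset * dim.length + index
        return offset

    def get(self, *indices: int) -> float:
        """Logical access via a combination of indices."""
        return self.values[self.flat_index(*indices)]


# A 2 x 3 matrix as the 2D special case.
matrix = DenseArray(
    dimensions=[Dimension("row", 2), Dimension("column", 3)],
    values=[1, 2, 3, 4, 5, 6],
)
print(matrix.get(1, 0))  # second row, first column
```

A table with a composite primary key would need to store the row and column indices alongside every value, whereas here the indices are implied by position, which is where the space savings for dense data come from.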

Regarding arrays as an important basic data structure, I'd like to contribute a spec as a foundation for more concrete custom use-cases like image processing or neuroimaging data. I couldn't find any discussions concerning this topic in the issue tracker or discussion forum. Are there any reasons against this? Does it go beyond the scope of Frictionless? My rationale would be that it deals with data representation, it's about metadata describing how data is stored, and data representation should be broken down into abstract and concrete concepts for modularity, interoperability, and re-usability. I can send a proposal via PR when a first draft is ready.

Thanks for your attention and sorry for the lengthy explanations.

rufuspollock commented 2 years ago

> Regarding arrays as an important basic data structure, I'd like to contribute a spec as a foundation for more concrete custom use-cases like image processing or neuroimaging data. I couldn't find any discussions concerning this topic in the issue tracker or discussion forum. Are there any reasons against this? Does it go beyond the scope of Frictionless? My rationale would be that it deals with data representation, it's about metadata describing how data is stored, and data representation should be broken down into abstract and concrete concepts for modularity, interoperability, and re-usability. I can send a proposal via PR when a first draft is ready.

This would be very welcome. There isn't currently anything for n-dim arrays, and such a spec would be useful.