data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/

Using a neutral format to have lossless interface between multidimensional tools #786

Closed loco-philippe closed 2 months ago

loco-philippe commented 2 months ago

Do you think that the work described below can be associated with the discussions carried out by data-apis?

I have proposed a neutral format for describing and sharing multidimensional data (see Jupyter notebook, GitHub repository, PyPI package).

Its use allows reversible interfaces (round trip without loss) between tools. The examples discussed are as follows:

The notebook shows, for example, that we can losslessly convert a scipp object to an Xarray dataset, or convert it to JSON format.

It also handles easily exchangeable lightweight structures (only metadata pointing to URIs to access data stored in independent environments). Data typing is based on the semantic types defined by the NTV format.

The package is built on numpy.ndarray.
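
To make "round trip without loss" concrete, here is a minimal sketch using only NumPy and the standard library. It is not the NTV-NumPy API: the helper names `to_json` and `from_json` and the JSON layout (`dtype`, `shape`, `dims`, `data`) are assumptions chosen purely for illustration.

```python
import json

import numpy as np


def to_json(arr, dims):
    """Serialize an ndarray and its dimension names (hypothetical layout)."""
    return json.dumps({
        "dtype": arr.dtype.str,        # e.g. "<i4"; keeps the exact dtype
        "shape": list(arr.shape),      # needed to rebuild the N-d structure
        "dims": list(dims),            # axis labels, which ndarray does not store
        "data": arr.ravel().tolist(),  # flat list of values in C order
    })


def from_json(text):
    """Rebuild the ndarray and its dimension names from the JSON string."""
    obj = json.loads(text)
    arr = np.array(obj["data"], dtype=obj["dtype"]).reshape(obj["shape"])
    return arr, tuple(obj["dims"])


original = np.arange(6, dtype=np.int32).reshape(2, 3)
restored, dims = from_json(to_json(original, ("x", "y")))

# The round trip preserves values, shape, dtype, and the axis labels.
assert np.array_equal(restored, original)
assert restored.dtype == original.dtype
```

The point of the sketch is only that everything needed to rebuild the object travels in the neutral representation, so converting back requires no guessing.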

A second version will integrate tabular representations (integration of the NTV-TAB format and the [NTV-pandas](https://github.com/loco-philippe/ntv-pandas) package) and associated interfaces (for example pandas).

The first version (alpha) of the package will be completed based on the use cases that will be expressed.

Thank you in advance for your feedback (GitHub issues and [discussions](https://github.com/loco-philippe/ntv-numpy/discussions) are enabled)!

Note: This proposal has also been shared with the affected tools (via issues).

kgryte commented 2 months ago

@loco-philippe Thank you for reaching out. The NTV format initiative you propose is certainly interesting work; however, I don't think we're likely to take it up at this stage. As a standardization body, we primarily focus on well-established art within the Python ecosystem.

I think your best bet, for the time being, is to continue to engage individual communities (e.g., NumPy, pandas, Xarray, PyTorch, etc.), as you are already doing. If the NTV format achieves widespread adoption, it could eventually become a standardization candidate and something in which we'd engage. But given the project's early stages, I think we are a ways out from that.

rgommers commented 2 months ago

Also, I/O is the first topic mentioned as explicitly out-of-scope: https://data-apis.org/array-api/latest/purpose_and_scope.html#out-of-scope. We also haven't considered things like Zarr, Parquet & co. So while work on data formats is in general of interest to the community, I think it's not the best fit for this standard.

If it's about the in-memory data exchange, the features needed by Xarray & co that go beyond what DLPack offers (e.g., labeled axes) aren't part of the standard.
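
As a rough illustration of that boundary, the sketch below exchanges an array through DLPack using `np.from_dlpack` (available in NumPy 1.22 and later). NumPy acts as both producer and consumer here only to keep the example self-contained; the variable names are arbitrary.

```python
import numpy as np

# Producer: any object implementing __dlpack__ and __dlpack_device__.
a = np.arange(6.0).reshape(2, 3)

# Consumer: import the same buffer through the DLPack protocol.
b = np.from_dlpack(a)

# What travels with the exchange: the buffer, its dtype, and its shape.
print(b.shape, b.dtype)
print(np.shares_memory(a, b))

# What does not: there is no slot for axis names or coordinates,
# so the consumer sees a plain array with no "dims" attribute.
print(hasattr(b, "dims"))
```

Labels such as dimension names would have to be carried alongside the array by some other mechanism, which is the part that falls outside the standard.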

loco-philippe commented 2 months ago

@rgommers, @kgryte, thank you for taking the time to respond.

In fact, the proposed topic only concerns the structure of multidimensional and tabular data.

When we compare the data models of the main tools, we observe differences that make the interfaces more complex.

The concepts to which I refer complement those defined at the level of Array-API (dtype, ndim, shape, size) and DataFrame-API (column):

It seems to me that we could converge on common concepts which do not call into question the existing implementations and which would facilitate exchanges (the tool developed shows that this convergence is possible and that this gives more complete interfaces than those existing).

My question was rather to know if work was underway on these notions (which seem to me to be within the scope of data-apis) and if not if you think that this could be of interest to data-API.

Have a nice day

rgommers commented 2 months ago

> My question was rather to know if work was underway on these notions (which seem to me to be within the scope of data-apis) and if not if you think that this could be of interest to data-API.

There is not. All this seems out of scope for the array API standard. I agree it could in principle fit under the Data APIs umbrella, but it's clearly separate from plain arrays/tensors.

rgommers commented 2 months ago

Given that the questions are answered, I'll go ahead and close this issue as "interesting, but out of scope for this project". Thanks @loco-philippe for the interest.