TheClimateCorporation / mandoline

A distributed, versioned, multi-dimensional array database

core.matrix support #3

Open mikera opened 10 years ago

mikera commented 10 years ago

Hello,

Mandoline looks like a great library for managing multi-dimensional array data!

Would it be possible to extend mandoline to provide core.matrix support, so that a mandoline data set can be used via the core.matrix API?

Advantages:

See: https://github.com/mikera/core.matrix

I'm the maintainer of core.matrix and am happy to make changes as needed to improve interoperability with libraries like mandoline. I'm also happy to give input and guidance on how to make this work efficiently.

Mike.

monodeldiablo commented 10 years ago

Hey Mike!

UCAR's Array was chosen more out of convenience than necessity, since our primary input data format is NetCDF. Using the same library (and thus, in-memory format) as our ingest processes means we avoid unnecessary copy overhead (most of the time).

Nevertheless, interoperability is the name of the game. And your help is much appreciated.

How deep into Mandoline do you see core.matrix support extending? Do you envision a view on top of Mandoline's existing Slab type, or something deeper?

mikera commented 10 years ago

That sounds fine: core.matrix is designed to enable use of different underlying array types.

It's probably most convenient just to extend the core.matrix protocols to the UCAR array type, the Slab type, or both. That way you would get full core.matrix API support essentially for free. This is the approach that Clatrix (https://github.com/tel/clatrix) takes, for example.

Most of the protocols are optional: you only need to implement a small number to get the API working. The rest exist to improve performance and can be added later.
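To make the "small mandatory set plus optional extras" idea concrete, here is a rough, language-neutral sketch in Python. It is not the core.matrix API and not Mandoline's Slab: `ArrayBackend`, `SlabView`, `FastSlabView`, `get_nd` and `element_sum` are all hypothetical names, and the flat row-major buffer is an assumption made only for illustration. It shows a backend that implements just shape and indexed access, gets a generic operation for free, and can later override that operation for speed.

```python
from abc import ABC, abstractmethod
from itertools import product
from typing import Iterable, Sequence, Tuple


class ArrayBackend(ABC):
    """Hypothetical stand-in for a small set of mandatory protocols."""

    @abstractmethod
    def shape(self) -> Tuple[int, ...]:
        """Return the size of each dimension."""

    @abstractmethod
    def get_nd(self, index: Sequence[int]) -> float:
        """Return the element at an N-dimensional index."""

    def element_sum(self) -> float:
        # Optional operation: this generic fallback uses only the two
        # mandatory methods, so every backend gets it without extra work.
        ranges = (range(n) for n in self.shape())
        return sum(self.get_nd(ix) for ix in product(*ranges))


class SlabView(ArrayBackend):
    """Hypothetical view over a flat row-major buffer (an assumption)."""

    def __init__(self, data: Iterable[float], shape: Sequence[int]):
        self._data = list(data)
        self._shape = tuple(shape)

    def shape(self) -> Tuple[int, ...]:
        return self._shape

    def get_nd(self, index: Sequence[int]) -> float:
        # Convert the N-dimensional index to a row-major flat offset.
        offset = 0
        for i, n in zip(index, self._shape):
            offset = offset * n + i
        return self._data[offset]


class FastSlabView(SlabView):
    """Same backend with one optional operation overridden for speed."""

    def element_sum(self) -> float:
        # Skips per-element index arithmetic by summing the buffer directly.
        return sum(self._data)


if __name__ == "__main__":
    view = SlabView(range(6), (2, 3))
    print(view.get_nd((1, 2)))       # 5
    print(view.element_sum())        # 15
    print(FastSlabView(range(6), (2, 3)).element_sum())  # 15
```

The point of the design is that `SlabView` is already usable through the full generic API, while `FastSlabView` shows how a performance-oriented override can be added afterwards without changing callers.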

P.S. You'll probably find that copy overhead is trivial compared to the IO cost of the query. The only time I've found copy overhead to matter is in extremely CPU-bound numerical code.