JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

[QUESTION] Integration of Zarr with kartothek #119

Open lewismc opened 5 years ago

lewismc commented 5 years ago

Hi Folks,

First and foremost, I've spent the last few hours working my way through your blog posts, RTD documentation, and source code documentation, and I really like where you're going with kartothek. Thank you so much for open-sourcing this project. Having a bias for the JVM, I've been interested in Apache Iceberg for some time, but Python is where we need to be to process the N-dimensional array data we work with at NASA (Earth-orbiting remote sensing missions, which natively produce write-optimized netCDF4/5 and HDF5/EOS), so kartothek was a really pleasant surprise.

I wanted to ask whether anyone here has been working with Zarr? I envision kartothek as the dataset and table management layer, with the actual arrays stored as Zarr, which is highly read- and analysis-optimized... basically analysis-ready (in something like Dask)! A rough sketch of what I mean is below.
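To make the idea concrete, here is roughly the consumption side I'm imagining (the store path and variable are made up for illustration; the point is just that Dask can read a chunked Zarr array lazily while kartothek tracks which stores/partitions exist):

```python
import dask.array as da

# Hypothetical Zarr store holding one mission variable; kartothek would
# manage the dataset/partition metadata, while Dask consumes the chunked
# arrays directly for analysis.
sst = da.from_zarr("s3://mission-bucket/sea_surface_temp.zarr")

# Lazy, chunk-wise reduction over the time axis; nothing is read from
# storage until .compute() is called.
time_mean = sst.mean(axis=0)
result = time_mean.compute()
```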

Thanks again for sharing this under a permissive license... excellent work.

fjetter commented 5 years ago

First of all, I really appreciate your interest. Especially for a young open source project, this is highly encouraging!

I completely understand your bias towards the JVM. There are so many powerful technologies in that ecosystem, and I don't want to lock us out of them. In fact, compatibility with other technologies and integration into the community were among the big drivers for us to put kartothek out in the open. Apache Iceberg is something we're also actively looking into for that reason.

We're mostly working with Apache Parquet as a file format, but we built an interface to incorporate other formats as well. Have a look at https://github.com/JDASoftwareGroup/kartothek/blob/0d9483ce83c2af1d7d1a821d54106b73ed226dbb/kartothek/serialization/_generic.py#L18, where it should be possible to implement an HDF serializer. Maybe that already helps you get started with your data.
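To illustrate the shape of it, something like the sketch below could work. The method names and signatures here are simplified, so please check the linked `_generic.py` for the actual abstract interface; the temp-file round-trip is only because pandas writes HDF5 to file paths (via PyTables) rather than byte buffers, and the `put`/`get` calls assume a simplekv-style key-value store.

```python
import tempfile

import pandas as pd

from kartothek.serialization._generic import DataFrameSerializer


class HDFSerializer(DataFrameSerializer):
    """Illustrative sketch only: round-trips DataFrames through pandas'
    HDF5 support. Signatures are simplified relative to the real base
    class in kartothek/serialization/_generic.py."""

    def store(self, store, key_prefix, df):
        # pandas/PyTables can only write HDF5 to a path, so buffer
        # through a temporary file before handing bytes to the store.
        with tempfile.NamedTemporaryFile(suffix=".h5") as tmp:
            df.to_hdf(tmp.name, key="data", mode="w")
            tmp.seek(0)
            key = key_prefix + ".h5"
            store.put(key, tmp.read())  # simplekv-style store assumed
        return key

    @classmethod
    def restore_dataframe(cls, store, key, **kwargs):
        # Inverse of store(): fetch the bytes, write them to a temp
        # file, and let pandas parse the HDF5 container.
        with tempfile.NamedTemporaryFile(suffix=".h5") as tmp:
            tmp.write(store.get(key))
            tmp.flush()
            return pd.read_hdf(tmp.name, key="data")
```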

Regarding Zarr, I must shamefully admit that this is the first time I'm hearing about it. I'll look into it and come back to you. If you already have something specific in mind, feel free to share your ideas.

lewismc commented 5 years ago

Thanks for the response @fjetter

> We're mostly working with Apache Parquet as a file format, but we built an interface to incorporate other formats as well.

That makes sense. This is also reflected in Iceberg, with additional 'native' support for Avro and ORC. I have been working with Avro for a long time, but find that its use within the scientific Python community is very low. I suspect this is due to the huge overhead of copying and persisting huge N-dimensional arrays into Avro files; it is just not very practical. Many of our datasets are simply too large for serializing everything in Avro to be a possibility at this stage, and the format has not seen much uptake in the community.

> where it should be possible to implement an HDF serializer.

Thank you for this pointer. I saw the DataFrameSerializer when I was reading the source and was thinking about the practicalities of mapping from Zarr (similar to NumPy) to DataFrame. Can you clarify one thing for me? The documentation states "...When working with kartothek tables as a Python user, we will use DataFrame as the user-facing type." Should the latter part of this statement read "...a DataFrame is provided as the user-facing type"? That would clarify whether DataFrame is currently the only user-facing type. The reason I ask is that I am trying to quantify the complexity involved in facilitating the Zarr --> DataFrame mapping (Zarr being where I can address my data engineering tasks such as chunking, compression, etc.) so that kartothek can handle dataset management for these slow-moving Earth science datasets. A sketch of the kind of mapping I have in mind follows.
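For reference, this is the kind of mapping I am trying to scope, using a toy 2-D array standing in for a gridded variable (in practice the dimensions, chunking, and column names would come from our netCDF/HDF products, so everything here is illustrative):

```python
import numpy as np
import pandas as pd
import zarr

# Toy chunked Zarr array standing in for a 2-D gridded measurement
# (e.g. a lat/lon slice of one variable from a remote sensing product).
z = zarr.array(np.random.rand(4, 6), chunks=(2, 3))

# Tidy/long-form mapping: one row per grid cell, with the grid indices
# kept as explicit columns so kartothek could partition and index on them.
lat_idx, lon_idx = np.indices(z.shape)
df = pd.DataFrame(
    {
        "lat_idx": lat_idx.ravel(),
        "lon_idx": lon_idx.ravel(),
        "value": z[:].ravel(),
    }
)
```

The open question for me is whether this cell-per-row expansion is acceptable for very large arrays, or whether a chunk-per-row or reference-based mapping would be needed.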

Thanks very much for any comments you have.