alxmrs / xarray-sql

An experiment to query Xarray datasets with SQL
Apache License 2.0
24 stars 3 forks source link

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

Open alxmrs opened 7 months ago

alxmrs commented 7 months ago

https://arrow.apache.org/docs/python/filesystems.html

Like Kerchunk, it should be possible to create a virtual parquet dataset from an existing Zarr store. Here, it would also provide a table-like view of a dense array — like the rest of qarray.

Ideally, this could be persisted in a file system without creating extra space (just metadata). Then, one could use tools like Pandas, DuckDB, Apache Arrow, etc., to query raster data as tables, maybe even outside of Python.

My bet here is that the LOC to prototype this is going to be small. Getting it performant is another matter. I wonder if this would be a justification for BigQuery or other similarly high powered query engine? BQ reads parquet!

alxmrs commented 7 months ago

cc: @cisaacstern

alxmrs commented 5 months ago

I’ve been doing some background reading on parquet. Here are some useful things I’ve learned:

alxmrs commented 5 months ago

The thrift definition is especially helpful to read.

https://github.com/dask/fastparquet/blob/main/fastparquet/parquet.thrift

Some initial thoughts:

alxmrs commented 5 months ago

Reading the Data Page V1 algorithm in these sources (https://github.com/dask/fastparquet/blob/main/fastparquet/core.py), I think most Zarr would be represented in a Plain Thrift encoding.

alxmrs commented 5 months ago

I've just started chatting with @martindurant in a side channel (Martin, I'm happy to bring the discussion here). He has a few thoughts on the matter:

I'm really happy to have a vote of confidence on this approach from a domain expert!

alxmrs commented 5 months ago

I've discussed how to surface Zarrquet to databases with Martin a bit more. Here's the best path forward from the discussion:

jatorre commented 4 months ago

So, hi all, let me jump in here, because I think we are going on the same direction...

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Or in a virtual way, you are thinking on creating a "virtual" parquet file with the contents of a geozarr. And then BigQuery using a external table (parquet virtual tables), you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

martindurant commented 4 months ago

Yes, this is totally doable, to make a python file-like where the buffers are fetched from zarr chunks, and the parquet thrift metadata are made dynamically at runtime (or, optionally, stored elsewhere, which might be the easier way to start).

BigQuery using a external table

How does one persuade bigquery to read from a python file? We can get pyarrow or even duckdb to do this, but ...

xarrays data encoded.

Meaning simple things like gzip/zstd compression?

martindurant commented 4 months ago

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

alxmrs commented 4 months ago

you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

Yes, this is the end game.

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

Yes — and the big unknown for me is the cloud architecture question. How do you build a proxy GCS bucket? I was imagining an elastic load balancer would play a key role.

jatorre commented 4 months ago

I am thinking less virtual in a way, what about the other no dynamic way.

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

As a static file.

alxmrs commented 4 months ago

This idea has a lot of merit. We did something similar over here: https://github.com/google/weather-tools/blob/main/weather_mv/README.md

The best place for the static approach (copying Zarr into Parquet) would be as an output sink for weather-mv.

In fact, this repo was inspired by this weather tool. The goal here is to find a way to provide a similar experience to the one offered by weather-mv bq without doubling storage or waiting for a large ETL job to finish.

alxmrs commented 4 months ago

So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

I have began thinking this is possible! It looks like there is room in Parquet’s thrift format to point to Zarr array blobs as the backing data.

I have some notes about this around here: https://github.com/alxmrs/xarray-sql/issues/4#issuecomment-2002371311

martindurant commented 4 months ago

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

You can store the thrift serialised objects in file(s) rather than make them on the fly; but no, parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

martindurant commented 4 months ago

Also, I was chatting about this yesterday: it is totally possible to store N-D arrays in parquet despite its columnar nature: a 3D array would be list(list(number)), and the required rep/def levels would be easy to generate, as they would be regularly repeating. In V2 pages, the buffer is compressed separately from the rep/defs, so streaming data from chunks would work fine. The question is, do any parquet frameworks expect this sort of data?

martindurant commented 4 months ago

Also work noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialised thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.

jatorre commented 4 months ago

I am a bit lost here since I thought we would be able to store the arrays as arrow.

Not the same but geoarrow is doing this for vector data. I assumed using nested arrays in arrow was going to be no problem.

On Wed, 10 Apr 2024 at 06:44, Martin Durant @.***> wrote:

Also work noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialised thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.

— Reply to this email directly, view it on GitHub https://github.com/alxmrs/xarray-sql/issues/4#issuecomment-2047585009, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7GO4PISGMYN65MOPZ2G3Y4U62BAVCNFSM6AAAAABCGQTZDWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANBXGU4DKMBQHE . You are receiving this because you commented.Message ID: @.***>

martindurant commented 4 months ago

I am a bit lost here since I thought we would be able to store the arrays as arrow.

What do you mean by "arrow" here? Parquet is not arrow; feather2 essentially is.

Not the same but geoarrow is doing this for vector data. I assumed using nested arrays in arrow was going to be no problem.

Yes, the parquet format can store nested data, which would include geo structures (like list of points -> polygon, for each row). But this is presumably not what you have in zarr, right? Zarr does not have a concept of variable-length structures in each array position, except for string, JSON and pickle encoding (i.e., python objects, not what you want)

So: please show me some example zarr data you'd like to view, and I'll see if I can make a python file-like to hand to pyarrow. But I'll be using fastparquet to make the thrift, since I don't believe there's any other way.

alxmrs commented 4 months ago

parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

Thanks Martin. That saves me from having to discover this first hand! Do you know of any format that would be a good fit, that allows for extra file storage? Feather, Avro, ORC, maybe?

The question is, do any parquet frameworks expect this sort of data?

Yes, this is really interesting! I agree, though, I don't see a use case for it for me in the near term.

I assumed using nested arrays in arrow was going to be no problem.

In a larger context, I see this work as largely experimental. It very well could not pan out as I find more information! Take these plans with a grain of salt.

martindurant commented 4 months ago

Do you know of any format that would be a good fit, that allows for extra file storage?

Sorry, no

alxmrs commented 3 months ago

Given https://github.com/zarr-developers/zarr-specs/issues/287, it looks like future versions of Zarr will allow an fsspec interface beyond Python. It seems like a proxy server won’t be required.

alxmrs commented 2 months ago

Pathway to Postgres integration: https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse#overview

martindurant commented 2 months ago

You want postgres to query zarr data chunks with a parquet dynamic ephemeral intermediate representation? Actually, teaching duckDB about zarr might be simpler :) .

alxmrs commented 1 month ago

@TomNicholas I just saw your tweet related to “headless data” and Pangeo. https://x.com/TEGNicholasCode/status/1813299299201294583

I wanted to reply: I think something like “Zarrquet” could be a good fit for the SQL interoperability box for pangeo.

TomNicholas commented 1 month ago

Thanks for tagging me @alxmrs - I think this also relates closely to something @manzt and I were talking about at SciPy. Basically using SQL tools to analyze Zarr stores on a chunk-level