Clay-foundation / model

The Clay Foundation Model (in development)
https://clay-foundation.github.io/model/
Apache License 2.0

ZSTD compression problematic #155

Closed · mattpaul closed 1 month ago

mattpaul commented 5 months ago

When exporting embeddings to parquet files we currently use ZSTD compression: https://github.com/Clay-foundation/model/blob/3f4210ff3160240e73c4d3541962032afef957db/src/model_clay.py#L1002

However, ZSTD compression is not widely supported. Specifically, the version of pyarrow packaged in the AWS SDK for Pandas is not built with support for ZSTD, which yields the following error:

type(err)=<class 'pyarrow.lib.ArrowNotImplementedError'>: 
err=ArrowNotImplementedError("Support for codec 'zstd' not built")

This is proving problematic for the Clay vector service at read time.
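For anyone hitting this, a quick way to check which codecs a given PyArrow build supports (a minimal sketch, independent of the service code):

import pyarrow as pa

# Codec availability depends on how PyArrow was compiled; the build bundled
# with the AWS SDK for Pandas Lambda layer reportedly lacks zstd.
print(pa.Codec.is_available("zstd"))    # False on builds without the codec
print(pa.Codec.is_available("snappy"))  # snappy ships with standard builds
print(pa.Codec.is_available("gzip"))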

GeoPandas docs for GeoDataFrame.to_parquet state that the following compression algorithms are available:

compression {‘snappy’, ‘gzip’, ‘brotli’, None}, default ‘snappy’
  Name of the compression to use. Use None for no compression.

@yellowcap - I'd like to request that we switch to a more widely supported compression, such as the default snappy or gzip, to make reading and working with Clay embeddings easier for a wider audience, such as consumers of the AWS SDK for Pandas like the Clay vector service 🤓
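For illustration, a minimal sketch of re-exporting an existing embeddings file with snappy instead (file names are hypothetical, and this assumes a local PyArrow build that can still read zstd):

import geopandas as gpd

# Read a zstd-compressed embeddings file with a full PyArrow build,
# then rewrite it with the more widely supported snappy codec
# (GeoPandas defaults to snappy when compression is not specified).
gdf = gpd.read_parquet("embeddings_v01_zstd.gpq")
gdf.to_parquet("embeddings_v01_snappy.gpq", compression="snappy")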

Please let me know if you have any questions. Thanks!

mattpaul commented 5 months ago

Linking to AWS SDK for pandas for reference.

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on Lambda functions:

UPDATE_FAILED: ReadParquetLambdaFunction (AWS::Lambda::Function)
Resource handler returned message: "Unzipped size must be smaller than 86233173 bytes (Service: Lambda, Status Code: 400, Request ID: 5f9f750b-f8fb-4c61-bfc7-02ecb4ba3a22)" (RequestToken: 05ca5f53-f912-68a3-75ab-07ef6d06a3c1, HandlerErrorCode: InvalidRequest)

Hence I'm hoping to leverage a pre-built Lambda layer. Exploring alternatives...

weiji14 commented 5 months ago

For context, ZSTD compression was set in https://github.com/Clay-foundation/model/pull/86#discussion_r1423394603 because it results in slightly smaller file sizes and faster read speeds (decompression). Could you please report the version of aws-sdk-pandas that you are using? Is it 3.5.2 or an older version? And what version of pandas is it running (shown using pd.show_versions())?
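(For reference, something along these lines run inside the Lambda environment would print the relevant versions; a sketch only:)

import awswrangler as wr
import pandas as pd

print(wr.__version__)  # aws-sdk-pandas (awswrangler) version
pd.show_versions()     # pandas plus optional dependencies, including pyarrow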

Running into problems importing geopandas directly as a dependency in requirements.txt due to package size limitations imposed on lambda functions:

What's the limit for AWS Lambda? The PyArrow library used to read Parquet files is known to be quite big (see the Drawbacks section under https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html, which mentions PyArrow requiring 120MB and explicitly calls this out as a potential issue for AWS Lambda). The situation won't improve in the longer term though, especially for newer versions of Pandas v2.2+, so you might need to look at non-Lambda options if sticking with Pandas + PyArrow.

Taking a step back though, what are you actually trying to do with AWS Lambda? Are you trying to ingest the GeoParquet files into some database?

mattpaul commented 5 months ago

@weiji14 yes, correct. That is the architecture that was proposed here: https://github.com/Clay-foundation/vector/discussions/3#discussioncomment-7826219

It's unfortunate that a library for simply reading a file format should be so large (it seems unnecessarily so), though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

I am using the latest version of the AWS SDK for pandas, 3.5.2, via the us-east-1 Lambda layer ARN for Python 3.9 found here: https://aws-sdk-pandas.readthedocs.io/en/stable/layers.html

arn:aws:lambda:us-east-1:336392948345:layer:AWSSDKPandas-Python39:15

Note: I am able to open parquet files with that version of the library so long as the files have been encoded with supported compression types: gzip, snappy.

I created a handful of test cases here to verify which compression algorithms are supported:

s3://clay-vector-embeddings/test-cases/compression/

I took one of the v01 embeddings we originally generated with zstd compression and used the gpq command line tool to re-encode it with brotli, gzip, and snappy compression. You can see the results of attempting to call read_parquet on each test case here:

(upon successful read it is rendering the head of the dataframe as HTML for demo purposes).
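For reference, the read path in the Lambda looks roughly like this (a sketch only; the handler name and object key are placeholders):

import awswrangler as wr

def handler(event, context):
    # Placeholder key under the test prefix listed above.
    path = "s3://clay-vector-embeddings/test-cases/compression/embeddings_snappy.parquet"

    # read_parquet delegates to the PyArrow build bundled in the layer, so it
    # only succeeds for codecs compiled into that build (snappy, gzip, ...).
    df = wr.s3.read_parquet(path)

    # Render the head of the dataframe as HTML for demo purposes.
    return {"statusCode": 200, "body": df.head().to_html()}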

@weiji14 can you tell me more, or point me to more info, about the binary encoding format the model is using for the geometry field? Not sure how to decode that at the moment. Thanks!

weiji14 commented 5 months ago

It's unfortunate that the library to simply read a file format should be so large (seems unnecessarily so) though I can appreciate the desire to work with libraries and formats commonly used for data science in interactive notebooks, etc.

Note that PyArrow is not the only library implementation that can read Parquet files; there are others as well 😉

can you tell me more / point me to more info on the binary encoding format the model is using to encode the geometry field? not sure how to decode that at the moment. thanks!

The geometry is stored in Well-Known Binary (WKB) format, as per the GeoParquet specification: https://geoparquet.org/releases/v1.0.0/. Examples of readers:
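As one illustration (a hedged sketch, not necessarily one of the readers originally linked; the file name is a placeholder), the WKB bytes can be decoded with plain pandas plus Shapely:

import pandas as pd
from shapely import wkb

# Read the Parquet file without GeoPandas, then decode the WKB-encoded
# "geometry" column by hand (column name per the GeoParquet spec).
df = pd.read_parquet("embeddings_v01_snappy.gpq")
df["geometry"] = df["geometry"].apply(wkb.loads)
print(df["geometry"].head())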

Let me know if you need help understanding the geoparquet schema metadata parser, we can set up a meeting to have a chat.

mattpaul commented 5 months ago

Yeah, I have been looking at other implementations as well. GeoPandas itself is too large to import directly or via a Lambda layer. I'll check out that Rust-based implementation with the Python bindings, thanks.

yellowcap commented 1 month ago

Closing as out of date; feel free to re-open if appropriate.