apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

[feat request] Make `Table` / `TableMetadata` JSON serializable #535

Open kevinjqliu opened 7 months ago

kevinjqliu commented 7 months ago

Feature Request / Improvement

The REST Catalog exposes Table and TableMetadata information through HTTP endpoints in JSON format (link). That information closely mirrors the internal state of the Table and TableMetadata objects in Python.

It would be great to make these JSON serializable.

Example

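import json
from pyiceberg.catalog import load_catalog

catalog = load_catalog()  # uses the default catalog configuration
tbl = catalog.load_table("default.taxi_dataset")
json.dumps(tbl)        # TypeError: Table is not JSON serializable
json.dumps(vars(tbl))  # TypeError: TableMetadataV1 is not JSON serializable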

Error

>>> json.dumps(tbl)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type Table is not JSON serializable
>>> json.dumps(vars(tbl))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/.pyenv/versions/3.11.0/lib/python3.11/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type TableMetadataV1 is not JSON serializable
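
The two tracebacks differ because `vars()` turns the outer object into a dict, so the encoder gets one level deeper before it fails. Below is a minimal sketch that reproduces both errors using only the standard library. `Metadata` and `Table` here are hypothetical stand-ins, not PyIceberg's classes, and the `default=vars` hook is a generic stdlib workaround rather than anything proposed in this thread.

```python
import json


class Metadata:
    """Hypothetical stand-in for a metadata object such as TableMetadataV1."""

    def __init__(self, location: str) -> None:
        self.location = location


class Table:
    """Hypothetical stand-in for Table; it only holds a name and metadata."""

    def __init__(self, name: str, metadata: Metadata) -> None:
        self.name = name
        self.metadata = metadata


def try_dumps(obj: object) -> str:
    """Return the JSON string, or the TypeError message if encoding fails."""
    try:
        return json.dumps(obj)
    except TypeError as exc:
        return f"TypeError: {exc}"


tbl = Table("default.taxi_dataset", Metadata("file:///tmp/metadata.json"))

# The encoder rejects the top-level object outright.
top_level_error = try_dumps(tbl)

# vars() yields a dict, so the encoder gets one level deeper and then
# fails on the nested Metadata instance instead.
nested_error = try_dumps(vars(tbl))

# A default= hook that falls back to vars() is one stdlib workaround.
# It is called for every object the encoder cannot handle on its own.
as_json = json.dumps(tbl, default=vars)

print(top_level_error)
print(nested_error)
print(as_json)
```

The `default=vars` hook only helps for plain attribute-holding objects; values without a `__dict__` (for example a `set`) would still raise `TypeError`.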

Fokko commented 7 months ago

We should be able to (de)serialize it using Pydantic. That's probably also faster.

kevinjqliu commented 7 months ago

Oh, thanks for the hint. It looks like the model_dump_json function works.

from pyiceberg.catalog import load_catalog

catalog = load_catalog()
tbl = catalog.load_table("default.taxi_dataset")
# The metadata object is a Pydantic model, so it can dump itself to JSON.
tbl.metadata.model_dump_json()

But this only works on tbl.metadata, not on tbl itself.

kevinjqliu commented 7 months ago

There's already a __repr__ method defined for the Table object. @Fokko, what do you think about adding another method to Table that outputs its JSON representation?
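
As a rough illustration of what such a function might look like, here is a sketch that reuses the metadata's own serializer and wraps it with table-level fields. Everything in it is an assumption: `FakeMetadata` and `FakeTable` are hypothetical stand-ins built on the standard library, `to_json` is an invented method name, and the output key names are illustrative rather than taken from PyIceberg or the REST spec.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for a Pydantic-backed metadata model."""

    format_version: int
    location: str

    def model_dump_json(self) -> str:
        # Mimics the method name used in this thread; a real Pydantic
        # model provides this method itself.
        return json.dumps(asdict(self))


class FakeTable:
    """Hypothetical stand-in for Table with a proposed to_json() method."""

    def __init__(self, identifier, metadata, metadata_location):
        self.identifier = identifier
        self.metadata = metadata
        self.metadata_location = metadata_location

    def to_json(self) -> str:
        # Reuse the metadata's own serializer, then wrap the result with
        # the table-level fields that the metadata does not carry.
        return json.dumps(
            {
                "identifier": list(self.identifier),
                "metadata-location": self.metadata_location,
                "metadata": json.loads(self.metadata.model_dump_json()),
            }
        )


tbl = FakeTable(
    identifier=("default", "taxi_dataset"),
    metadata=FakeMetadata(format_version=1, location="file:///tmp/warehouse"),
    metadata_location="file:///tmp/warehouse/metadata/v1.metadata.json",
)
payload = json.loads(tbl.to_json())
print(payload["identifier"], payload["metadata"]["format_version"])
```

Delegating to `model_dump_json()` keeps the metadata encoding in one place, at the cost of a parse-and-re-encode round trip inside `to_json`.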

db-trin-life commented 1 week ago

@kevinjqliu if no one is working on this, I can take it on.

kevinjqliu commented 1 week ago

@db-trin-life Yep, assigned to you!