apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0

Rename `data_sequence_number` to `sequence_number` #893

Open Fokko opened 3 days ago

Fokko commented 3 days ago

Feature Request / Improvement

It looks like a misnamed field slipped in:

{
    "status": 1,
    "snapshot_id": {
        "long": 898025966831056900
    },
    "data_sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
        "content": 0,
        "file_path": "/tmp/some.db/tablev2/data/00000-0-93717a88-1cea-4e3d-a69a-00ce3d087822.parquet",
        "file_format": "PARQUET",
        "partition": {},
        "record_count": 3,
        "file_size_in_bytes": 5459,
        "column_sizes": { ... },
        "value_counts": { ... },
        "null_value_counts": { ... },
        "nan_value_counts": { ... },
        "lower_bounds": { ... },
        "upper_bounds": { ... },
        "key_metadata": null,
        "split_offsets": {
            "array": [
                4
            ]
        },
        "equality_ids": null,
        "sort_order_id": null
    }
}

This should be `sequence_number`.

Luckily this still works because Iceberg looks fields up by field ID rather than by name, but it would be good to get this cleaned up.
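To illustrate why the wrong name is harmless for readers, here is a minimal sketch of ID-based resolution. It is a simplified model, not PyIceberg's actual reader: `resolve_by_id` and the two field lists are hypothetical, and the field IDs (0, 1, 3, 4) are assumed to match the `manifest_entry` table in the Iceberg spec.

```python
# Hypothetical, simplified model of field-ID based resolution.
# This is NOT PyIceberg's reader; it only illustrates the idea.

# Schema as written to the manifest (field 3 carries the wrong name).
WRITER_FIELDS = [
    {"name": "status", "field-id": 0},
    {"name": "snapshot_id", "field-id": 1},
    {"name": "data_sequence_number", "field-id": 3},  # misnamed
    {"name": "file_sequence_number", "field-id": 4},
]

# Schema the reader expects (names assumed to follow the spec).
READER_FIELDS = [
    {"name": "status", "field-id": 0},
    {"name": "snapshot_id", "field-id": 1},
    {"name": "sequence_number", "field-id": 3},
    {"name": "file_sequence_number", "field-id": 4},
]


def resolve_by_id(record, writer_fields, reader_fields):
    """Re-key a written record using the reader's names, matching on field ID."""
    id_by_writer_name = {f["name"]: f["field-id"] for f in writer_fields}
    name_by_reader_id = {f["field-id"]: f["name"] for f in reader_fields}
    resolved = {}
    for written_name, value in record.items():
        field_id = id_by_writer_name[written_name]
        # Fields the reader does not know about are skipped.
        if field_id in name_by_reader_id:
            resolved[name_by_reader_id[field_id]] = value
    return resolved


record = {
    "status": 1,
    "snapshot_id": 898025966831056900,
    "data_sequence_number": None,
    "file_sequence_number": None,
}
resolved = resolve_by_id(record, WRITER_FIELDS, READER_FIELDS)
```

Because the match is on ID 3, the value lands under `sequence_number` on the read side even though the file says `data_sequence_number`. A name-based reader would not find the field.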

Relevant code:

https://github.com/apache/iceberg-python/blob/a8d3f17d42b00b507a3522714fe431a18124493e/pyiceberg/manifest.py#L380

kevinjqliu commented 3 days ago

Is there a way on the Java/Spark side to turn metadata information into JSON? With #535, perhaps we can compare the two JSON results and check for mismatches like this one.

soumya-ghosh commented 2 days ago

@Fokko I would like to take a shot at this one.

Fokko commented 2 days ago

@soumya-ghosh Feel free to take a stab at it, and let me know if you run into anything.

Fokko commented 2 days ago

> Is there a way on the Java/Spark side to turn metadata information into JSON? With https://github.com/apache/iceberg-python/issues/535, perhaps we can compare the two JSON results and check for mismatches like this one.

That would be an interesting idea. We could take the PySpark schema, turn it into an Iceberg schema, and compare the two (or just compare the Avro schemas).
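A rough sketch of the "just compare the Avro schemas" option might look like the following. It assumes both schemas are available as parsed JSON dicts and that each field carries a `field-id` property, as Iceberg's Avro files do. The helper names `collect_field_names` and `name_mismatches` are made up for this example, and the two sample schemas are trimmed-down stand-ins rather than real manifest schemas.

```python
def collect_field_names(schema, out=None):
    """Walk a parsed Avro schema and map each field-id to its field name."""
    if out is None:
        out = {}
    if isinstance(schema, list):
        # A union: visit every branch, e.g. ["null", {...record...}].
        for branch in schema:
            collect_field_names(branch, out)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema.get("fields", []):
                if "field-id" in field:
                    out[field["field-id"]] = field["name"]
                collect_field_names(field["type"], out)
        elif kind == "array":
            collect_field_names(schema.get("items"), out)
        elif kind == "map":
            collect_field_names(schema.get("values"), out)
    # Primitive type names such as "int" are plain strings and are ignored.
    return out


def name_mismatches(schema_a, schema_b):
    """Return (field_id, name_in_a, name_in_b) for IDs whose names differ."""
    names_a = collect_field_names(schema_a)
    names_b = collect_field_names(schema_b)
    shared_ids = names_a.keys() & names_b.keys()
    return sorted(
        (fid, names_a[fid], names_b[fid])
        for fid in shared_ids
        if names_a[fid] != names_b[fid]
    )


# Trimmed-down stand-ins for the two writers' manifest entry schemas.
python_schema = {
    "type": "record",
    "name": "manifest_entry",
    "fields": [
        {"name": "status", "type": "int", "field-id": 0},
        {"name": "data_sequence_number", "type": ["null", "long"], "field-id": 3},
    ],
}
java_schema = {
    "type": "record",
    "name": "manifest_entry",
    "fields": [
        {"name": "status", "type": "int", "field-id": 0},
        {"name": "sequence_number", "type": ["null", "long"], "field-id": 3},
    ],
}

mismatches = name_mismatches(python_schema, java_schema)
print(mismatches)  # [(3, 'data_sequence_number', 'sequence_number')]
```

Keying on field ID rather than position means reordered fields would not produce false positives; only a genuine name disagreement for the same ID is reported.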

soumya-ghosh commented 1 day ago

@Fokko the PR https://github.com/apache/iceberg-python/pull/900 is ready for review.