apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
383 stars 140 forks source link

TypeError when `operation` field is missing in `summary`. #1106

Open questsul opened 2 weeks ago

questsul commented 2 weeks ago

Apache Iceberg version

0.6.0

Please describe the bug 🐞

When attempting to read the metadata.json file, which contains a list of snapshots where some snapshot summaries lack the operation field, PyIceberg encounters the following error:

TypeError: Summary.init() missing 1 required positional argument: 'operation'.

Interestingly, when parsing the same metadata file using the Iceberg Java library, it works without any issues.

Full stack trace:

 File "reader.py", line 91, in _get_iceberg_table
    return StaticTable.from_metadata(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/table/__init__.py", line 1101, in from_metadata
    metadata = FromInputFile.table_metadata(file)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 113, in table_metadata
    return FromByteStream.table_metadata(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/serializers.py", line 94, in table_metadata
    return TableMetadataUtil.parse_raw(metadata)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pyiceberg/table/metadata.py", line 461, in parse_raw
    return TableMetadataWrapper.model_validate_json(data).root
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/pydantic/main.py", line 580, in model_validate_json
    return cls.__pydantic_validator__.validate_json(json_data, strict=strict, context=context)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Summary.__init__() missing 1 required positional argument: 'operation'

Metadata.json example:

{
  "format-version" : 2,
  "table-uuid" : "9996bcdf-3277-48f4-9e76-9e81766c9e0e",
  "location" : "file://t/some_table/",
  "last-sequence-number" : 45,
  "last-updated-ms" : 1724611070351,
  "last-column-id" : 79,
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "DATA",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "COLUMN_NAME",
      "required" : false,
      "type" : "string"
    }]
  } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-partition-id" : 999,
  "default-sort-order-id" : 0,
  "sort-orders" : [ {
    "order-id" : 0,
    "fields" : [ ]
  } ],
  "properties" : {
    "format-version" : "2"
  },
  "current-snapshot-id" : 1724611070351000000,
  "snapshots" : [ {
    "sequence-number" : 44,
    "snapshot-id" : 1724610129117000000,
    "timestamp-ms" : 1724610129117,
    "manifest-list" : "file://t/some_table/metadata/snap-1724610129117000000-d9b50309-0dff-472d-8711-86ca70021ffb.avro",
    "schema-id" : 0,
    "summary" : {
      "manifests-created" : "8",
      "total-records" : "26508666891",
      "added-files-size" : "3927895626752",
      "manifests-kept" : "0",
      "total-files-size" : "3927895626752",
      "added-records" : "26508666891",
      "added-data-files" : "231513",
      "manifests-replaced" : "0",
      "total-data-files" : "231513"
    }
  }, {
    "sequence-number" : 43,
    "snapshot-id" : 1724006578422000000,
    "timestamp-ms" : 1724006578422,
    "manifest-list" : "file://t/some_table/metadata/snap-1724006578422000000-289566b5-78fe-4b60-9ffa-ab25dee1edde.avro",
    "schema-id" : 0,
    "summary" : {
      "total-files-size" : "3888310341632",
      "added-records" : "26224534820",
      "added-data-files" : "225313",
      "manifests-replaced" : "0",
      "total-data-files" : "225313",
      "manifests-created" : "56",
      "total-records" : "26224534820",
      "added-files-size" : "3888310341632",
      "manifests-kept" : "0"
    }
  }, {
    "sequence-number" : 45,
    "snapshot-id" : 1724611070351000000,
    "timestamp-ms" : 1724611070351,
    "manifest-list" : "file://t/some_table/metadata/snap-1724611070351000000-6a307203-7148-467f-88eb-f932b32dd7f4.avro",
    "schema-id" : 0,
    "summary" : {
      "added-files-size" : "3929709293568",
      "total-records" : "26508666891",
      "manifests-created" : "8",
      "total-data-files" : "227581",
      "manifests-replaced" : "0",
      "added-data-files" : "227581",
      "added-records" : "26508666891",
      "total-files-size" : "3929709293568",
      "operation" : "append",
      "manifests-kept" : "0"
    }
  } ],
  "snapshot-log" : [ {
    "snapshot-id" : 1724006578422000000,
    "timestamp-ms" : 1724006578422
  }, {
    "snapshot-id" : 1724610129117000000,
    "timestamp-ms" : 1724610129117
  }, {
    "snapshot-id" : 1724611070351000000,
    "timestamp-ms" : 1724611070351
  } ]
}
questsul commented 2 weeks ago

How to reproduce: For pyicberg I was using metadata file stored in Azure blob storage

    static_table = StaticTable.from_metadata(
        "abfs://path/metadata/example.metadata.json",
        properties={
            "adlfs.connection-string": "ADD THIS",
        },
    )

Here is java snippet I used for verification:

 package com.example;

import org.apache.iceberg.StaticTableOperations;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.inmemory.InMemoryFileIO;
import org.apache.iceberg.io.FileIO;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import java.nio.file.Paths;

public class StaticTableExample {
    public static void main(String[] args) throws IOException {

    StaticTableExample ex = new StaticTableExample();

    ex.start();
    }

    public void start() throws IOException {
        InMemoryFileIO fileIO = new InMemoryFileIO();

        String metadataFilePath = "example.metadata.json";

        fileIO.addFile(metadataFilePath, readFileAsBytesNIO(metadataFilePath));

        StaticTableOperations ops = new StaticTableOperations(
                metadataFilePath,
                fileIO
        );

        TableMetadata meta = ops.current();

        System.out.println("Table location: " + meta.location());
    }

    public byte[] readFileAsBytesNIO(String filePath) throws IOException {
        Path path = Path.of(filePath);
        return Files.readAllBytes(path);
    }
}
questsul commented 2 weeks ago

I'm not entirely sure if I'm looking at the correct code, but it seems that in Java, the operation field might be optional during parsing. For example, it appears that operation can be set to null:

String operation = null;

The metadata.json file I provided was produced by Snowflake. After Snowflake made some updates to their Iceberg implementation, they began creating metadata files in this format. Previously, there were no issues reading Snowflake Iceberg tables using PyIceberg.

sungwy commented 2 weeks ago

Hi @questsul - thank you for raising this issue, and for providing this analysis. The optional and required attributes in PyIceberg are based on the nullability of the objects as they are defined within the Rest Catalog Open API spec. Here, the operation field in summary is labeled to be a required attribute:

https://github.com/apache/iceberg/blob/8e2eb9ac2e33ce4bac8956d4e2f099444d03c0e3/open-api/rest-catalog-open-api.yaml#L2057-L2060

I think there's a few takeaways here based on our findings:

  1. operation is a required field
  2. Java is interestingly more graceful in parsing the operation tag (and it probably should not be)
  3. The metadata files in your table are unfortunately not conformant to the REST API Spec. I think following up with the folks at Snowflake and seeing if there's a way to fix these metadata files (for example by making a copy and adding operation tags that are missing) may not be a bad idea
kevinjqliu commented 2 weeks ago

Java is interestingly more graceful in parsing the operation tag (and it probably should not be)

@sungwy does this mean that the current JAVA implementation does not adhere to the spec? If so, we should open a ticket to track