csimplestring / delta-go

Native Delta Lake Implementation in Go

cannot parse schema string from delta table metadata #21

Closed giulianopz closed 1 year ago

giulianopz commented 1 year ago

When getting the schema from the metadata of a given delta table, Metadata.Schema() fails due to a type conversion error.

To reproduce the error, run this slightly modified test in the repo:

package examples

import (
    "log"
    "path/filepath"
    "testing"

    delta "github.com/csimplestring/delta-go"
)

func TestLocalExample(t *testing.T) {

    path, err := filepath.Abs("../tests/golden/snapshot-data0")
    if err != nil {
        log.Fatal(err)
    }

    path = "file://" + path + "/"

    config := delta.Config{
        StoreType: "file",
    }

    table, err := delta.ForTable(path, config, &delta.SystemClock{})
    if err != nil {
        log.Fatal(err)
    }

    s, err := table.Snapshot()
    if err != nil {
        log.Fatal(err)
    }

    version := s.Version()
    log.Println(version)

    files, err := s.AllFiles()
    if err != nil {
        log.Fatal(err)
    }

    for _, f := range files {
        log.Println(f.Path)
    }

    metadata, err := s.Metadata()
    if err != nil {
        log.Fatal(err)
    }

    schema, err := metadata.Schema() // this call returns the error
    if err != nil {
        log.Fatal(err)
    }

    for _, f := range schema.GetFields() {
        log.Default().Println("name=", f.Name, "type=", f.DataType)
    }
}
--------------------
2023/06/26 23:47:55 0
2023/06/26 23:47:55 part-00000-0441e99a-c421-400e-83a1-212aa6c84c73-c000.snappy.parquet
2023/06/26 23:47:55 part-00001-34c8c673-3f44-4fa7-b94e-07357ec28a7d-c000.snappy.parquet
2023/06/26 23:47:55 fail to convert integer to a DataType: 
FAIL    github.com/csimplestring/delta-go/examples  2.016s
FAIL

This is the JSON schema that the library fails to parse:

{
    "type": "struct",
    "fields": [
        {
            "name": "col1",
            "type": "integer",
            "nullable": true,
            "metadata": {}
        },
        {
            "name": "col2",
            "type": "string",
            "nullable": true,
            "metadata": {}
        }
    ]
}

The problem is that the function nameToType maps the string "int" to IntegerType, whereas the schema uses "integer". The Delta protocol lists "integer" as the primitive type name, not "int". Since this Go implementation closely mirrors the native Scala library, I don't understand why this error occurs. Also, as far as I know, no breaking change has altered the schema serialization format in the protocol.
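To illustrate the expected behavior, here is a minimal sketch of a name lookup keyed on the protocol's type names, applied to the schema above. It is not the library's code: DataType, primitiveFromName, fieldTypes, and the struct definitions are hypothetical names made up for this example, only a few primitive types are covered, and nested types (struct, array, map) are assumed not to occur.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DataType is a hypothetical stand-in for the library's type representation.
type DataType string

const (
	IntegerType DataType = "integer"
	LongType    DataType = "long"
	StringType  DataType = "string"
	BooleanType DataType = "boolean"
)

// primitiveTypes is keyed on the names used in the serialized schema,
// e.g. "integer" rather than "int". Only a subset is listed here.
var primitiveTypes = map[string]DataType{
	"integer": IntegerType,
	"long":    LongType,
	"string":  StringType,
	"boolean": BooleanType,
}

// primitiveFromName (hypothetical helper) resolves a serialized type name.
func primitiveFromName(name string) (DataType, error) {
	if t, ok := primitiveTypes[name]; ok {
		return t, nil
	}
	return "", fmt.Errorf("fail to convert %s to a DataType", name)
}

// field and structSchema model only the parts of the schema JSON needed here.
// Assumption: every field type is a primitive, so "type" is a plain string.
type field struct {
	Name     string `json:"name"`
	Type     string `json:"type"`
	Nullable bool   `json:"nullable"`
}

type structSchema struct {
	Type   string  `json:"type"`
	Fields []field `json:"fields"`
}

// fieldTypes decodes a schema string and resolves each field's type name.
func fieldTypes(schemaJSON string) (map[string]DataType, error) {
	var s structSchema
	if err := json.Unmarshal([]byte(schemaJSON), &s); err != nil {
		return nil, err
	}
	out := make(map[string]DataType, len(s.Fields))
	for _, f := range s.Fields {
		t, err := primitiveFromName(f.Type)
		if err != nil {
			return nil, fmt.Errorf("field %q: %w", f.Name, err)
		}
		out[f.Name] = t
	}
	return out, nil
}

// exampleSchema is the schema string from this issue, on one line.
const exampleSchema = `{"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"string","nullable":true,"metadata":{}}]}`

func main() {
	types, err := fieldTypes(exampleSchema)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("col1 =", types["col1"])
	fmt.Println("col2 =", types["col2"])
}
```

With a table keyed this way, "integer" resolves and "int" is rejected, which matches the type names that appear in the schema string.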

csimplestring commented 1 year ago

Thanks for the info! I will fix this ASAP. Did you check whether the Scala connector gives the same error?

giulianopz commented 1 year ago

No, but I tried the Python bindings for delta-rs, which seem to work:

from deltalake import DeltaTable

dt = DeltaTable("../delta-go/tests/golden/snapshot-data0")
print(dt.schema().json())
for f in dt.schema().fields:
    print(f.name)
    print(f.type)
--------------------
{'type': 'struct', 'fields': [{'name': 'col1', 'type': 'integer', 'nullable': True, 'metadata': {}}, {'name': 'col2', 'type': 'string', 'nullable': True, 'metadata': {}}]}
col1
PrimitiveType("integer")
col2
PrimitiveType("string")

csimplestring commented 1 year ago

@giulianopz I fixed this bug and merged it into master.