apache / iceberg-go

Apache Iceberg - Go
https://iceberg.apache.org/
Apache License 2.0

Manifest List/Entry Creation #172

Open dwilson1988 opened 1 month ago

dwilson1988 commented 1 month ago

Feature Request / Improvement

Hello, I'm working on a use case where I need to be my own catalog and need to be able to create my own Iceberg tables purely in Go. I understand that table creation through a catalog is one of the design goals, but direct creation of manifests (snapshots, manifest lists/entries, data file metadata) does not appear to be supported unless I'm missing something. My use case is fairly straightforward:

  1. Crawl a filesystem/object store for Parquet files.
  2. Gather column-level statistics and file-level metadata.
  3. Build up a single snapshot for the results.
  4. Create table metadata for this snapshot and keep track of it in a separate store.

I could be missing something, but it appears all of the concrete structs are unexported and I don't see any external interface to create them. Is this within the design goals of this module? If so, where does it stand on the priorities? I will be starting work on this in fairly short order and plan to use this module to at least read tables. I'd like to be able to use it to write as well.

I'm more than happy to contribute this, if desired, and would love some guidance on how you'd like to see behavior like this implemented.

In addition, I notice a gocloud CDK PR that seems to have stalled out. Seeing as I also need this functionality, I'm happy to help take it across the finish line (though I might take a step back and rethink the design a little bit).

@zeroshade

zeroshade commented 1 month ago

Thanks for filing this!

> I understand that table creation through a catalog is one of the design goals, but direct creation of manifests (snapshots, manifest lists/entries, data file metadata) does not appear to be supported unless I'm missing something.

Currently we have concrete Manifest Builder objects in manifest.go for constructing manifest files, while https://github.com/apache/iceberg-go/pull/146 adds more generalized manifest building, snapshot additions, data file handling, etc.

> Is this within the design goals of this module? If so, where does it stand on the priorities? I will be starting work on this in fairly short order and plan to use this module to at least read tables. I'd like to be able to use it to write as well.
>
> I'm more than happy to contribute this, if desired, and would love some guidance on how you'd like to see behavior like this implemented.

It is definitely within the design goals of this module to have full write support for constructing metadata, snapshots, partitions, and everything else. In general, Builder-pattern APIs seem to be the safest approach for this package, since they ensure all of the moving parts are updated appropriately and stay consistent. A source of inspiration for this package has been to use pyiceberg as a starting point for developing interfaces and then make them more idiomatic for Go.
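
To make the Builder-pattern point concrete, here is a minimal sketch of how a builder can validate inputs and keep derived fields consistent before handing back an immutable value. Every type and method below (`dataFile`, `snapshot`, `snapshotBuilder`, `AddFile`, `Build`) is hypothetical and invented for illustration; none of it is the iceberg-go API, and real Iceberg snapshots carry far more state.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// All types in this file are hypothetical illustrations, not the iceberg-go API.
type dataFile struct {
	Path        string
	RecordCount int64
}

type snapshot struct {
	ID           int64
	Files        []dataFile
	TotalRecords int64 // derived from Files, so callers cannot set it inconsistently
}

// snapshotBuilder accumulates files and remembers the first validation error.
type snapshotBuilder struct {
	id    int64
	files []dataFile
	err   error
}

func newSnapshotBuilder(id int64) *snapshotBuilder {
	return &snapshotBuilder{id: id}
}

// AddFile validates and records one data file; it returns the builder for chaining.
func (b *snapshotBuilder) AddFile(f dataFile) *snapshotBuilder {
	if b.err != nil {
		return b // keep the first error
	}
	switch {
	case f.Path == "":
		b.err = errors.New("data file path must not be empty")
	case f.RecordCount < 0:
		b.err = fmt.Errorf("data file %q has a negative record count", f.Path)
	default:
		b.files = append(b.files, f)
	}
	return b
}

// Build returns the finished snapshot, computing the derived total in one place.
func (b *snapshotBuilder) Build() (snapshot, error) {
	if b.err != nil {
		return snapshot{}, b.err
	}
	if len(b.files) == 0 {
		return snapshot{}, errors.New("snapshot needs at least one data file")
	}
	s := snapshot{ID: b.id, Files: append([]dataFile(nil), b.files...)}
	for _, f := range s.Files {
		s.TotalRecords += f.RecordCount
	}
	return s, nil
}

func main() {
	snap, err := newSnapshotBuilder(1).
		AddFile(dataFile{Path: "data/a.parquet", RecordCount: 100}).
		AddFile(dataFile{Path: "data/b.parquet", RecordCount: 250}).
		Build()
	if err != nil {
		fmt.Fprintln(os.Stderr, "build failed:", err)
		os.Exit(1)
	}
	fmt.Printf("snapshot %d: %d files, %d records\n", snap.ID, len(snap.Files), snap.TotalRecords)
}
```

The design choice illustrated is that derived state (here, the record total) is computed only inside `Build`, so a caller cannot produce a snapshot whose summary disagrees with its file list.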

I would happily review any PRs that are put up and help get things implemented. My priorities are currently on the read side, as you can see from my recent PRs, with write support planned afterwards. But if you are going to be developing it anyway, I'd love the contribution.

> In addition, I notice a gocloud CDK PR that seems to have stalled out. Seeing as I also need this functionality, I'm happy to help take it across the finish line (though I might take a step back and rethink the design a little bit).

That would be fantastic! I would greatly appreciate it.