Apache Iceberg - Go
https://iceberg.apache.org/

IO Implementation using Go CDK #176

Open loicalleyne opened 1 month ago

loicalleyne commented 1 month ago

Extends PR #111

Implements #92. The Go CDK has well-maintained implementations for accessing object stores on S3, Azure, and GCS through an io/fs.FS-like interface. However, its file interface supports neither io.ReaderAt nor Seek(), both of which Iceberg-Go requires for files, and the File components are private. So we copied the wrappers and implemented the remaining functions directly inside Iceberg-Go.
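For context, here's a minimal sketch of what such a wrapper can look like (the names are illustrative, not the PR's actual code): each ReadAt maps onto one ranged read against the bucket.

```go
package blobio

import (
	"context"
	"io"

	"gocloud.dev/blob"
)

// blobFile is an illustrative wrapper type, not the PR's actual code.
type blobFile struct {
	ctx    context.Context
	bucket *blob.Bucket
	key    string
	size   int64 // object size, looked up once at open time
	pos    int64 // current offset for Read/Seek
}

// ReadAt satisfies io.ReaderAt by issuing one ranged read per call.
func (f *blobFile) ReadAt(p []byte, off int64) (int, error) {
	r, err := f.bucket.NewRangeReader(f.ctx, f.key, off, int64(len(p)), nil)
	if err != nil {
		return 0, err
	}
	defer r.Close()
	n, err := io.ReadFull(r, p)
	if err == io.ErrUnexpectedEOF {
		err = io.EOF // short read at the end of the object
	}
	return n, err
}

// Seek satisfies io.Seeker by tracking the offset locally; a Read method
// (not shown) would consume from f.pos.
func (f *blobFile) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		f.pos = offset
	case io.SeekCurrent:
		f.pos += offset
	case io.SeekEnd:
		f.pos = f.size + offset
	}
	return f.pos, nil
}
```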

In addition, we add support for S3 read IO using the CDK, with an extra property that lets users choose between the existing implementation and the new one.

GCS connection options can be passed in the properties map.
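For illustration, caller-side usage might look roughly like this; the property keys below are placeholders, since the real names are whatever this PR defines:

```go
// Placeholder property keys, for illustration only.
props := map[string]string{
	"gcs.project-id": "my-project", // hypothetical GCS connection option
	"io.s3.use-cdk":  "true",       // hypothetical toggle for the CDK-backed S3 reader
}
fs, err := io.LoadFS(props, "gs://bucket/warehouse") // assuming LoadFS(props, location)
if err != nil {
	log.Fatal(err)
}
_ = fs
```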

loicalleyne commented 1 month ago

@dwilson1988 I saw your note about wanting to work on the CDK features; if you're able to provide some feedback, that would be great.

dwilson1988 commented 1 month ago

@loicalleyne - happy to take a look. We use this internally in some of our software with Parquet and implemented a ReaderAt. I'll do a more thorough review when I get a chance, but my first thought is to leave this completely separate from the blob.Bucket implementation: let the Create/New funcs simply accept a *blob.Bucket and leave the rest as an exercise for the user. This keeps it more or less completely isolated from the implementation. Thoughts on this direction?

loicalleyne commented 1 month ago

My goal today was just to "get something on paper" to move this forward, since the other PR has been stalled since July. I used that PR as a starting point, so I mostly followed the existing patterns. Very open to moving things around if it makes sense. Do you have any idea how your suggestion would work with the interfaces defined in io.go?

dwilson1988 commented 1 month ago

Understood! I'll dig into your last question and get back to you.

dwilson1988 commented 1 month ago

Okay, played around a bit and here's where my head is at.

The main reason I'd like to isolate the creation of a *blob.Bucket is that I've found the particular implementation of bucket access can get tricky. Rather than supporting every situation in this package, we can support the most common usage in io.LoadFS/inferFileIOFromSchema and change io.CreateBlobFileIO to accept a *url.URL and a *blob.Bucket. This lets a user open a bucket with whatever implementation they choose (GCS, Azure, S3, MinIO, Mem, FileSystem, etc.), and there's less code here to maintain.

What I came up with is changing CreateBlobFileIO to:

```go
// CreateBlobFileIO creates a new BlobFileIO instance
func CreateBlobFileIO(parsed *url.URL, bucket *blob.Bucket) *BlobFileIO {
	ctx := context.Background()
	return &BlobFileIO{Bucket: bucket, ctx: ctx, opts: &blob.ReaderOptions{}, prefix: parsed.Host + parsed.Path}
}
```

The URL is still critical there, but now we don't have to concern ourselves with credentials for opening the bucket except in LoadFS.
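To make the intent concrete, a caller could then do something like the sketch below (assuming the CreateBlobFileIO above; memblob stands in for any driver):

```go
package main

import (
	"context"
	"log"
	"net/url"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/memblob" // swap for gcsblob, s3blob, azureblob, fileblob, ...
)

func main() {
	ctx := context.Background()

	// The user opens the bucket with whatever driver/credentials they want.
	bucket, err := blob.OpenBucket(ctx, "mem://")
	if err != nil {
		log.Fatal(err)
	}
	defer bucket.Close()

	parsed, err := url.Parse("mem://warehouse/db/table")
	if err != nil {
		log.Fatal(err)
	}

	// CreateBlobFileIO as proposed above (it lives in iceberg-go's io package).
	fio := CreateBlobFileIO(parsed, bucket)
	_ = fio
}
```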

Thoughts on this?

loicalleyne commented 1 month ago

@dwilson1988 Sounds good, I've made the changes, please take a look.

dwilson1988 commented 1 week ago

@loicalleyne is this still on your radar?

loicalleyne commented 1 week ago

hi @dwilson1988 - yes, I'm wrapping up some work on another project and will be jumping back on this in a day or two.

dwilson1988 commented 1 week ago

Cool - just checking. I'll be patient. 🙂

loicalleyne commented 1 week ago

@dwilson1988 made the suggested changes. There's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into; maybe you could take a look?
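For reference, if the warning is aws-sdk-go-v2's deprecation of the EndpointResolver/WithEndpointResolverWithOptions options (an assumption on my part), the newer pattern sets the endpoint per client:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	// Instead of a global EndpointResolver, set the endpoint on the client.
	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String("http://localhost:9000") // placeholder endpoint
		o.UsePathStyle = true                                // typical for MinIO-style endpoints
	})
	_ = client
}
```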

dwilson1988 commented 6 days ago

> @dwilson1988 made the suggested changes. There's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into; maybe you could take a look?

Yes, I can probably take a look next week.