loicalleyne opened 1 month ago
@dwilson1988 I saw your note about wanting to work on the CDK features, if you're able to provide some feedback that would be great.
@loicalleyne - happy to take a look. We use this internally in some of our software with Parquet and implemented a ReaderAt. I'll do a more thorough review when I get a chance, but my first thought was to leave it completely separate from the `blob.Bucket` implementation: let the Create/New funcs simply accept a `*blob.Bucket` and leave the rest as an exercise for the user. This keeps it more or less completely isolated from the implementation. Thoughts on this direction?
My goal today was just to "get something on paper" to move this forward, since the other PR has been stalled since July. I used that PR as a starting point, so I mostly followed the existing patterns. Very open to moving things around if it makes sense. Do you have a sense of how your idea would work with the interfaces defined in `io.go`?
Understood! I'll dig into your last question and get back to you.
Okay, played around a bit and here's where my head is at.
The main reason I'd like to isolate the creation of a `*blob.Bucket` is that I've found the particular implementation of bucket access can get tricky. Rather than supporting every situation in this package, we can support the most common usage in `io.LoadFS`/`inferFileIOFromSchema` and change `io.CreateBlobFileIO` to accept a `*url.URL` and a `*blob.Bucket`. This enables a user to open a bucket with whatever implementation they choose (GCS, Azure, S3, MinIO, Mem, FileSystem, etc.), and there's less code here to maintain.
What I came up with is changing `CreateBlobFileIO` to:

```go
// CreateBlobFileIO creates a new BlobFileIO instance
func CreateBlobFileIO(parsed *url.URL, bucket *blob.Bucket) *BlobFileIO {
	ctx := context.Background()
	return &BlobFileIO{Bucket: bucket, ctx: ctx, opts: &blob.ReaderOptions{}, prefix: parsed.Host + parsed.Path}
}
```
The URL is still critical there, but now we don't have to concern ourselves with credentials to open the bucket, except in `LoadFS`.
Thoughts on this?
@dwilson1988 Sounds good, I've made the changes, please take a look.
@loicalleyne is this still on your radar?
hi @dwilson1988 yes, I'm wrapping up some work on another project and will be jumping back on this in a day or two.
Cool - just checking. I'll be patient. 🙂
@dwilson1988 made the suggested changes. There's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into; maybe you could take a look?
Yes, can probably take a look next week
Extends PR #111
Implements #92. The Go CDK has well-maintained implementations for accessing object stores on S3, Azure, and GCS via an `io/fs.FS`-like interface. However, their file interface supports neither `io.ReaderAt` nor the `Seek()` method that Iceberg-Go requires for files, and the File components are private. So we copied the wrappers and implemented the remaining functions inside Iceberg-Go directly.
In addition, we add support for S3 read IO using the CDK, with an extra property that lets users choose between the existing and the new implementation.
GCS connection options can be passed in the properties map.