earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Support Microsoft Azure for the object store #266

Open paraseba opened 1 month ago

kylebarron commented 1 month ago

For both this and https://github.com/earth-mover/icechunk/issues/265, it seems like an ObjectStore-based API is ideal, correct?

In https://github.com/developmentseed/object-store-rs I've been developing pyo3 integration for the object_store crate, which we could reuse in the icechunk Python APIs. (That repo has both a user-facing Python API and a Rust-facing pyo3 integration, though the latter isn't well documented yet.)

I also heard vaguely from @jhamman that you originally used the object_store crate for all backends, but then hit some performance issues that made you switch to using S3 directly. Can you elaborate on that? The object_store maintainers are very performance-sensitive, so I'd 1) be surprised that you could get better performance by using the S3 APIs natively, and 2) if so, the object_store maintainers would probably be interested in your use case. (If preferred, we can move this discussion to a separate thread.)

paraseba commented 1 month ago

@kylebarron We had many random very weird errors with object_store when used with high concurrency. But this was the last week before launch, so we didn't have time to explore. The errors looked like malformed HTTP responses, very weird. It happened trying to use the dask_write.py example we have. You can give it a try with several threads doing big writes. Easiest course of action at the time was to switch to aws's implementation. Errors immediately disappeared and performance was awesome again.

I'm personally not a fan of object_store. I think the "one instance per bucket" decision is too committal, and I'm always disappointed by how little compatibility there is between the different implementations (for example, metadata support, sorting of keys in list, etc.). But I acknowledge the value of having an implementation compatible across platforms.

To support Google and Microsoft, I think using ObjectStore would be a great first approach, maybe even a definitive one if performance and stability are good.

geospatial-jeff commented 1 month ago

We had many random very weird errors with object_store when used with high concurrency. But this was the last week before launch, so we didn't have time to explore. The errors looked like malformed HTTP responses, very weird.

@paraseba I believe this could be due to a difference in how credential resolution was implemented in the object_store version of the code compared to the AWS SDK. For deployments, object_store will fall back to the node/instance profile (credentials provider) when it finds no other form of authentication on the machine. This is a remote network call (a PUT request to an AWS token endpoint) that is sent before each HTTP request to the blob store. At high concurrency you may run into rate limits, account-level quotas, and other failures that produce malformed tokens. I've seen this myself running both boto3 and gdal at big scales; both have similar fallback logic.

The original object_store implementation would fall back to node profile if it hit this else block, and no credentials were configured in the environment - https://github.com/earth-mover/icechunk/blob/e3a119cddfcbe7b38173faeb0b4eb403bd69de7b/icechunk/src/storage/object_store.rs#L92-L94

The AWS SDK does not implement this fallback behavior unless you explicitly configure it in the code, and this was not configured when switching the code over to the AWS SDK (#135), which would explain why this issue went away with the switch.
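
For illustration, this is roughly how one would pin credentials explicitly on the object_store side so the builder never has to reach for the fallback (a sketch, not the icechunk code; the region and env-var handling are placeholders):

use object_store::aws::{AmazonS3, AmazonS3Builder};

// Sketch: pass credentials explicitly so the instance-profile fallback is
// never consulted, even under high concurrency (values are placeholders).
fn make_s3_store(bucket: &str) -> object_store::Result<AmazonS3> {
    AmazonS3Builder::new()
        .with_bucket_name(bucket)
        .with_region("us-east-1")
        .with_access_key_id(std::env::var("AWS_ACCESS_KEY_ID").expect("AWS_ACCESS_KEY_ID not set"))
        .with_secret_access_key(std::env::var("AWS_SECRET_ACCESS_KEY").expect("AWS_SECRET_ACCESS_KEY not set"))
        .build()
}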

This is of course a lot of guesswork based on past experiences; it really depends on how the code / dask cluster was deployed and configured. Local cluster on an EC2 instance? Distributed cluster on k8s? Running from your local machine? All have different failure cases when it comes to credentials.

Either way, I don't think this is an issue with object_store; it's most likely related to how credentials are configured.

kylebarron commented 3 weeks ago

I'm interested in implementing this (and potentially #265 at the same time) via the Python bindings.

Unless you object, I'd like to use pyo3-object-store for this. I haven't documented it yet, but it's designed as a way to export constructors for the object_store crate once with a single API, but reuse them across multiple Rust-Python bindings.

Any feedback?

mpiannucci commented 3 weeks ago

Unless you object, I'd like to use pyo3-object-store for this. I haven't documented it yet, but it's designed as a way to export constructors for the object_store crate once with a single API, but reuse them across multiple Rust-Python bindings.

Can you explain why it would make sense to use pyo3-object-store instead of using the object_store crate itself in Rust? I see your point about the single API interface, but all of the storage is currently implemented in the Rust core, and we will be using this core to support other languages (and platforms) that are not Python based, so we want to make sure we keep that as a core value.

kylebarron commented 3 weeks ago

pyo3-object_store doesn't replace the object_store crate. It provides Python builders so that Python users can create Arc<dyn ObjectStore> instances.

For example, all of storage.rs just exports a very small subset of the configuration options to create S3 instances.

I'll publish and document pyo3-object_store tomorrow, but it essentially allows you to

  1. Register the builders. This means that Python users can access the object-store constructors from icechunk.store, e.g. icechunk.store.S3Store, icechunk.store.AzureStore, etc.
  2. Accept PyObjectStore as a parameter in your function exported to Python. PyObjectStore::into_inner gives you an Arc<dyn ObjectStore>.

This is just for the Python binding. It wouldn't alter any piece of how you already use object-store in the main icechunk crate.
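
For instance, a minimal sketch of what step 2 could look like on the icechunk-python side (the function name is hypothetical; PyObjectStore and into_inner are the pieces that come from pyo3-object_store):

use std::sync::Arc;

use object_store::ObjectStore;
use pyo3::prelude::*;
use pyo3_object_store::PyObjectStore;

// Hypothetical icechunk-python entry point: any store built in Python
// (icechunk.store.S3Store, icechunk.store.AzureStore, ...) arrives here.
#[pyfunction]
fn open_storage(store: PyObjectStore) -> PyResult<()> {
    // into_inner hands back the Arc<dyn ObjectStore> the Rust core works with.
    let store: Arc<dyn ObjectStore> = store.into_inner();
    // ... hand `store` to the icechunk Storage layer here ...
    let _ = store;
    Ok(())
}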

kylebarron commented 3 weeks ago

For example, all of storage.rs just exports a very small subset of the configuration options to create S3 instances.

That is, if you're able to switch back to object-store for S3 as well, you'd be able to delete that entire file by re-exporting the structs from pyo3-object_store.

mpiannucci commented 3 weeks ago

Got it, so most of that makes sense. I'm trying to understand how the pyo3-object_store configs look in Python land, because our biggest reason for using custom structs is that we all dislike the fsspec method of using keyword args for configuration. Using the structs allows us to strictly type the configuration as best we can.

I see that the config is a hashmap with specific keys on the Rust side, but on the Python side how do those config bindings appear?

kylebarron commented 3 weeks ago

There's strict type hinting for all Python exports. E.g. look at https://github.com/developmentseed/obstore/blob/e1ed7659fdea8a0f4bc9eb2e813f6713c125e368/obstore/python/obstore/store/_aws.pyi. You'd be re-exporting that same class essentially. Or https://github.com/developmentseed/obstore/blob/e1ed7659fdea8a0f4bc9eb2e813f6713c125e368/obstore/python/obstore/store/_azure.pyi for Azure.

Note that the config parameter is a dict with these specified options; there are also separate options for the HTTP client and for the retry config.

Note that since the underlying Arc<dyn ObjectStore> is not ABI-stable, you'd need to export these builders from your own Python library. Even though the class names are the same, it wouldn't work to pass an obstore.store.AzureStore into icechunk-python. You'd need to instruct users to construct an icechunk.store.AzureStore. This is a limitation of dynamic linking with pyo3 for non-ABI-stable objects; see https://github.com/PyO3/pyo3/issues/1444
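
Concretely, re-exporting from your own library might look something like this on the Rust side (a sketch; the class names are assumptions, so check the pyo3-object_store docs for the exact exports):

use pyo3::prelude::*;
// Assumed names for the builder classes shipped by pyo3-object_store.
use pyo3_object_store::{PyAzureStore, PyS3Store};

// Sketch of an icechunk.store submodule that re-exports the builders, so users
// construct icechunk.store.S3Store / icechunk.store.AzureStore rather than obstore's.
#[pymodule]
fn store(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_class::<PyS3Store>()?;
    m.add_class::<PyAzureStore>()?;
    Ok(())
}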

mpiannucci commented 3 weeks ago

@geospatial-jeff

The AWS SDK does not implement this fallback behavior unless you explicitly configure it in the code, and this was not configured when switching the code over to the AWS SDK (https://github.com/earth-mover/icechunk/pull/135), which would explain why this issue went away with the switch.

It seems the AWS SDK does indeed load the config from env by default though (https://github.com/earth-mover/icechunk/blob/2c9a311c4d69c6cff099eb6474b525193d1ec01c/icechunk/src/storage/s3.rs#L85), so I am confused by this statement. Am I missing something here?

I am responding to this inline here because it is related to the effort to support more stores through object_store while keeping S3 compatible.

kylebarron commented 3 weeks ago

FromEnv is probably referring to environment variables, not node instance credentials. Those are two different things.

mpiannucci commented 3 weeks ago

FromEnv in the icechunk version does nothing extra and simply creates an S3 client using the default behavior, which loads from env vars, instance creds, etc.:

https://github.com/earth-mover/icechunk/blob/2c9a311c4d69c6cff099eb6474b525193d1ec01c/icechunk/src/storage/s3.rs#L76

I tried to find it, but ObjectStore seems convoluted here: if we don't explicitly tell it to load from env vars, will it work the same way?
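
For reference, the rough object_store counterpart would presumably be something like the following (a sketch; AmazonS3Builder::from_env reads the AWS_* environment variables, and whether its credential fallback then matches the SDK's default chain is exactly the open question):

use object_store::aws::{AmazonS3, AmazonS3Builder};

// Sketch: build an S3 store from the environment, the closest analogue to the
// SDK's "load everything from env" default used in icechunk's FromEnv branch.
fn s3_from_env(bucket: &str) -> object_store::Result<AmazonS3> {
    AmazonS3Builder::from_env()
        .with_bucket_name(bucket)
        .build()
}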

kylebarron commented 2 weeks ago

I published an initial beta of pyo3-object_store v0.1.0-beta.1: https://crates.io/crates/pyo3-object_store

I'll write more documentation on Monday and then I can lay out my proposal of how the integration would work.

mpiannucci commented 2 weeks ago

Sounds great!

paraseba commented 2 weeks ago

I'd really like it if we could have a single object_store based Storage implementation. These are the things I think need to happen for that:

  1. Implement and test. Currently object_store is used mostly in tests, so we'll need to gain confidence.
  2. Implement our virtual chunk management with multiple ObjectStore instances. Last time I checked each instance of ObjectStore can handle a single bucket (which I hate, but it is what it is); see the sketch below.
  3. Verify performance. We moved to the S3 client because performance was really bad, but this comment offers some hope.

To convince ourselves of point 3 above, we can use our dask_write example. This is how we discovered the performance issues before:

python examples/dask_write.py --url s3://some-bucket/some-prefix create
python examples/dask_write.py --url s3://some-bucket/some-prefix update --t-from 0 --t-to 256 --workers 16  
python examples/dask_write.py --url s3://some-bucket/some-prefix verify --t-from 0 --t-to 256 --workers 16  

As long as object_store can reproduce (close to) the performance of the current code, we are good.
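
On point 2 above, a minimal sketch of what juggling one ObjectStore instance per bucket for virtual chunks could look like (all names here are hypothetical, not icechunk's actual types):

use std::{collections::HashMap, sync::Arc};

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hypothetical resolver: one ObjectStore per bucket, registered up front.
struct VirtualChunkStores {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

impl VirtualChunkStores {
    // Look up the store for the chunk's bucket, then fetch the chunk bytes.
    async fn get_chunk(&self, bucket: &str, key: &str) -> object_store::Result<Bytes> {
        let store = self
            .stores
            .get(bucket)
            .expect("a store for this bucket must be registered");
        let result = store.get(&Path::from(key)).await?;
        result.bytes().await
    }
}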

kylebarron commented 2 weeks ago

I personally have quite high confidence in the object_store crate because it's used in production by some very big projects, including the crates.io registry and any downstream project from datafusion. Some really big companies depend on datafusion, like Apple, which recently donated its datafusion comet project to Apache.

Understandably we need to get more confidence with object_store in icechunk.

I see the artificially_sort_refs_in_mem flag. Is it crucial for you to get the list result back in sorted order? I don't recall which stores actually do that. I see that ObjectStore doesn't guarantee the order of results.

  • Implement our virtual chunk management with multiple ObjectStore instances. Last time I checked each instance of ObjectStore can handle a single bucket (which I hate, but it is what it is)

This is true; it's connected to a single bucket.

paraseba commented 2 weeks ago

I see the artificially_sort_refs_in_mem flag. Is it crucial for you to get the list result back in sorted order?

Yes, that is crucial. Our in-mem store is used for tests and exploration, so we don't expect to have many keys, and the artificial sort works there. And with the "normal" object stores, object_store preserves the ordering. So I'm not too concerned about this point.
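
For the in-mem/local cases, the artificial sort amounts to something like this (a sketch against a recent object_store where list returns a stream; not the actual icechunk code):

use futures::TryStreamExt;
use object_store::{ObjectMeta, ObjectStore};

// Sketch: collect a listing and sort it in memory, for implementations that
// don't hand results back in lexicographic key order.
async fn sorted_list(store: &dyn ObjectStore) -> object_store::Result<Vec<ObjectMeta>> {
    let mut entries: Vec<ObjectMeta> = store.list(None).try_collect().await?;
    entries.sort_by(|a, b| a.location.cmp(&b.location));
    Ok(entries)
}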

Another similar one is that some object_store implementations don't handle metadata, which increases the complexity of the code a lot. You'll find many instances of if self.supports_metadata.

Understandably we need to get more confidence with object_store in icechunk.

Yeah... Icechunk's use of the object store could be somewhat particular: we want to read and write many relatively small objects, with very high concurrency, and quickly.

kylebarron commented 2 weeks ago

Another similar one is some object_store implementation don't handle metadata

Can you elaborate on this? You're referring to these attributes in the GetResult?

kylebarron commented 2 weeks ago

You'll find many instances of if self.supports_metadata.

I can only find two instances of the string supports_metadata on current main https://github.com/earth-mover/icechunk/blob/fc62650c662b1d1dc9a924571cc45b6652f33758/icechunk/src/storage/object_store.rs#L211-L230

https://github.com/earth-mover/icechunk/blob/fc62650c662b1d1dc9a924571cc45b6652f33758/icechunk/src/storage/object_store.rs#L252-L271

And from the code it looks like AWS, GCS, and Azure all support writing attributes. Or maybe it was with local/in-memory stores that you had issues?

paraseba commented 2 weeks ago

Right, the local store is the one not supporting user metadata; you'll see we create it with:

supports_metadata: false,

Metadata could become a more important need in the near future, and we don't have an answer for local filesystem support. Thankfully we can still use the in memory store in tests, but it will make things more complicated. For example, to ensure we maintain format compatibility across versions, it would be great to have a local store that supports metadata.

paraseba commented 2 weeks ago

By the way, thank you so much for working on this @kylebarron! If we manage to make it work, it's going to be a big improvement to Icechunk.