ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0

Add object store cache for GCS #634

Closed benjamin-awd closed 4 months ago

benjamin-awd commented 4 months ago

Resolves https://github.com/ArroyoSystems/arroyo/issues/621

This PR:

  1. Makes the construct_gcs function async, which should improve performance
  2. Adds a cache to prevent redundant DNS lookups
  3. Adds an alternative way to create the GCS builder via GOOGLE_SERVICE_ACCOUNT_KEY
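
To illustrate the third point, here is a minimal sketch of choosing a credential source from GOOGLE_SERVICE_ACCOUNT_KEY. The `GcsCredentials` enum and both function names are hypothetical and are not taken from this PR. The fallback to the metadata server when the variable is unset or blank is also an assumption.

```rust
/// Hypothetical description of where GCS credentials come from.
#[derive(Debug, PartialEq)]
pub enum GcsCredentials {
    /// A service account key supplied explicitly by the user.
    ServiceAccountKey(String),
    /// No key supplied: rely on the instance metadata server (assumed fallback).
    MetadataServer,
}

/// Pure selection logic, kept separate from the environment so it is easy to test.
pub fn select_gcs_credentials(key: Option<String>) -> GcsCredentials {
    match key {
        // Treat an empty or whitespace-only value the same as an unset one.
        Some(k) if !k.trim().is_empty() => GcsCredentials::ServiceAccountKey(k),
        _ => GcsCredentials::MetadataServer,
    }
}

/// Reads GOOGLE_SERVICE_ACCOUNT_KEY from the process environment.
pub fn gcs_credentials_from_env() -> GcsCredentials {
    select_gcs_credentials(std::env::var("GOOGLE_SERVICE_ACCOUNT_KEY").ok())
}
```

In a real builder, the `ServiceAccountKey` branch would pass the key to the GCS client configuration, and the `MetadataServer` branch would leave the default credential lookup in place.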

Currently, every checkpoint instantiates a new ObjectStore, which results in a large number of calls to the metadata server. This can cause many concurrent DNS lookups, which may add network latency and have other undesirable effects. Although the GCE metadata service endpoint has no official rate limit, we should avoid making unnecessary calls to it.
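
The caching idea can be sketched with the standard library alone. This is not the code in this PR: `StoreClient`, `STORE_CACHE`, and `get_or_create_store` are hypothetical names, `StoreClient` stands in for the real ObjectStore, and the sketch is synchronous for brevity while the PR makes construct_gcs async.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for a client that is expensive to construct,
/// such as an object store handle that contacts the metadata server.
#[derive(Debug)]
pub struct StoreClient {
    pub bucket: String,
}

/// Process-wide cache keyed by bucket name (assumed key; a URL would also work).
static STORE_CACHE: OnceLock<Mutex<HashMap<String, Arc<StoreClient>>>> = OnceLock::new();

/// Returns the cached client for `bucket`, calling `build` only on a cache miss.
pub fn get_or_create_store<F>(bucket: &str, build: F) -> Arc<StoreClient>
where
    F: FnOnce(&str) -> StoreClient,
{
    let cache = STORE_CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut guard = cache.lock().expect("store cache mutex poisoned");

    // Cache hit: hand out another reference to the existing client.
    if let Some(existing) = guard.get(bucket) {
        return Arc::clone(existing);
    }

    // Cache miss: build once, store it, and return a shared handle.
    let created = Arc::new(build(bucket));
    guard.insert(bucket.to_string(), Arc::clone(&created));
    created
}
```

With a cache like this, repeated checkpoints against the same bucket reuse one client, so the expensive construction step runs once per bucket rather than once per checkpoint. Holding the lock while building keeps the sketch simple; an async version would need an async-aware lock or a build-outside-the-lock strategy.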

This follows the approach Polars takes: https://github.com/pola-rs/polars/issues/14384#issuecomment-1948991697, https://github.com/pola-rs/polars/blob/main/crates/polars-io/src/cloud/object_store_setup.rs#L4

Note: the AWS metadata endpoint has a rate limit of 1024 packets per second, so it might be worth implementing this for the construct_s3 function as well at some point.

benjamin-awd commented 4 months ago

This is great! Thanks for the contribution. Can you confirm that this addresses https://github.com/ArroyoSystems/arroyo/issues/621?

It's looking a lot better so far on our end, but will keep an eye on it -- I think our remaining issues are due to how the NATS connector handles checkpointing