Aiven-Open / tiered-storage-for-apache-kafka

RemoteStorageManager for Apache Kafka® Tiered Storage
Apache License 2.0

S3 key/folder structure for performance #124

Open HenryCaiHaiying opened 1 year ago

HenryCaiHaiying commented 1 year ago

For S3 file transfer performance, the structure of the S3 object key prefixes is the key factor.

Do you know how the folder structure on S3 is laid out? Is it a simple hierarchical directory path (e.g. s3://bucket/tiered/topic-1/partition-1/logsegment-1.log)?

This layout would bottleneck on reading/writing into topic-1/partition-1, since S3 partitioning happens on the topic-level path (tiered/topic-1). If we add a random salt/hash component into the path (e.g. s3://bucket/tiered/aabbg/topic-1), it would greatly improve S3 transfer throughput.

HenryCaiHaiying commented 1 year ago

The easiest way to achieve this is to add the segment_id at the beginning of the S3 path in the fileNamePrefix method: https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/s3/src/main/java/io/aiven/kafka/tiered/storage/s3/S3StorageUtils.java#L34

Something like: s3://bucket/hashed-segment-id/topic/partition/segment.log
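
A minimal sketch of how such a salted key could be computed, assuming a truncated SHA-256 digest as the salt (the class and method names here are illustrative, not the actual S3StorageUtils API):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: prepend a short hash of the segment identity so writes
// spread across many S3 partitions instead of piling up under a single
// topic/partition prefix. Not the actual S3StorageUtils implementation.
final class SaltedKeyExample {

    static String saltedKey(String topic, int partition, long baseOffset) {
        final String segmentId = topic + "-" + partition + "-" + baseOffset;
        // Use the first 5 hex chars of a SHA-256 digest as the salt path element.
        final String salt = sha256Hex(segmentId).substring(0, 5);
        return salt + "/" + topic + "/" + partition + "/"
                + String.format("%020d", baseOffset) + ".log";
    }

    private static String sha256Hex(String s) {
        try {
            final MessageDigest md = MessageDigest.getInstance("SHA-256");
            final StringBuilder sb = new StringBuilder();
            for (final byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (final NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        // Prints something like "3f9a1/topic-1/1/00000000000000001024.log"
        System.out.println(saltedKey("topic-1", 1, 1024L));
    }
}
```

Because the salt is derived deterministically from the segment identity, the full key can be recomputed from the metadata alone, without listing the bucket.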

mdedetrich commented 1 year ago

@AnatolyPopov can correct me if I am wrong but we are not doing anything special in this regard.

In general I don't have anything against adding functionality like this, i.e. we can add various mapping functions (e.g. a hash or, as you said, the segment_id), and it can then be configured. I think the possible issue here is having to deal with migration scenarios (i.e. what happens if you change the configuration?). There is also an open question about saving this configuration as metadata in S3 so that any hypothetical future migration tool has the necessary information (if that's a requirement).
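
For illustration only, a configurable mapping hook could look something like the following (this interface and the property name are hypothetical, not part of the current codebase):

```java
// Hypothetical pluggable key-mapping hook; not an existing interface in this
// project. Implementations could produce a plain hierarchical path, a
// hashed-prefix path, a segment-id-first path, etc.
public interface ObjectKeyMapper {

    /**
     * Maps a segment's logical coordinates to its S3 object key.
     */
    String map(String topic, int partition, long baseOffset, String suffix);
}

// Selected via configuration, e.g. (property name is made up):
//   remote.storage.s3.key.mapper.class=io.example.HashedPrefixKeyMapper
```

Persisting the chosen mapper (and its parameters) alongside the data would give a future migration tool enough information to locate existing objects after a configuration change.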

HenryCaiHaiying commented 1 year ago

The remote storage metadata is actually stored separately in a Kafka topic (the default implementation in KIP-405). We are supposed to find/resolve the remote location of a topic/partition/segment from that Kafka metadata topic.

In general, we should avoid using S3 object listing to find the related files. Object listing quickly becomes a bottleneck once you have a million small object files in an S3 bucket: LIST doesn't recursively traverse based on your folder path; it has to enumerate those million object paths first. Always use GET/PUT instead of LIST if possible.
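
To make the contrast concrete, a sketch with the AWS SDK for Java v2 (bucket and key names are made up for illustration):

```java
import java.nio.file.Paths;

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class ListVsGet {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Anti-pattern: discovering segments by listing. Responses are
            // paginated at 1000 keys, so a bucket with millions of objects
            // costs thousands of sequential LIST calls.
            s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                            .bucket("my-bucket")
                            .prefix("tiered/topic-1/1/")
                            .build())
                    .contents()
                    .forEach(obj -> System.out.println(obj.key()));

            // Preferred: resolve the exact key from the remote log metadata
            // topic (KIP-405) and fetch it with a single GET.
            s3.getObject(GetObjectRequest.builder()
                            .bucket("my-bucket")
                            .key("tiered/topic-1/1/00000000000000001024.log")
                            .build(),
                    Paths.get("/tmp/segment.log"));
        }
    }
}
```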

mdedetrich commented 1 year ago

The remote storage metadata is actually stored separately in a Kafka topic (the default implementation in KIP-405). We are supposed to find/resolve the remote location of a topic/partition/segment from that Kafka metadata topic.

True, but if the mapping functions have data inputs that aren't stored in the topic metadata, then that may be problematic (however, that's probably jumping the gun).

In general, we should avoid using S3 object listing to find the related files. Object listing quickly becomes a bottleneck once you have a million small object files in an S3 bucket: LIST doesn't recursively traverse based on your folder path; it has to enumerate those million object paths first. Always use GET/PUT instead of LIST if possible.

Agreed, I am aware of this issue as well; pagination is also a big factor behind the bottleneck.

ivanyu commented 1 year ago

AWS S3 used to recommend a (pseudo-)randomized prefix to solve this issue. However, some years ago they drastically changed how S3 scales internally, and now it's not needed anymore.

From https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/introduction.html:

This guidance supersedes any previous guidance on optimizing performance for Amazon S3. For example, previously Amazon S3 performance guidelines recommended randomizing prefix naming with hashed characters to optimize performance for frequent data retrievals. You no longer have to randomize prefix naming for performance, and can use sequential date-based naming for your prefixes.

Google also does auto-scaling. They do recommend prefix randomization for sequential reads. However, it's unclear what key length they operate on (AWS S3 uses 1024 bytes, to the best of our knowledge, but it's not official). So this needs additional investigation. Same for Azure.

HenryCaiHaiying commented 1 year ago

From the AWS doc, I don't see how they improve performance without prefix randomization. I remember reading somewhere else that AWS can analyze the request pattern and start to shard the prefixes/files within the bucket according to that pattern. But it usually takes some time for this auto-resharding to kick in, since it's a learn-and-optimize mechanism. In the tiered storage use case, traffic is mostly one-way: Kafka brokers dump segment files sequentially onto S3, and those files are rarely read back. I'm not sure whether the S3 writes can be parallelized between different brokers or different threads if the bucket prefixes are all similar. But if the topic/partition name is at the beginning of the S3 path, S3 parallelization might still work across different topics/partitions.

ivanyu commented 1 year ago

To be open, we're seriously considering this, but we're also being very careful with it, as it comes at the cost of losing easy discoverability of the segments (due to broken sort order), which probably isn't great from an operations point of view.

Even if we count one partitioned prefix as one Kafka partition, that's about 5k read requests per second (on both S3 and GCS). Do you have a requirement for which this limit is too low?
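
For context, a back-of-the-envelope sketch of what that per-prefix request limit means in throughput terms (the 4 MiB read size is an assumption for illustration, not a number from this thread):

```java
// Rough arithmetic only: S3's documented per-prefix limit is ~5,500 GET/s.
public class PrefixThroughput {
    public static void main(String[] args) {
        final double getsPerSecond = 5_500.0;            // per-prefix GET limit
        final double bytesPerGet = 4.0 * 1024 * 1024;    // assumed 4 MiB range reads
        final double gibPerSecond = getsPerSecond * bytesPerGet / (1024.0 * 1024 * 1024);
        // Prints "~21.5 GiB/s per partition prefix"
        System.out.printf("~%.1f GiB/s per partition prefix%n", gibPerSecond);
    }
}
```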