Closed dmcguire81 closed 5 years ago
Hi David
I agree with you. So, we're using custom implementation of RemotePrefixFormatter in Netflix to avoid uneven partitioning in S3. You can implement similar stuff using DynamicRemotePrefixFormatter. I copied our internal code as the following:
public class SuroPrefixFormatter implements RemotePrefixFormatter {
public static final String TYPE = "suroprefix";
private static SimpleDateFormat day = new java.text.SimpleDateFormat("HH'/'yyyyMMdd'/'");
private final String prefix;
@JsonCreator
public SuroPrefixFormatter(@JsonProperty("prefix") String prefix) {
this.prefix = prefix == null ? "suro" : prefix;
}
@Override
public String get() {
return prefix + "/" +
NetflixConfiguration.getServerId().substring(2, 4) + "/" +
NetflixConfiguration.getServerId().substring(2) + "/" +
day.format(new Date()) +
NetflixConfiguration.getRegion() + "/" +
NetflixConfiguration.getStack() + "/";
}
}
NetflixConfiguration.getServerId() returns AWS EC2 instance id such as i-abcdefg.
It looks like the default behavior for the S3FileSink, as inherited from RemoteFileSink, is to prefix filenames with the ISO 8601 formatted date. Doesn't this pose a huge scalability bottleneck with respect to S3's key space partitioning scheme? The primary concern is that S3 partitioning can't be don't on reverse indexes (keys), so there is no distribution of keys among partitions when the uncommon prefix is sequential, like an ISO 8601 date (from the docs: "Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition"). What would be the best way to hook in and override this behavior?