Netflix / suro

Netflix's distributed Data Pipeline
Apache License 2.0
794 stars 170 forks source link

Potential scalability problem with default RemotePrefixFormatter for S3 Sink #179

Closed dmcguire81 closed 5 years ago

dmcguire81 commented 9 years ago

It looks like the default behavior for the S3FileSink, as inherited from RemoteFileSink, is to prefix filenames with the ISO 8601 formatted date. Doesn't this pose a huge scalability bottleneck with respect to S3's key space partitioning scheme? The primary concern is that S3 partitioning can't be don't on reverse indexes (keys), so there is no distribution of keys among partitions when the uncommon prefix is sequential, like an ISO 8601 date (from the docs: "Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition"). What would be the best way to hook in and override this behavior?

metacret commented 9 years ago

Hi David

I agree with you. So, we're using custom implementation of RemotePrefixFormatter in Netflix to avoid uneven partitioning in S3. You can implement similar stuff using DynamicRemotePrefixFormatter. I copied our internal code as the following:

public class SuroPrefixFormatter implements RemotePrefixFormatter {
    public static final String TYPE = "suroprefix";

    private static SimpleDateFormat day = new java.text.SimpleDateFormat("HH'/'yyyyMMdd'/'");

    private final String prefix;

    @JsonCreator
    public SuroPrefixFormatter(@JsonProperty("prefix") String prefix) {
        this.prefix = prefix == null ? "suro" : prefix;
    }

    @Override
    public String get() {
        return prefix + "/" +
                NetflixConfiguration.getServerId().substring(2, 4) + "/" +
                NetflixConfiguration.getServerId().substring(2) + "/" +
                day.format(new Date()) +
                NetflixConfiguration.getRegion() + "/" +
                NetflixConfiguration.getStack() + "/";
    }
}

NetflixConfiguration.getServerId() returns AWS EC2 instance id such as i-abcdefg.