MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

improve repartitioning in Spark jobs #373

Open ghukill opened 5 years ago

ghukill commented 5 years ago

Just a thought, but perhaps the hardcoded SPARK_REPARTITION could be radically improved by instead aiming for a target number of records per partition?

This seems particularly feasible because metadata records tend to be fairly similar in size. Alternatively, we could aim for a target partition size, but that would require investigating/averaging the size of records; we already have some decent metrics to work with.

If 657 records is ~4.0 MB, we can assume that 1 record is ~0.006 MB. So, if we were to set a target of 5,000 records per partition, we could infer that a worker will likely encounter a partition of approximately 30.4 MB. That is small-ish, but might be a vast improvement over 200 partitions, even for a small 657 record / 4.0 MB dataset.
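A rough sketch of that arithmetic, using the figures above (the variable names here are just for illustration, and 5,000 is the assumed target):

```python
import math

# Figures from the example above
total_records = 657
total_size_mb = 4.0
target_records_per_partition = 5000  # assumed target

# Approximate size of one record (~0.006 MB, i.e. ~6 KB)
record_size_mb = total_size_mb / total_records

# Expected size of a partition holding the target number of records (~30.4 MB)
estimated_partition_size_mb = record_size_mb * target_records_per_partition

# Partitions needed for this dataset at that density (1, vs. the hardcoded 200)
num_partitions = math.ceil(total_records / target_records_per_partition)

print(record_size_mb, estimated_partition_size_mb, num_partitions)
```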

The S3 exporter is attempting something like the following, which could be used elsewhere:

rdd_to_write.repartition(math.ceil(rdd_to_write.count() / settings.TARGET_RECORDS_PER_PARTITION))
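One way that could be generalized for other Spark jobs is a small helper along these lines (a sketch only; `repartition_by_target` is a hypothetical name, and `rdd.count()` does force a pass over the data):

```python
import math

def repartition_by_target(rdd, target_records_per_partition):
    # Hypothetical helper: size the number of partitions from the record
    # count rather than a hardcoded SPARK_REPARTITION value.
    # Note: rdd.count() triggers a job, so a cached/known record count
    # could be passed in instead where one is already available.
    num_partitions = max(1, math.ceil(rdd.count() / target_records_per_partition))
    return rdd.repartition(num_partitions)

# e.g. rdd_to_write = repartition_by_target(rdd_to_write, settings.TARGET_RECORDS_PER_PARTITION)
```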