Just a thought, but perhaps the hardcoded `SPARK_REPARTITION` could be radically improved by instead aiming for a target number of records per partition?
That seems workable because metadata records tend to be fairly similar in size. Alternatively, you could aim for a target partition size, but that would require investigating/averaging the size of records. We do have some decent metrics already, though.
If 657 records is ~4.0 MB, we can assume that 1 record is ~0.006 MB (about 6 KB, or roughly 164 records per MB). So, if we were to set a target of 5,000 records per partition, we could infer that a worker will likely encounter a partition of approximately 30.4 MB. That is small-ish, but might be a vast improvement over 200 partitions, even for a small 657 record / 4.0 MB data set, where 200 partitions would hold only about 3 records each if spread evenly.
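To make the arithmetic above concrete, here is a minimal sketch of deriving a partition count from a record count. The function names and the 5,000-record default are hypothetical, chosen only for illustration; they are not taken from the existing codebase.

```python
import math


def target_partitions(record_count, records_per_partition=5000):
    """Partition count that aims for roughly `records_per_partition` records each.

    Hypothetical helper: always returns at least 1 partition.
    """
    if records_per_partition <= 0:
        raise ValueError("records_per_partition must be positive")
    return max(1, math.ceil(record_count / records_per_partition))


def estimated_partition_mb(sample_records, sample_mb, records_per_partition=5000):
    """Rough partition size in MB, extrapolated from a measured sample."""
    return sample_mb / sample_records * records_per_partition


# Numbers from the example above: 657 records is ~4.0 MB.
print(target_partitions(657))                      # 1
print(round(estimated_partition_mb(657, 4.0), 1))  # 30.4
```

With a PySpark DataFrame this could presumably be applied as `df.repartition(target_partitions(df.count()))`, keeping in mind that `count()` is itself a Spark action and so has a cost of its own.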
The S3 exporter is attempting something like the following, which could be used elsewhere: