airbnb / chronon

Chronon is a data platform for serving AI/ML applications.
Apache License 2.0

Use a min of 200 parallelism for job write #874

Closed by pengyu-hou 2 weeks ago

pengyu-hou commented 2 weeks ago

Summary

The current code calculates the write parallelism so as to avoid creating too many small files. However, this logic does not work well for the group-by batch upload job, because its output table is unpartitioned. As a result, the job writes its output with very low parallelism, around 10.

This PR enforces a minimum write parallelism of 200, which guarantees a floor on parallelism in such scenarios.
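The idea of a parallelism floor can be illustrated with a minimal sketch. This is not Chronon's actual code: the function names, the row-count heuristic, and the `rows_per_file` parameter are hypothetical stand-ins for whatever size-based estimate the job uses. Only the floor of 200 and the low estimate of about 10 come from this PR description.

```python
import math

# Floor proposed in this PR; everything else below is a hypothetical illustration.
MIN_WRITE_PARALLELISM = 200


def estimated_parallelism(total_rows: int, rows_per_file: int) -> int:
    """Hypothetical size-based estimate: one write task per target-sized file.

    An estimate like this keeps the number of output files small, but it can
    come out very low when the whole output lands in one unpartitioned table.
    """
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    return max(1, math.ceil(total_rows / rows_per_file))


def write_parallelism(total_rows: int, rows_per_file: int,
                      floor: int = MIN_WRITE_PARALLELISM) -> int:
    """Apply the floor: never write with fewer than `floor` tasks."""
    return max(floor, estimated_parallelism(total_rows, rows_per_file))


if __name__ == "__main__":
    # A small estimate is raised to the floor.
    print(write_parallelism(total_rows=10_000_000, rows_per_file=1_000_000))
    # A large estimate is left unchanged.
    print(write_parallelism(total_rows=500_000_000, rows_per_file=1_000_000))
```

In a Spark job, a value like this would typically be passed to a call such as `DataFrame.repartition(n)` before the write. The trade-off is that the floor may produce more, smaller files for genuinely small outputs in exchange for faster writes on large unpartitioned ones.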

Why / Goal

The goal is to improve write performance for the group-by batch upload job.

Test Plan

Checklist

Reviewers

@pkundurthy @hzding621 @yuli-han