The simplest way is to just try it. However, there are two variables we should consider.
Suppose there are 40,000,000 rows and you have 100 cores available to your application. On the core side, this means you can try 100 * N files, where N can be 1/4, 1/2, 1, 2, or 3. On the row side, suppose you try 200 files; then every file has 200,000 rows. Since every file (partition) is handled by one core, each core processes 200,000 rows, which should not take long.
In conclusion, the appropriate file number depends on how long you can tolerate a single task taking; then make sure the number of tasks is cores * N, where N can be ... 1/4, 1/2, 1, 2 ....
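The arithmetic above can be sketched with a small helper (a hypothetical function written for this thread, not part of Delta or Spark):

```python
def candidate_file_counts(total_rows, cores, multipliers=(0.25, 0.5, 1, 2, 3)):
    """For each cores*N candidate, return (file_count, rows_per_file)."""
    out = []
    for n in multipliers:
        files = int(cores * n)
        out.append((files, total_rows // files))
    return out

# 40,000,000 rows on 100 cores: with N=2 you get 200 files
# of 200,000 rows each, i.e. 200,000 rows per core per task.
for files, rows_per_file in candidate_file_counts(40_000_000, 100):
    print(files, rows_per_file)
```

Pick the smallest rows-per-file value whose per-task runtime you can tolerate.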
Also, in the doc we strongly recommend using partitionByRange to repartition the data when creating the Delta table for the first time.
The two parameters of partitionByRange have a strong impact on the performance of upsert/delete in Delta. The first is the file number mentioned in this issue; the other is the column used to partition the data into files. Normally you can choose the primary key by default.
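As a rough PySpark sketch of that recommendation (the function name, file count, key column, and path are all made up for illustration; `repartitionByRange` is the standard Spark DataFrame method):

```python
def write_initial_delta(df, file_count, key_col, path):
    """First-time creation of a Delta table: range-partition the data by
    the primary key into `file_count` files, then write it out as Delta."""
    (df.repartitionByRange(file_count, key_col)  # the two tuning knobs
       .write.format("delta")
       .mode("overwrite")
       .save(path))

# Usage (assuming `df` is an existing DataFrame with primary key "id"):
#   write_initial_delta(df, 200, "id", "/tmp/delta/my_table")
```

Range-partitioning by the primary key means a later upsert that matches on that key only has to rewrite the few files whose key range overlaps the incoming data, instead of the whole table.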
This blog explains how to optimize the Upsert/Delete performance in Delta.
Thanks! This is a really good blog; I will keep following it.
For example, how many partitions should I initialize for a 40-million-row MySQL table?