allwefantasy / spark-binlog

A library for querying Binlog with Apache Spark Structured Streaming, for Spark SQL, DataFrames and [MLSQL](https://www.mlsql.tech).

How to choose the appropriate partition number based on the amount of MySQL table data when initializing the Delta table? #13

Closed: zhengqiangtan closed this 4 years ago

zhengqiangtan commented 4 years ago

For example, how many partitions should I use when initializing a Delta table from a 40-million-row MySQL table?

allwefantasy commented 4 years ago

The simplest way is to just try. However, there are two variables we should consider:

  1. How many cores will your application have?
  2. How many rows in your table?

Suppose there are 40,000,000 rows and you have 100 cores available to your application. For cores, this means you can try 100 * N files, where N can be 1/4, 1/2, 1, 2, or 3. For rows, suppose you try 200 files; then every file holds 200,000 rows. Since every file (partition) is handled by one core, one core processes 200,000 rows, which should not take long.

In conclusion, the appropriate file number depends on how long you can tolerate a single task taking; then make sure the number of tasks is cores * N, where N can be ... 1/4, 1/2, 1, 2 ... (see the sketch below).
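
A minimal sketch of that heuristic in plain Scala. All names here (`totalRows`, `cores`, the chosen N values) are illustrative and not part of spark-binlog's API:

```scala
// Sketch of the cores * N heuristic from the discussion above.
object PartitionHeuristic {
  def main(args: Array[String]): Unit = {
    val totalRows = 40000000L // rows in the MySQL table (example figure)
    val cores     = 100       // cores available to the application

    // Try task counts of cores * N for several values of N, and look
    // at how many rows one core would process per task.
    for (n <- Seq(0.25, 0.5, 1.0, 2.0, 3.0)) {
      val tasks       = math.max(1, (cores * n).toInt)
      val rowsPerTask = totalRows / tasks
      println(f"N=$n%.2f -> $tasks%4d tasks, ~$rowsPerTask rows per task")
    }
  }
}
```

Pick the smallest N whose rows-per-task figure a single core can process within the latency you can tolerate.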

allwefantasy commented 4 years ago

Also, in the doc we strongly recommend using repartitionByRange to repartition the data when creating the Delta table for the first time.

The two parameters of repartitionByRange have a strong impact on upsert/delete performance in Delta. The first is the file number, as discussed in this issue; the other is the column used to range-partition the data into files. Normally the primary key is a good default choice. A sketch of the initial load is shown below.
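
A hedged sketch of that initial load using the standard Spark JDBC reader and Delta Lake writer. The connection options, table name, output path, and the `id` column are placeholders, not values from this issue:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("init-delta-table").getOrCreate()

// Load the full MySQL table once for the initial snapshot.
val mysqlDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/mydb") // placeholder connection
  .option("dbtable", "my_table")                // placeholder table
  .option("user", "user")
  .option("password", "password")
  .load()

// Range-partition by the primary key into the file count chosen above
// (200 files for ~40,000,000 rows in the earlier example), then write
// the initial Delta table.
mysqlDf
  .repartitionByRange(200, col("id")) // "id" stands in for the primary key
  .write
  .format("delta")
  .save("/data/delta/my_table")       // placeholder path
```

Range partitioning by the primary key keeps rows with nearby keys in the same files, so later upserts and deletes touch fewer files.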

allwefantasy commented 4 years ago

This blog explains how to optimize Upsert/Delete performance in Delta.

zhengqiangtan commented 4 years ago

> This blog explains how to optimize Upsert/Delete performance in Delta.

Thanks! This is really a very good blog; I will continue to follow it.