StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

Streaming load regression (performance and errors) from 3.0->3.1 #28032

Closed: rcauble closed this issue 1 year ago

rcauble commented 1 year ago

Steps to reproduce the behavior (Required)

container-repro.zip

  1. unzip container-repro.zip (attached)
  2. cd container-repro
  3. mvn install

Expected behavior (Required)

On 3.0 this succeeds and takes about 35s.

To verify the 3.0 behavior, edit ConcurrentStreamingLoadTest, changing IMAGE_NAME to starrocks/allin1-ubuntu:3.0.3.

Real behavior (Required)

On 3.1 it fails with "Publish timeout" and takes 1 min, 30s.

StarRocks version (Required)

starrocks/allin1-ubuntu:3.1.0-rc01 (broken)
starrocks/allin1-ubuntu:3.0.3 (working)

rcauble commented 1 year ago

I've created an additional experiment here: container-repro-serial.zip

The key difference is that it runs the threads serially and also reduces the number of tables it attempts to create from 500 to 200. With the reduced table count and concurrency it succeeds, but it makes the performance regression much clearer: the total runtime grows to 4 minutes (the same test runs in 42s on SR 3.0.3):

https://gist.github.com/rcauble/5cd34a37de2bdd7f3f23908b4aa1c22b

(The original experiment understates the performance impact somewhat, since it fails after 150 or so table creations and bails out early. The slowdown seems to get increasingly worse as more tables are created.)

A few things to note:

  1. The first table uploads are relatively quick. In the gist above, the first upload to take > 1s appears around line 219 (earlier ones took << 1s); around line 326 uploads start taking > 10s and stay above 10s thereafter. (A sketch of what a single upload looks like follows this list.)
  2. Performance also seems related to the number of columns. The table in question has ~450 columns (although only a few of them are populated by the upload request). If I reduce the number of columns to < 10, the test completes much faster (30s).
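
A minimal sketch, not code from the repro zip, of what one such Stream Load "table upload" against the allin1 container might look like. The database, table, and column names here are hypothetical, and it targets the BE HTTP port (8040) directly to sidestep the FE's 307 redirect; the real test presumably goes through the FE with a redirect-following client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class StreamLoadSketch {
    public static void main(String[] args) throws Exception {
        // Default root user with an empty password (assumed for the allin1 container).
        String auth = Base64.getEncoder().encodeToString("root:".getBytes());
        // Only a couple of the ~450 columns are populated, as in the repro.
        String csvRow = "1,foo\n";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8040/api/test_db/wide_table_001/_stream_load"))
                .header("Authorization", "Basic " + auth)
                .header("label", "repro-" + System.nanoTime())  // unique label per load
                .header("column_separator", ",")
                .header("columns", "id,name")                   // subset of columns to fill
                .PUT(HttpRequest.BodyPublishers.ofString(csvRow))
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.printf("status=%d elapsed=%dms%n%s%n",
                response.statusCode(), (System.nanoTime() - start) / 1_000_000, response.body());
    }
}
```

Going through the FE instead means sending the PUT to the FE HTTP port and following the 307 redirect while re-sending the Authorization header (with curl, --location-trusted does this).
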
meegoo commented 1 year ago

Thank you for your testing. I am currently verifying it locally and will sync with you once I have identified the issue. @rcauble

meegoo commented 1 year ago

We've located the issue. In version 3.1 we added a feature that triggers asynchronous statistics collection after the first load into a table, executed by a thread pool. In this test scenario the number of collection tasks triggered exceeded the thread pool's queue capacity, so tasks fell back to running synchronously, which led to the timeout. After fixing this issue, the added collection work means the entire run takes approximately 56 seconds. If the initial collection is turned off with `admin set frontend config ('enable_statistic_collect_on_first_load'='false');`, the entire run takes about 36 seconds, which should be similar to the performance of version 3.0.
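
For anyone who wants to apply that workaround from a test harness rather than a SQL shell, here is a hedged sketch. It assumes the allin1 container exposes the FE MySQL port on 127.0.0.1:9030, the default root user with an empty password, and MySQL Connector/J on the classpath; none of this is taken from the repro itself.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DisableFirstLoadStats {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://127.0.0.1:9030", "root", "");
             Statement stmt = conn.createStatement()) {
            // Turn off the statistics collection that 3.1 triggers after the first load into a table.
            stmt.execute(
                "ADMIN SET FRONTEND CONFIG ('enable_statistic_collect_on_first_load' = 'false')");

            // Print the FE config row back to confirm the new value was applied.
            try (ResultSet rs = stmt.executeQuery(
                    "ADMIN SHOW FRONTEND CONFIG LIKE 'enable_statistic_collect_on_first_load'")) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    for (int i = 1; i <= cols; i++) {
                        System.out.print(rs.getString(i) + (i == cols ? "\n" : "\t"));
                    }
                }
            }
        }
    }
}
```

Note that, going by the config name and the explanation above, this only skips the collection triggered by the first load into a table; it is a mitigation for this load pattern rather than a general recommendation.
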