StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0

Stream load has problems with ingesting multiple CSV files in a short amount of time. #43359

Closed alberttwong closed 1 month ago

alberttwong commented 7 months ago

See https://github.com/slingdata-io/sling-cli/issues/229

alberttwong commented 7 months ago

@MarkovWangRR FYI

rishabhkaushal07 commented 7 months ago

+1 on this issue.

Encountered the following error while loading multiple CSV files (from separate terminals, but using the same DB connection) into SR (docker - allin1-ubuntu:3.2.4) using sling. Sling starts loading the data fine, but partway through the load the errors below come up for different CSV files.

Error:

12m30s 16,743,351 23008 r/s 11 GB
12m31s 16,766,702 23030 r/s 11 GB
12m32s 16,789,944 23044 r/s 11 GB
12m33s 16,813,222 23059 r/s 11 GB
12m34s 16,836,468 23071 r/s 11 GB | 23% MEM | 28% CPU
2024-04-02 11:14:16 DBG stream-load completed for /tmp/starrocks/db/SFU_Fact_Screening_2017_tmp/2024-04-02T110141.174/part.01.0067.csv => {
    "TxnId": -1,
    "Label": "579a248b-c0f3-429f-a7be-4f7918fa5bdb",
    "Status": "Fail",
    "Message": "call frontend service failed, address=TNetworkAddress(hostname=<host_ip>, port=9020), reason=THRIFT_EAGAIN (timed out)",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 0,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 0,
    "CommitAndPublishTimeMs": 0
}
12m35s 16,875,853 22389 r/s 11 GB | 23% MEM | 43% CPU 
2024-04-02 11:14:22 DBG loading /tmp/starrocks/db/SFU_Fact_Screening_2017_tmp/2024-04-02T110141.174/part.01.0068.csv [164 MB] ds.1712080900942.fro-0
2024-04-02 11:14:22 DBG drop table if exists `db`.`SFU_Fact_Screening_2017_tmp`
2024-04-02 11:14:22 DBG table `db`.`SFU_Fact_Screening_2017_tmp` dropped
2024-04-02 11:14:22 DBG closed "starrocks" connection (conn-starrocks-UAB)
2024-04-02 11:14:22 INF execution failed

fatal:
--- sling_cli.go:418 func1 ---
--- sling_cli.go:474 cliInit ---
--- cli.go:284 CliProcess ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_logic.go:224 processRun ---
--- sling_logic.go:371 runTask ---
~ execution failed
--- task_run.go:138 Execute ---

--- database_starrocks.go:504 func4 ---
Failed loading from /tmp/starrocks/db/SFU_Fact_Screening_2017_tmp/2024-04-02T110141.174/part.01.0067.csv into `db`.`SFU_Fact_Screening_2017_tmp`
{
    "TxnId": -1,
    "Label": "579a248b-c0f3-429f-a7be-4f7918fa5bdb",
    "Status": "Fail",
    "Message": "call frontend service failed, address=TNetworkAddress(hostname=<host_ip>, port=9020), reason=THRIFT_EAGAIN (timed out)",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 0,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 0,
    "CommitAndPublishTimeMs": 0
}

context canceled

--- task_run.go:97 func1 ---
~ could not write to database
--- task_run.go:387 runFileToDB ---
~ could not insert into `db`.`SFU_Fact_Screening_2017_tmp`.
--- task_run_write.go:307 WriteToDb ---

--- database_starrocks.go:504 func4 ---
Failed loading from /tmp/starrocks/db/SFU_Fact_Screening_2017_tmp/2024-04-02T110141.174/part.01.0067.csv into `db`.`SFU_Fact_Screening_2017_tmp`
{
    "TxnId": -1,
    "Label": "579a248b-c0f3-429f-a7be-4f7918fa5bdb",
    "Status": "Fail",
    "Message": "call frontend service failed, address=TNetworkAddress(hostname=<host_ip>, port=9020), reason=THRIFT_EAGAIN (timed out)",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 0,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 0,
    "CommitAndPublishTimeMs": 0
}

context canceled

Our sling connection details are as follows:

export STARROCKS='{ type: starrocks, url: "starrocks://root@<host_ip>:9030/db", fe_url: "http://<host_ip>:8030" }'

sling command:

./sling run \
    --src-stream file:///SFU_Fact_Screening_2017.csv \
    --src-options '{"format": "csv", "options": {"delimiter": "|", "header": true}}' \
    --tgt-conn STARROCKS \
    --tgt-object db.SFU_Fact_Screening_2017  \
    --mode full-refresh \
    --debug

nshangyiming commented 7 months ago

call frontend service failed, address=TNetworkAddress(hostname=<host_ip>, port=9020), reason=THRIFT_EAGAIN (timed out)

This is probably caused by lock contention in the FE when there are too many concurrent stream load jobs. You can check fe.log and search for "slow db lock" to see whether there is heavy lock contention. To avoid this error, you can also increase the timeout of the stream load job, e.g. -H "timeout:300". Doc ref: https://docs.starrocks.io/docs/sql-reference/sql-statements/data-manipulation/STREAM_LOAD/#set-timeout-period.
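
For anyone driving Stream Load from their own client rather than through sling, here is a minimal sketch of the suggestion above: send the `timeout` header and retry with backoff when the response carries the `THRIFT_EAGAIN` failure shown in this issue. It is not how sling or StarRocks handles this internally. The `send` callable, the function names, and the retry parameters are all hypothetical; `send` stands in for whatever code performs the HTTP PUT and returns the JSON response body as text.

```python
import json
import time
from typing import Callable, Dict


def stream_load_headers(timeout_s: int = 300, column_separator: str = "|") -> Dict[str, str]:
    """Build Stream Load request headers; `timeout` is given in seconds."""
    return {
        "timeout": str(timeout_s),
        "column_separator": column_separator,
        "Expect": "100-continue",
    }


def is_transient_failure(result: dict) -> bool:
    """True when the response looks like the FE RPC timeout from this issue."""
    return result.get("Status") == "Fail" and "THRIFT_EAGAIN" in result.get("Message", "")


def load_with_retry(
    send: Callable[[Dict[str, str]], str],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `send` until it stops failing with a transient error.

    `send` is a hypothetical callable (an assumption, not a StarRocks API):
    it performs one Stream Load request with the given headers and returns
    the JSON response body as a string.
    """
    headers = stream_load_headers()
    result: dict = {}
    for attempt in range(max_attempts):
        result = json.loads(send(headers))
        if not is_transient_failure(result):
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: 1s, 2s, 4s, ... gives the FE time to drain.
            sleep(base_delay_s * 2 ** attempt)
    return result
```

Backing off between attempts also lowers the number of stream load jobs in flight at once, which is the suspected source of the lock contention. Non-transient failures (for example, data quality errors) are returned immediately rather than retried.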

github-actions[bot] commented 1 month ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!