StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

stream loader for local csv files errors out with runtime state is cancelled #43612

Open rishabhkaushal07 opened 6 months ago

rishabhkaushal07 commented 6 months ago

The load of a large local CSV file into StarRocks (docker: allin1-ubuntu:3.2.4) errors out partway through the load process. I tried Stream Load (the curl command below) as well as the sling tool, and both fail with the errors listed under Real behavior.

Steps to reproduce the behavior (Required)

curl command:

curl --location-trusted -u root: \
    -H "label:raw_SFU_Fact_Screening_2019_2_try1" \
    -H "format: csv" \
    -H "column_separator: |" \
    -H "enclose: \"" \
    -H "max_filter_ratio: 0.1" \
    -T "file2015.csv" \
    -XPUT http://<IP>:8030/api/db/SFU_Fact_Screening_2019_2/_stream_load

sling command:

export STARROCKS='{ type: starrocks, url: "starrocks://root@<IP>:9030/lifelabs_ckd", fe_url: "http://<IP>:8030" }'

./sling run \
    --src-stream file://file2019.csv \
    --src-options '{"format": "csv", "options": {"delimiter": "|", "header": true}}' \
    --tgt-conn STARROCKS \
    --tgt-object db.SFU_Fact_Screening_2019  \
    --mode full-refresh \
    --debug

Expected behavior (Required)

The CSV file should load into the database successfully.

Real behavior (Required)

The CSV load to StarRocks (allin1-ubuntu:3.2.4) fails partway through the load process.

sling error:

2024-04-03 23:56:29 INF execution failed
fatal:
--- sling_cli.go:418 func1 ---
--- sling_cli.go:474 cliInit ---
--- cli.go:284 CliProcess ---
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
--- sling_logic.go:224 processRun ---
--- sling_logic.go:371 runTask ---
~ execution failed
--- task_run.go:138 Execute ---

--- database_starrocks.go:504 func4 ---
Failed loading from /tmp/starrocks/db/SFU_Fact_Screening_2019_tmp/2024-04-03T224142.004/part.01.0411.csv into `db`.`SFU_Fact_Screening_2019_tmp`
{
    "TxnId": -1,
    "Label": "56928591-d59d-4593-9c55-b70c287808ca",
    "Status": "Fail",
    "Message": "call frontend service failed, address=TNetworkAddress(hostname=142.58.37.43, port=9020), reason=invalid TType",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 0,
    "BeginTxnTimeMs": 0,
    "StreamLoadPlanTimeMs": 0,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 0,
    "CommitAndPublishTimeMs": 0
}

context canceled

--- task_run.go:97 func1 ---
~ could not write to database
--- task_run.go:387 runFileToDB ---
~ could not insert into `db`.`SFU_Fact_Screening_2019_tmp`.
--- task_run_write.go:307 WriteToDb ---

LOGS:

be.WARNING log:

/build/starrocks/be/src/storage/meta_reader.cpp:203 _collect(_params->fields[i], _params->cids[i], (*dsts)[i], _params->field_type[i])
/build/starrocks/be/src/storage/meta_reader.cpp:80 _collect_context.seg_collecters[_collect_context.cursor_idx]->collect(&columns)
/build/starrocks/be/src/storage/olap_meta_reader.cpp:156 _read((*result).get(), chunk_capacity - chunk_start)
/build/starrocks/be/src/exec/pipeline/scan/scan_operator.cpp:246 _get_scan_status()
W0404 05:41:26.970769  2585 fragment_mgr.cpp:201] Fail to open fragment fd4fd59b-da23-31e9-18d1-3bf8086fc3bd: Cancelled: Cancelled
/build/starrocks/be/src/runtime/stream_load/stream_load_pipe.cpp:292 _pipe->read(static_cast<uint8_t*>(data), &nread, &eof)
/build/starrocks/be/src/exec/connector_scan_node.cpp:191 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:331 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:231 _get_next_internal_vectorized(&chunk)
W0404 05:41:26.973348  2585 stream_load_executor.cpp:111] fragment execute failed, query_id=fd4fd59bda2331e9-18d13bf8086fc3bc, err_msg=Cancelled, id=fd4fd59bda2331e9-18d13bf8086fc3bc, job_id=-1, txn_id: 9105, label=raw_SFU_Fact_Screening_2019_try4_3, db=db
W0404 05:59:05.757376  2867 storage_engine.cpp:1105] tablet not found, remove rowset meta, rowset_id=02000000000001bab3443b7ff0e945b3657eb6d3df00caa7 tablet: 14264
W0404 05:59:05.758733  2867 storage_engine.cpp:1125] traverse_rowset_meta and remove 24/46 invalid rowset metas, path:/data/deploy/starrocks/be/storage duration:2ms
W0404 06:56:12.433996  2853 utils.cpp:120] Fail to report to master: THRIFT_EAGAIN (timed out)
W0404 06:56:12.783007  2853 task_worker_pool.cpp:804] Fail to report resource_usage to 127.0.0.1:9020, err=-1
W0404 06:56:18.212924  2954 thrift_rpc_helper.cpp:86] call frontend service failed, address=TNetworkAddress(hostname=127.0.0.1, port=9020), reason=invalid TType
W0404 08:30:16.205058  2955 thrift_rpc_helper.cpp:86] call frontend service failed, address=TNetworkAddress(hostname=127.0.0.1, port=9020), reason=THRIFT_EAGAIN (timed out)
W0404 08:30:16.979619  2953 stream_load_executor.cpp:187] begin transaction failed, errmsg=unknown table "db.SFU_Fact_Screening_2019_tmp"id=e2423640a6ad445a-bdd336b54add8885, job_id=-1, txn_id: -1, label=d90e6963-a3e6-4235-8a29-025b30a79beb, db=db
W0404 08:37:25.113700  2954 stream_load_executor.cpp:187] begin transaction failed, errmsg=unknown table "db.SFU_Fact_Screening_2019_tmp"id=92478a4e28700661-d526d77f681df8b8, job_id=-1, txn_id: -1, label=c8db17d1-5e04-48ee-9329-8177dd19a746, db=db
W0404 08:47:30.286139  2585 fragment_mgr.cpp:201] Fail to open fragment 5b4118a3-c5be-1089-24cc-e8b7d122c3b8: Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:191 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:331 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:231 _get_next_internal_vectorized(&chunk)
W0404 08:47:30.288226  2585 stream_load_executor.cpp:111] fragment execute failed, query_id=5b4118a3c5be1089-24cce8b7d122c3b7, err_msg=Cancelled because of runtime state is cancelled, id=5b4118a3c5be1089-24cce8b7d122c3b7, job_id=-1, txn_id: 9792, label=raw_SFU_Fact_Screening_2019_2_try1, db=db
W0404 08:47:30.289914  2952 stream_load.cpp:373] append body content failed. errmsg=Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:191 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:331 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:231 _get_next_internal_vectorized(&chunk) context=id=5b4118a3c5be1089-24cce8b7d122c3b7, job_id=-1, txn_id: 9792, label=raw_SFU_Fact_Screening_2019_2_try1, db=db
W0404 09:54:36.712942  2585 fragment_mgr.cpp:201] Fail to open fragment 214905c7-8a6f-ef95-1fb4-5009305da5b1: Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:191 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:331 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:231 _get_next_internal_vectorized(&chunk)
W0404 09:54:36.714660  2585 stream_load_executor.cpp:111] fragment execute failed, query_id=214905c78a6fef95-1fb45009305da5b0, err_msg=Cancelled because of runtime state is cancelled, id=214905c78a6fef95-1fb45009305da5b0, job_id=-1, txn_id: 9793, label=raw_SFU_Fact_Screening_2019_2_try1, db=db
W0404 09:54:36.714680  2953 stream_load.cpp:373] append body content failed. errmsg=Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:191 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:331 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:231 _get_next_internal_vectorized(&chunk) context=id=214905c78a6fef95-1fb45009305da5b0, job_id=-1, txn_id: 9793, label=raw_SFU_Fact_Screening_2019_2_try1, db=db

fe log:

2024-04-04 08:39:54,181 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 11229ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 08:42:34,650 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 5463ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 08:43:25,321 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 5667ms, current journal queue size: 0, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 08:45:59,103 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 8776ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 08:48:07,205 WARN (thrift-server-pool-10|513) [FrontendServiceImpl.loadTxnRollback():1647] failed to rollback txn 9792: transaction not found: 9792
2024-04-04 09:46:01,757 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 18371ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 09:48:48,635 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 16485ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 09:49:02,256 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 2115ms, current journal queue size: 0, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 09:52:01,977 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 6834ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes
2024-04-04 09:54:51,320 WARN (thrift-server-pool-20|687) [FrontendServiceImpl.loadTxnRollback():1647] failed to rollback txn 9793: transaction not found: 9793

StarRocks version (Required)

3.2.4 (docker: allin1-ubuntu:3.2.4)

nshangyiming commented 5 months ago

2024-04-04 08:39:54,181 WARN (JournalWriter|120) [JournalWriter.updateBatchMetrics():226] slow edit log write, batch size: 1, took: 11229ms, current journal queue size: 1, please check the IO pressure of FE LEADER node or the latency between LEADER and FOLLOWER nodes

From this log, there are probably two reasons:

  1. High GC pressure on the FE; you can check the FE GC log for more details (a quick way to check is sketched below).
  2. High disk load on the FE.
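
A minimal sketch of how one might check both suspects, assuming the allin1 docker deployment described above. The container name, FE deploy path, and GC log location are guesses inferred from the /data/deploy/starrocks/be/storage path in the BE log, and jstat/iostat availability depends on the image:

# 1. GC pressure: look for frequent or long Full GC pauses in the FE GC log
#    (log path is an assumption based on the default deploy layout)
docker exec -it <allin1-container> bash -c 'tail -n 200 /data/deploy/starrocks/fe/log/fe.gc.log*'

# Or sample the FE JVM directly, if a JDK is available inside the container
docker exec -it <allin1-container> bash -c 'jstat -gcutil $(pgrep -f StarRocksFE) 2000 10'

# 2. Disk load: watch utilization of the disk backing the FE meta / edit-log directory
iostat -x 2 5
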
alberttwong commented 5 months ago

So what is the potential workaround or fix?

jaogoy commented 2 months ago

@banmoy Can you have a look?