StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
8.94k stars 1.79k forks source link

Insert fail: Cancelled because of runtime state is cancelled #51003

Open melvynator opened 2 months ago

melvynator commented 2 months ago

I'm reproducing this benchmark on the latest stable version: https://releases.starrocks.io/starrocks/StarRocks-3.1.15-centos-amd64.tar.gz

Steps to reproduce the behavior (Required)

@chaoyli I'm trying to reproduce this benchmark: https://github.com/ClickHouse/ClickBench/blob/main/starrocks/benchmark.sh

Whenever I run this section:

START=$(date +%s)
echo "Start to load data..."
curl --location-trusted \
    -u root: \
    -T "hits.tsv" \
    -H "label:hits_tsv_${START}" \
    -H "columns: WatchID,JavaEnable,Title,GoodEvent,EventTime,EventDate,CounterID,ClientIP,RegionID,UserID,CounterClass,OS,UserAgent,URL,Referer,IsRefresh,RefererCategoryID,RefererRegionID,URLCategoryID,URLRegionID,ResolutionWidth,ResolutionHeight,ResolutionDepth,FlashMajor,FlashMinor,FlashMinor2,NetMajor,NetMinor,UserAgentMajor,UserAgentMinor,CookieEnable,JavascriptEnable,IsMobile,MobilePhone,MobilePhoneModel,Params,IPNetworkID,TraficSourceID,SearchEngineID,SearchPhrase,AdvEngineID,IsArtifical,WindowClientWidth,WindowClientHeight,ClientTimeZone,ClientEventTime,SilverlightVersion1,SilverlightVersion2,SilverlightVersion3,SilverlightVersion4,PageCharset,CodeVersion,IsLink,IsDownload,IsNotBounce,FUniqID,OriginalURL,HID,IsOldCounter,IsEvent,IsParameter,DontCountHits,WithHash,HitColor,LocalEventTime,Age,Sex,Income,Interests,Robotness,RemoteIP,WindowName,OpenerName,HistoryLength,BrowserLanguage,BrowserCountry,SocialNetwork,SocialAction,HTTPError,SendTiming,DNSTiming,ConnectTiming,ResponseStartTiming,ResponseEndTiming,FetchTiming,SocialSourceNetworkID,SocialSourcePage,ParamPrice,ParamOrderID,ParamCurrency,ParamCurrencyID,OpenstatServiceName,OpenstatCampaignID,OpenstatAdID,OpenstatSourceID,UTMSource,UTMMedium,UTMCampaign,UTMContent,UTMTerm,FromTag,HasGCLID,RefererHash,URLHash,CLID" \
    http://localhost:8030/api/hits/hits/_stream_load
END=$(date +%s)
LOADTIME=$(echo "$END - $START" | bc)
echo "Load data costs $LOADTIME seconds"

Starrocks fail to insert data with an obscure message.

Expected behavior (Required)

The insert should work.

Real behavior (Required)

I've got the following error:

Start to load data...
{
    "TxnId": 3,
    "Label": "hits_tsv_1726064200",
    "Status": "Fail",
    "Message": "Cancelled because of runtime state is cancelled",
    "NumberTotalRows": 0,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 51771936768,
    "LoadTimeMs": 768067,
    "BeginTxnTimeMs": 43,
    "StreamLoadPlanTimeMs": 139,
    "ReadDataTimeMs": 519766,
    "WriteDataTimeMs": 600756,
    "CommitAndPublishTimeMs": 0
}Load data costs 768 seconds

StarRocks version (Required)

3.1.15

melvynator commented 2 months ago

In be.INFO I can find the following:

W0912 11:56:37.092803 43311 fragment_mgr.cpp:201] Fail to open fragment e8435333-64f6-ac43-8de0-7390222881b5: Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:66 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:322 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:228 _get_next_internal_vectorized(&chunk)
I0912 11:56:37.094070 43311 plan_fragment_executor.cpp:483] Fragment e8435333-64f6-ac43-8de0-7390222881b5:(Active: 10m, non-child: 0.23%)
   - InstanceAllocatedMemoryUsage: 233.76 GB
   - InstanceDeallocatedMemoryUsage: 143.10 GB
   - InstancePeakMemoryUsage: 237.51 MB
   - MemoryLimit: -1.00 B
   - RowsProduced: 87.63M
  OlapTableSink:(Active: 3m, non-child: 29.97%)
     - TxnID: 3
     - IndexNum: 1
     - ReplicatedStorage: true
     - AutomaticPartition: false
     - AllocAutoIncrementTime: 4.290ms
     - CloseWaitTime: 0.000ns
     - OpenTime: 1.572ms
     - PrepareDataTime: 9s057ms
       - ConvertChunkTime: 49.047ms
       - ValidateDataTime: 7s773ms
     - RowsFiltered: 0
     - RowsRead: 87.63M
     - RowsReturned: 87.63M
     - RpcClientSideTime: 3m31s
     - RpcServerSideTime: 0.000ns
     - RpcServerWaitFlushTime: 0.000ns
     - SendDataTime: 2m50s
       - PackChunkTime: 42s199ms
       - SendRpcTime: 0.000ns
         - CompressTime: 0.000ns
         - SerializeChunkTime: 0.000ns
       - WaitResponseTime: 2m5s
  FILE_SCAN_NODE (id=0):(Active: 6m59s, non-child: 69.80%)
     - BytesRead: 0
     - IOTaskExecTime: 0.000ns
     - IOTaskWaitTime: 0.000ns
     - NumDiskAccess: 0
     - PeakMemoryUsage: 0
     - RowsRead: 0
     - RowsReturned: 87.63M
     - RowsReturnedRate: 208.98 K/sec
     - ScanTime: 8m19s
     - ScannerQueueCounter: 147
     - ScannerQueueTime: 9m56s
     - ScannerThreadsInvoluntaryContextSwitches: 0
     - ScannerThreadsTotalWallClockTime: 0.000ns
       - MaterializeTupleTime(*): 0.000ns
       - ScannerThreadsSysTime: 0.000ns
       - ScannerThreadsUserTime: 0.000ns
     - ScannerThreadsVoluntaryContextSwitches: 0
     - TotalRawReadTime(*): 0.000ns
     - TotalReadThroughput: 0.00 /sec
    DataSource:
       - DataSourceType: FileDataSource
       - FileScanner: 0
         - CastChunkTime: 0.000ns
         - CreateChunkTime: 0.000ns
         - FileReadCount: 0
         - FileReadTime: 0.000ns
         - FillTime: 0.000ns
         - MaterializeTime: 0.000ns
         - ReadTime: 0.000ns
       - ScannerTotalTime: 0.000ns
W0912 11:56:37.094082 43311 stream_load_executor.cpp:102] fragment execute failed, query_id=e843533364f6ac43-8de07390222881b4, err_msg=Cancelled because of runtime state is cancelled, id=e843533364f6ac43-8de07390222881b4, job_id=-1, txn_id: 3, label=hits_tsv_1726141596, db=hits
W0912 11:56:37.094142 43680 stream_load.cpp:373] append body content failed. errmsg=Cancelled: Cancelled because of runtime state is cancelled
/build/starrocks/be/src/exec/connector_scan_node.cpp:66 _data_source->get_next(state, chunk)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:322 _plan->get_next(_runtime_state, &_chunk, &_done)
/build/starrocks/be/src/runtime/plan_fragment_executor.cpp:228 _get_next_internal_vectorized(&chunk) context=id=e843533364f6ac43-8de07390222881b4, job_id=-1, txn_id: 3, label=hits_tsv_1726141596, db=hits
melvynator commented 2 months ago

In the fe.warn.log I found:

2024-09-12 11:57:51.348Z WARN (thrift-server-pool-0|165) [FrontendServiceImpl.loadTxnRollback():1489] failed to rollback txn 3: transaction not found: 3

chaoyli commented 1 month ago

@melvynator Sorry for late delay. Let me have a reproduce in the 3.1.15

melvynator commented 1 month ago

@chaoyli Any luck?

chaoyli commented 1 month ago

@melvynator No luck, I'll push our team to fix it this week. They are on the national holiday last week.

silverbullet233 commented 1 month ago

@melvynator thanks for your feedback. which type of instance did you test on? Could you provide some more details about your test environment?

xiangguangyxg commented 1 month ago

@melvynator Thanks for your feedback. I tried to reproduce this issue in 3.1.15, but I inserted data successfully. Could you provide some details about your test to help us reproduce this issue ?

melvynator commented 1 month ago

Yes, I used a c6a.4xlarge instance on AWS with a Linux flavor.

I was able to load the data, but to do so, I resorted to cutting the file into 5 parts and inserting them each individually; this approach worked.

xiangguangyxg commented 1 month ago

Yes, I used a c6a.4xlarge instance on AWS with a Linux flavor.

I was able to load the data, but to do so, I resorted to cutting the file into 5 parts and inserting them each individually; this approach worked.

@melvynator Thanks for your feedback. I guess the reason for the insert failure is that the c6a.4xlarge instance has lower specifications.