kzk2000 closed this issue 3 months ago
@kzk2000 I am running the Redpanda version and not seeing the same issue.
I am seeing a different issue though:
fe.warn.log:2024-06-10 13:14:31.106Z WARN (thrift-server-pool-3|158)
[RoutineLoadJob.unprotectUpdateState():1290] routine load job 10077-example_tbl2_test2 changed to
PAUSED with reason: ErrorReason{errCode = 102, msg='current error rows is more than max error num'}
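When a routine load job pauses with errCode 102, the rows that failed to parse usually point at a schema mismatch. A hedged sketch of how to inspect and resume the job, using the job name from this thread (the `ErrorLogUrls` field in the output links to samples of the rejected rows):

```sql
-- Inspect the paused job; check ReasonOfStateChanged and ErrorLogUrls
SHOW ROUTINE LOAD FOR example_tbl2_test2\G

-- Once the schema (or the data) is fixed, resume consumption
RESUME ROUTINE LOAD FOR example_tbl2_test2;
```

If a few bad rows are expected, the `max_error_number` job property can raise the tolerance instead of pausing the job.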
I will take a look, but first guess is that the schema needs changing.
If it is OK with you I will copy your files into this repo so that we can maintain them here.
The `site` column needs to be a string, not a date, in script.sql:
`site` string NOT NULL COMMENT "site url",
CREATE DATABASE sr_hub;
USE sr_hub;
CREATE TABLE example_tbl2 (
`uid` bigint NOT NULL COMMENT "uid",
`site` string NOT NULL COMMENT "site url",
`vtime` bigint NOT NULL COMMENT "vtime"
)
DISTRIBUTED BY HASH(`uid`)
PROPERTIES("replication_num"="1");
USE sr_hub;
-- STOP ROUTINE LOAD FOR example_tbl2_test2;
CREATE ROUTINE LOAD example_tbl2_test2 ON example_tbl2
PROPERTIES
(
"format" = "json",
"jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]"
)
FROM KAFKA
(
"kafka_broker_list" = "redpanda:29092",
"kafka_topic" = "test2"
);
select * from example_tbl2;
#!/usr/bin/env python3
# Copyright (c) 2021 Beijing Dingshi Zongheng Technology Co., Ltd. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import sys
import time

from kafka import KafkaProducer


def genUid(s=10000):
    return random.randint(1, s)


def getSite():
    # Weighted choice of site URLs
    site_scope = ['https://www.starrocks.io/'] * 100 + ['https://www.starrocks.io/blog'] * 34 + \
        ['https://www.starrocks.io/product/community'] * 12 + ['https://docs.starrocks.io/'] * 55
    idx = random.randint(0, len(site_scope) - 1)
    return site_scope[idx]


def getTm():
    # Random delay of up to 90 minutes (jitter * chance) behind the current time
    delay_jitter = random.randint(-1800, 0)
    chance = random.randint(0, 3)
    return int(time.time() + delay_jitter * chance)


def gen():
    """Example record: {"uid": 1, "site": "https://www.starrocks.io/", "vtime": 1621410635}"""
    data = """{ "uid": %d, "site": "%s", "vtime": %s } """ % (genUid(), getSite(), getTm())
    return data


def main():
    lines = int(sys.argv[1])
    # --advertise-kafka-addr internal://redpanda:29092,external://localhost:9092
    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # within Docker, this is redpanda:29092
    for x in range(lines):
        data = gen().encode('UTF-8')
        print(data)
        producer.send('test2', data)
        time.sleep(.2)


if __name__ == '__main__':
    main()
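As a sanity check on the producer, every generated record must parse as JSON and expose exactly the fields named in the job's jsonpaths (`$.uid`, `$.site`, `$.vtime`). A minimal, self-contained sketch that mirrors the script's format string (the `gen_record` helper below is a stand-in for illustration, not the script itself):

```python
import json
import random
import time


def gen_record():
    # Mirrors the gen() format string from the script above
    uid = random.randint(1, 10000)
    site = random.choice(['https://www.starrocks.io/', 'https://docs.starrocks.io/'])
    vtime = int(time.time())
    return """{ "uid": %d, "site": "%s", "vtime": %s } """ % (uid, site, vtime)


record = json.loads(gen_record())
# The keys must match the jsonpaths in the CREATE ROUTINE LOAD statement
assert set(record) == {"uid", "site", "vtime"}
assert isinstance(record["uid"], int)
assert isinstance(record["site"], str)
print("record parses:", record)
```

If a record fails this check, the routine load job counts it as an error row, which is exactly the path to the "current error rows is more than max error num" pause above.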
@DanRoscigno absolutely, please leverage any of my repo code and replicate, correct, and refine it here.
Btw, I eventually want to use the Redpanda version, so I'm happy to stick with that to get it all working.
Right on, the Redpanda one is the one I modified, and it works fine now. I will add it to the documentation samples folder, add it to our CI tests, and then to the docs.
@kzk2000 this is not published yet, but here is a PDF: Loading with Redpanda to StarRocks using shared-data storage _ StarRocks.pdf
I spent a good few hours with Allen Li on your community Slack to get a basic example working. Nothing sticks out as the source of the error...
I took various Docker code from your existing examples to create a MinIO "shared-data" + Kafka setup: https://github.com/kzk2000/starrocks-experiments/tree/kafka (the "kafka" branch uses regular Kafka; the main branch uses Redpanda).
For some reason, the FE and CN don't connect, and the ROUTINE LOAD job never enters the RUNNING state.
If you could either help me tweak my repo OR simply provide an example tutorial for a MinIO "shared-data" + Kafka setup, that would be super helpful to get started.
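For the FE/CN connectivity part, a first check (run from the FE's MySQL client) is whether the compute nodes have registered and are alive; a sketch using standard StarRocks commands, under the assumption of a shared-data deployment where the workers are Compute Nodes:

```sql
-- In shared-data mode the workers are Compute Nodes; Alive must be true for each
SHOW COMPUTE NODES;
```

Note also that `kafka_broker_list` is consumed by the worker nodes, not the FE, so `redpanda:29092` must be resolvable and reachable from inside the CN containers (e.g. both services on the same Docker network) before the job can leave NEED_SCHEDULE/PAUSED and enter RUNNING.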