StarRocks / demo

Apache License 2.0

Provide a "just works" docker-compose example with Kafka + Minio "shared-data" all working E2E #65

Closed kzk2000 closed 3 months ago

kzk2000 commented 4 months ago

I spent a good few hours with Allen Li on your community Slack trying to get a basic example working. Nothing obvious stands out as the source of the error.

I combined Docker code from several of your existing examples to create a Minio "shared-data" + Kafka setup: https://github.com/kzk2000/starrocks-experiments/tree/kafka (branch "kafka" uses regular Kafka, the main branch uses "redpanda").

For some reason, the FE and CN don't connect, and the Routine Load job never enters the RUNNING state.

If you could either help me tweak my repo or simply provide an example tutorial for a Minio "shared-data" + Kafka setup, that would be super helpful for getting started.
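One way to confirm whether the job ever reaches RUNNING is to poll SHOW ROUTINE LOAD from a client. The following is only a minimal sketch, assuming a Python DB-API cursor (for example from a MySQL client library connected to the FE) and assuming the result set exposes a column named State; the helper names are hypothetical and not part of any StarRocks tooling.

```python
import time


def routine_load_state(cursor, job_name):
    """Return the State reported by SHOW ROUTINE LOAD, or None if no row comes back."""
    # job_name is interpolated directly, so pass only trusted identifiers.
    cursor.execute("SHOW ROUTINE LOAD FOR %s" % job_name)
    row = cursor.fetchone()
    if row is None:
        return None
    # Look the column up by name (assumed to be "State") rather than by position.
    columns = [col[0] for col in cursor.description]
    return row[columns.index("State")]


def wait_until_running(cursor, job_name, timeout_s=60.0, poll_s=2.0):
    """Poll until the job reports RUNNING or the timeout expires; return the last state seen."""
    deadline = time.monotonic() + timeout_s
    state = routine_load_state(cursor, job_name)
    while state != "RUNNING" and time.monotonic() < deadline:
        time.sleep(poll_s)
        state = routine_load_state(cursor, job_name)
    return state
```

If the returned state is anything other than RUNNING, the FE logs are the next place to look for the reason.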

DanRoscigno commented 3 months ago

@kzk2000 I am running the Redpanda version and not seeing the same issue.

I am seeing a different issue though:

fe.warn.log:2024-06-10 13:14:31.106Z WARN (thrift-server-pool-3|158)
[RoutineLoadJob.unprotectUpdateState():1290] routine load job 10077-example_tbl2_test2 changed to
PAUSED with reason: ErrorReason{errCode = 102, msg='current error rows is more than max error num'}

I will take a look, but my first guess is that the schema needs changing.
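For anyone scanning fe.warn.log for the same symptom, the state change and its reason can be pulled out of the message with a regular expression. This is a rough sketch written against the single excerpt above; the pattern and the function name are assumptions, and other log lines may be shaped differently.

```python
import re

# Pattern modeled on the one "changed to ... with reason" excerpt quoted above.
STATE_CHANGE = re.compile(
    r"routine load job (?P<job>\S+) changed to\s+(?P<state>[A-Z_]+) "
    r"with reason: ErrorReason\{errCode = (?P<code>\d+), msg='(?P<msg>[^']*)'\}"
)


def parse_state_change(text):
    """Return job, state, error code and message from a log excerpt, or None."""
    # The excerpt is wrapped across lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    match = STATE_CHANGE.search(flat)
    if match is None:
        return None
    return {
        "job": match.group("job"),
        "state": match.group("state"),
        "err_code": int(match.group("code")),
        "msg": match.group("msg"),
    }
```

Applied to the excerpt above, this yields the job name 10077-example_tbl2_test2, the state PAUSED, and error code 102.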

DanRoscigno commented 3 months ago

If it is OK with you I will copy your files into this repo so that we can maintain them here.

The site column needs to be a string, not a date, in script.sql:

`site` string NOT NULL COMMENT "site url",
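To illustrate why a date-typed site column would trip the error-row limit: every value the generator emits for site is a URL, so none of them can be read as a date and every row would count as an error row. The sketch below uses Python's ISO date parsing purely as a stand-in for the database's own cast rules, which may differ; the function names are hypothetical.

```python
from datetime import date


def parses_as_date(value):
    """Return True if value is an ISO date such as 2024-06-10."""
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def count_error_rows(sites):
    """Count the values that could not be stored in a date-typed column."""
    return sum(1 for site in sites if not parses_as_date(site))


# The four URLs that gen.py chooses from.
SITES = [
    "https://www.starrocks.io/",
    "https://www.starrocks.io/blog",
    "https://www.starrocks.io/product/community",
    "https://docs.starrocks.io/",
]
```

Here count_error_rows(SITES) equals len(SITES): all rows fail, which is consistent with the job pausing on "current error rows is more than max error num".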

script.sql

CREATE DATABASE sr_hub;

USE sr_hub;

CREATE TABLE example_tbl2 (
    `uid` bigint NOT NULL COMMENT "uid",
    `site` string NOT NULL COMMENT "site url",
    `vtime` bigint NOT NULL COMMENT "vtime"
)
DISTRIBUTED BY HASH(`uid`)
PROPERTIES("replication_num"="1");

USE sr_hub;

-- STOP ROUTINE LOAD FOR example_tbl2_test2;

CREATE ROUTINE LOAD example_tbl2_test2 ON example_tbl2
PROPERTIES
(
    "format" = "json",
    "jsonpaths" ="[\"$.uid\",\"$.site\",\"$.vtime\"]"
)
FROM KAFKA
(
    "kafka_broker_list" = "redpanda:29092",
    "kafka_topic" = "test2"
);

SELECT * FROM example_tbl2;
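The jsonpaths property above lists one path per column. As I understand it, the extracted values are matched to the table columns in order, which can be sketched in Python as follows. This is an illustration only, not how StarRocks implements it; extract_row is a hypothetical helper and handles nothing beyond top-level "$.key" paths.

```python
import json

# Same paths as in the CREATE ROUTINE LOAD statement above.
JSONPATHS = ["$.uid", "$.site", "$.vtime"]


def extract_row(message, jsonpaths=JSONPATHS):
    """Return one tuple of values, in jsonpaths order, from a JSON message."""
    doc = json.loads(message)
    row = []
    for path in jsonpaths:
        # Only top-level paths of the form "$.key" are supported in this sketch.
        if not path.startswith("$."):
            raise ValueError("unsupported path: %s" % path)
        # A missing key becomes None, comparable to a NULL in the target column.
        row.append(doc.get(path[2:]))
    return tuple(row)
```

A message missing one of the keys would produce a None in that position, which a NOT NULL column would reject.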

gen.py

#!/usr/bin/env python3
# Copyright (c) 2021 Beijing Dingshi Zongheng Technology Co., Ltd. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
import random
import time
from kafka import KafkaProducer

def genUid(s=10000):
    return random.randint(1, s)

def getSite():
    # Repeating each URL weights how often it is picked.
    site_scope = ['https://www.starrocks.io/'] * 100 + ['https://www.starrocks.io/blog'] * 34 + \
                  ['https://www.starrocks.io/product/community'] * 12 + ['https://docs.starrocks.io/'] * 55
    return random.choice(site_scope)

def getTm():
    delay_jitter = random.randint(-1800, 0)
    chance = random.randint(0, 3)
    return int(time.time() + delay_jitter * chance)

# Example message:
# {"uid": 1, "site": "https://www.starrocks.io/", "vtime": 1621410635}

def gen():
    # Build one JSON message with the keys that the jsonpaths in script.sql expect.
    data = """{ "uid": %d, "site": "%s", "vtime": %d }""" % (genUid(), getSite(), getTm())
    return data

def main():
    lines = int(sys.argv[1])
    # --advertise-kafka-addr internal://redpanda:29092,external://localhost:9092
    producer = KafkaProducer(bootstrap_servers='localhost:9092')  # within docker, this is redpanda:29092

    for x in range(lines):
        data = gen().encode('UTF-8')
        print(data)
        producer.send('test2', data)
        time.sleep(.2)

    # send() is asynchronous; flush so buffered messages are delivered before exit.
    producer.flush()

if __name__ == '__main__':
    main()
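
The comment in main() notes that the broker address depends on where the script runs: localhost:9092 from the host, redpanda:29092 inside the Docker network. One possible way to avoid editing the file for each case is to read the address from the environment. This is a small sketch; the variable name KAFKA_BOOTSTRAP_SERVERS is an assumption, not something the repo defines.

```python
import os


def bootstrap_servers(env=None):
    """Return the broker address, defaulting to the host-side listener."""
    # KAFKA_BOOTSTRAP_SERVERS is a hypothetical variable name chosen for this sketch.
    env = os.environ if env is None else env
    return env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
```

The producer could then be created with KafkaProducer(bootstrap_servers=bootstrap_servers()), and a container would set the variable to redpanda:29092.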

kzk2000 commented 3 months ago

@DanRoscigno Absolutely, please use any of the code in my repo and replicate, correct, or refine it here.

kzk2000 commented 3 months ago

Btw, eventually I want to use the Redpanda version, so I'm happy to stick with that to get it all working.

DanRoscigno commented 3 months ago

Right on. The Redpanda one is the one I modified, and it works fine now. I will add it to the documentation samples folder and to our CI tests, then to the docs.

DanRoscigno commented 3 months ago

@kzk2000 This is not published yet, but here is a PDF: Loading with Redpanda to StarRocks using shared-data storage _ StarRocks.pdf