Closed: alberttwong closed this 7 months ago.
Cool! The discover function uses this query: https://github.com/flarco/dbio/blob/main/database/templates/mysql.yaml#L95-L122
Do you think something needs adjustment? I will create a new template for StarRocks.
atwong@Albert-CelerData tpcds-parquet % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn MYSQL --tgt-object 'albert.callcenter' --mode full-refresh
12:08PM INF connecting to target database (mysql)
12:08PM INF reading from source file system (file)
12:08PM INF writing to target database [mode: full-refresh]
12:08PM INF streaming data
12:08PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
context canceled
~ insertBatch
context canceled
~ could not prepare Tx: INSERT INTO `albert`.`callcenter_tmp` (`cc_call_center_sk`, `cc_call_center_id`, `cc_rec_start_date`, `cc_rec_end_date`, `cc_closed_date_sk`, `cc_open_date_sk`, `cc_name`, `cc_class`, `cc_employees`, `cc_sq_ft`, `cc_hours`, `cc_manager`, `cc_mkt_id`, `cc_mkt_class`, `cc_mkt_desc`, `cc_market_manager`, `cc_division`, `cc_division_name`, `cc_company`, `cc_company_name`, `cc_street_number`, `cc_street_name`, `cc_street_type`, `cc_suite_number`, `cc_city`, `cc_county`, `cc_state`, `cc_zip`, `cc_country`, `cc_gmt_offset`, `cc_tax_percentage`, `_sling_loaded_at`) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Error 1064: Unsupported command(COM_STMT_PREPARE)
~ could not insert into `albert`.`callcenter_tmp`.
context canceled
~ insertBatch
context canceled
~ could not prepare Tx: INSERT INTO `albert`.`callcenter_tmp` (`cc_call_center_sk`, `cc_call_center_id`, `cc_rec_start_date`, `cc_rec_end_date`, `cc_closed_date_sk`, `cc_open_date_sk`, `cc_name`, `cc_class`, `cc_employees`, `cc_sq_ft`, `cc_hours`, `cc_manager`, `cc_mkt_id`, `cc_mkt_class`, `cc_mkt_desc`, `cc_market_manager`, `cc_division`, `cc_division_name`, `cc_company`, `cc_company_name`, `cc_street_number`, `cc_street_name`, `cc_street_type`, `cc_suite_number`, `cc_city`, `cc_county`, `cc_state`, `cc_zip`, `cc_country`, `cc_gmt_offset`, `cc_tax_percentage`, `_sling_loaded_at`) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Error 1064: Unsupported command(COM_STMT_PREPARE)
Cool, I figured so. I saw a `COPY` command in the docs, but I'm not sure how it works yet.
What's the best way to bulk insert from a MySQL client?
> What's the best way to bulk insert from a MySQL client?
INSERT INTO employees (id, name, age, department) VALUES (1, 'Peter Parker', 19, 'IT'), (2, 'Bruce Wayne', 29, 'Automobile'), (3, 'Tony Stark', 45, 'Finance'), (4, 'Clark Kent', 50, 'Media');
The issue is that we emulate the MySQL wire protocol. We have three different table types (duplicate, aggregate, and primary key), so the SELECT is fine, but table creation is going to be a problem. We need something like this:
create database albert;
use albert;

CREATE TABLE employees (
    id INT NOT NULL,
    name VARCHAR(50),
    age INT,
    department VARCHAR(50)
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
> The discover function uses this query: https://github.com/flarco/dbio/blob/main/database/templates/mysql.yaml#L95-L122
I don't know if it's expected, but `select table_catalog from information_schema.columns;` shows NULL on my end, so I had to remove it from your query. I logged another GitHub issue on my end: https://github.com/StarRocks/starrocks/issues/40482
with tables as (
    select
        table_catalog,
        table_schema,
        table_name,
        case table_type
            when 'VIEW' then true
            else false
        end as is_view
    from information_schema.tables
    where table_schema = 'tpcds' and table_name = 'call_center'
)
select
    cols.table_schema as schema_name,
    cols.table_name as table_name,
    tables.is_view as is_view,
    cols.column_name as column_name,
    cols.data_type as data_type,
    cols.ordinal_position as position
from information_schema.columns cols
join tables
    on tables.table_schema = cols.table_schema
    and tables.table_name = cols.table_name
order by cols.table_catalog, cols.table_schema, cols.table_name, cols.ordinal_position;
I think this should work. Please confirm that this is what discover is supposed to output.
StarRocks > with tables as (
-> select
-> table_catalog,
-> table_schema,
-> table_name,
-> case table_type
-> when 'VIEW' then true
-> else false
-> end as is_view
-> from information_schema.tables
-> where table_schema = 'tpcds' and table_name = 'call_center'
-> )
-> select
-> cols.table_schema as schema_name,
-> cols.table_name as table_name,
-> tables.is_view as is_view,
-> cols.column_name as column_name,
-> cols.data_type as data_type,
-> cols.ordinal_position as position
-> from information_schema.columns cols
-> join tables
-> on tables.table_schema = cols.table_schema
-> and tables.table_name = cols.table_name
-> order by cols.table_catalog, cols.table_schema, cols.table_name, cols.ordinal_position;
+-------------+-------------+---------+-------------------+-----------+----------+
| schema_name | table_name | is_view | column_name | data_type | position |
+-------------+-------------+---------+-------------------+-----------+----------+
| tpcds | call_center | 0 | cc_call_center_sk | int | 1 |
| tpcds | call_center | 0 | cc_call_center_id | varchar | 2 |
| tpcds | call_center | 0 | cc_rec_start_date | date | 3 |
| tpcds | call_center | 0 | cc_rec_end_date | date | 4 |
| tpcds | call_center | 0 | cc_closed_date_sk | int | 5 |
| tpcds | call_center | 0 | cc_open_date_sk | int | 6 |
| tpcds | call_center | 0 | cc_name | varchar | 7 |
| tpcds | call_center | 0 | cc_class | varchar | 8 |
| tpcds | call_center | 0 | cc_employees | int | 9 |
| tpcds | call_center | 0 | cc_sq_ft | int | 10 |
| tpcds | call_center | 0 | cc_hours | varchar | 11 |
| tpcds | call_center | 0 | cc_manager | varchar | 12 |
| tpcds | call_center | 0 | cc_mkt_id | int | 13 |
| tpcds | call_center | 0 | cc_mkt_class | varchar | 14 |
| tpcds | call_center | 0 | cc_mkt_desc | varchar | 15 |
| tpcds | call_center | 0 | cc_market_manager | varchar | 16 |
| tpcds | call_center | 0 | cc_division | int | 17 |
| tpcds | call_center | 0 | cc_division_name | varchar | 18 |
| tpcds | call_center | 0 | cc_company | int | 19 |
| tpcds | call_center | 0 | cc_company_name | varchar | 20 |
| tpcds | call_center | 0 | cc_street_number | varchar | 21 |
| tpcds | call_center | 0 | cc_street_name | varchar | 22 |
| tpcds | call_center | 0 | cc_street_type | varchar | 23 |
| tpcds | call_center | 0 | cc_suite_number | varchar | 24 |
| tpcds | call_center | 0 | cc_city | varchar | 25 |
| tpcds | call_center | 0 | cc_county | varchar | 26 |
| tpcds | call_center | 0 | cc_state | varchar | 27 |
| tpcds | call_center | 0 | cc_zip | varchar | 28 |
| tpcds | call_center | 0 | cc_country | varchar | 29 |
| tpcds | call_center | 0 | cc_gmt_offset | decimal | 30 |
| tpcds | call_center | 0 | cc_tax_percentage | decimal | 31 |
+-------------+-------------+---------+-------------------+-----------+----------+
31 rows in set (0.08 sec)
> I think this should work. Please confirm that this is what discover is supposed to output.
Yes looks good, thanks.
I'll take it from here. The DDL should be fine; I just need to figure out how to push the primary key through.
Question about that: is the `PRIMARY KEY (id) DISTRIBUTED BY HASH(id)` part mandatory? Not too sure what the best practice is.
> Question about that: is the `PRIMARY KEY (id) DISTRIBUTED BY HASH(id)` part mandatory? Not too sure what the best practice is.
It can be one of several table types. For loading data into a temp table, I would recommend a primary key table for row uniqueness; the rows can then be inserted into the final table type. Also, I happened to pick `id` as the primary key, but it can be a combination of columns.
CREATE TABLE IF NOT EXISTS detail (
    event_time DATETIME NOT NULL COMMENT "datetime of event",
    event_type INT NOT NULL COMMENT "type of event",
    user_id INT COMMENT "id of user",
    device_code INT COMMENT "device code",
    channel INT COMMENT ""
)
DUPLICATE KEY(event_time, event_type)
DISTRIBUTED BY HASH(user_id);

CREATE TABLE IF NOT EXISTS example_db.aggregate_tbl (
    site_id LARGEINT NOT NULL COMMENT "id of site",
    date DATE NOT NULL COMMENT "time of event",
    city_code VARCHAR(20) COMMENT "city_code of user",
    pv BIGINT SUM DEFAULT "0" COMMENT "total page views"
)
AGGREGATE KEY(site_id, date, city_code)
DISTRIBUTED BY HASH(site_id)
PROPERTIES (
    "replication_num" = "3"
);
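Since the two examples above cover the duplicate and aggregate types, here is what the third type, a primary key table (the one recommended for the temp table), looks like. This is a sketch following the same pattern; the table and column names are illustrative, not from this thread:

```sql
-- Primary key table: key columns must be NOT NULL and enforce row uniqueness
CREATE TABLE IF NOT EXISTS example_db.pk_tbl (
    id BIGINT NOT NULL COMMENT "unique row id",
    name VARCHAR(50) COMMENT "name",
    age INT COMMENT "age"
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
```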
@alberttwong trying to stand up a docker instance, following the steps here: https://docs.starrocks.io/docs/quick_start/deploy_with_docker/
It's not working, it seems... any advice? See the log output below.
It keeps printing `WARN FE service query port:9030 is NOT alive yet!`
Here is my config from docker-compose.yaml:
starrocks:
  image: starrocks/allin1-ubuntu
  restart: unless-stopped
  hostname: devbox-starrocks
  container_name: devbox-starrocks
  environment: {}
  ports:
    - "9030:9030"
    - "8030:8030"
    - "8040:8040"
I also tried with `docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu`, same result.
larco@mpc:/__/devbox$ dcp up starrocks
Recreating devbox-starrocks ... done
Attaching to devbox-starrocks
devbox-starrocks | ****************************************************************************************************
devbox-starrocks | * ________ _________ ________ ________ ________ ________ ________ ___ __ ________
devbox-starrocks | * |\ ____\|\___ ___\\ __ \|\ __ \|\ __ \|\ __ \|\ ____\|\ \|\ \ |\ ____\
devbox-starrocks | * \ \ \___|\|___ \ \_\ \ \|\ \ \ \|\ \ \ \|\ \ \ \|\ \ \ \___|\ \ \/ /|\ \ \___|_
devbox-starrocks | * \ \_____ \ \ \ \ \ \ __ \ \ _ _\ \ _ _\ \ \\\ \ \ \ \ \ ___ \ \_____ \
devbox-starrocks | * \|____|\ \ \ \ \ \ \ \ \ \ \ \\ \\ \ \\ \\ \ \\\ \ \ \____\ \ \\ \ \|____|\ \
devbox-starrocks | * ____\_\ \ \ \__\ \ \__\ \__\ \__\\ _\\ \__\\ _\\ \_______\ \_______\ \__\\ \__\____\_\ \
devbox-starrocks | * |\_________\ \|__| \|__|\|__|\|__|\|__|\|__|\|__|\|_______|\|_______|\|__| \|__|\_________\
devbox-starrocks | * \|_________| \|_________|
devbox-starrocks | *
devbox-starrocks | * ________ ___ ___ ___ ________ _____
devbox-starrocks | * |\ __ \|\ \ |\ \ |\ \|\ ___ \ / __ \
devbox-starrocks | * \ \ \|\ \ \ \ \ \ \ \ \ \ \ \\ \ \|\/_|\ \
devbox-starrocks | * \ \ __ \ \ \ \ \ \ \ \ \ \ \\ \ \|/ \ \ \
devbox-starrocks | * \ \ \ \ \ \ \____\ \ \____\ \ \ \ \\ \ \ \ \ \
devbox-starrocks | * \ \__\ \__\ \_______\ \_______\ \__\ \__\\ \__\ \ \__\
devbox-starrocks | * \|__|\|__|\|_______|\|_______|\|__|\|__| \|__| \|__|
devbox-starrocks | *
devbox-starrocks | ****************************************************************************************************
devbox-starrocks |
devbox-starrocks | StarRocks [(Blazing Fast)]> _
devbox-starrocks |
devbox-starrocks |
devbox-starrocks | 2024-02-03 12:15:36,052 INFO Set uid to user 0 succeeded
devbox-starrocks | 2024-02-03 12:15:36,055 INFO RPC interface 'supervisor' initialized
devbox-starrocks | 2024-02-03 12:15:36,055 CRIT Server 'unix_http_server' running without any HTTP authentication checking
devbox-starrocks | 2024-02-03 12:15:36,055 INFO supervisord started with pid 6
devbox-starrocks | 2024-02-03 12:15:37,058 INFO spawned: 'beservice' with pid 524
devbox-starrocks | 2024-02-03 12:15:37,059 INFO spawned: 'broker' with pid 525
devbox-starrocks | 2024-02-03 12:15:37,061 INFO spawned: 'director' with pid 526
devbox-starrocks | 2024-02-03 12:15:37,062 INFO spawned: 'feproxy' with pid 527
devbox-starrocks | 2024-02-03 12:15:37,064 INFO spawned: 'feservice' with pid 530
devbox-starrocks | 2024-02-03 12:15:37+00:00 INFO checking if need to perform auto registring Backend and Broker ...
devbox-starrocks | 2024-02-03 12:15:37,205 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:37+00:00 INFO checking if FE service query port:9030 alive or not ...
devbox-starrocks | 2024-02-03 12:15:37+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:37,759 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:38,764 INFO spawned: 'broker' with pid 2186
devbox-starrocks | 2024-02-03 12:15:38,766 INFO spawned: 'feservice' with pid 2187
devbox-starrocks | 2024-02-03 12:15:38,767 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:38,871 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:39+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:39,343 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:40,349 INFO spawned: 'beservice' with pid 2860
devbox-starrocks | 2024-02-03 12:15:41+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:41,349 INFO spawned: 'broker' with pid 3324
devbox-starrocks | 2024-02-03 12:15:41,350 INFO spawned: 'feservice' with pid 3325
devbox-starrocks | 2024-02-03 12:15:41,351 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:41,470 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO success: director entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO success: feproxy entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:43+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:43,361 INFO spawned: 'beservice' with pid 3998
devbox-starrocks | 2024-02-03 12:15:44,605 INFO spawned: 'broker' with pid 4459
devbox-starrocks | 2024-02-03 12:15:44,606 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:44,707 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:45+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:45,368 INFO gave up: broker entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:46,373 INFO spawned: 'feservice' with pid 4603
devbox-starrocks | 2024-02-03 12:15:47+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:47,373 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:48,378 INFO spawned: 'beservice' with pid 5136
devbox-starrocks | 2024-02-03 12:15:48,379 INFO gave up: feservice entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:49+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:49,383 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:50,385 INFO gave up: beservice entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:51+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:53+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:55+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:57+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:59+00:00 WARN FE service query port:9030 is NOT alive yet!
^CGracefully stopping... (press Ctrl+C again to force)
Stopping devbox-starrocks ... done
@flarco did you try only `docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd starrocks/allin1-ubuntu`?
atwong@Albert-CelerData ~ % docker pull starrocks/allin1-ubuntu
Using default tag: latest
latest: Pulling from starrocks/allin1-ubuntu
Digest: sha256:cc3ef33b6a812ae256027f15fb56764ecdc2eb378b986cf2cf367292504442e8
Status: Image is up to date for starrocks/allin1-ubuntu:latest
docker.io/starrocks/allin1-ubuntu:latest
What's Next?
View a summary of image vulnerabilities and recommendations → docker scout quickview starrocks/allin1-ubuntu
atwong@Albert-CelerData ~ % docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
****************************************************************************************************
* ________ _________ ________ ________ ________ ________ ________ ___ __ ________
* |\ ____\|\___ ___\\ __ \|\ __ \|\ __ \|\ __ \|\ ____\|\ \|\ \ |\ ____\
* \ \ \___|\|___ \ \_\ \ \|\ \ \ \|\ \ \ \|\ \ \ \|\ \ \ \___|\ \ \/ /|\ \ \___|_
* \ \_____ \ \ \ \ \ \ __ \ \ _ _\ \ _ _\ \ \\\ \ \ \ \ \ ___ \ \_____ \
* \|____|\ \ \ \ \ \ \ \ \ \ \ \\ \\ \ \\ \\ \ \\\ \ \ \____\ \ \\ \ \|____|\ \
* ____\_\ \ \ \__\ \ \__\ \__\ \__\\ _\\ \__\\ _\\ \_______\ \_______\ \__\\ \__\____\_\ \
* |\_________\ \|__| \|__|\|__|\|__|\|__|\|__|\|__|\|_______|\|_______|\|__| \|__|\_________\
* \|_________| \|_________|
*
* ________ ___ ___ ___ ________ _____
* |\ __ \|\ \ |\ \ |\ \|\ ___ \ / __ \
* \ \ \|\ \ \ \ \ \ \ \ \ \ \ \\ \ \|\/_|\ \
* \ \ __ \ \ \ \ \ \ \ \ \ \ \\ \ \|/ \ \ \
* \ \ \ \ \ \ \____\ \ \____\ \ \ \ \\ \ \ \ \ \
* \ \__\ \__\ \_______\ \_______\ \__\ \__\\ \__\ \ \__\
* \|__|\|__|\|_______|\|_______|\|__|\|__| \|__| \|__|
*
****************************************************************************************************
StarRocks [(Blazing Fast)]> _
2024-02-03 14:40:02,805 INFO Set uid to user 0 succeeded
2024-02-03 14:40:02,806 INFO RPC interface 'supervisor' initialized
2024-02-03 14:40:02,806 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-02-03 14:40:02,806 INFO supervisord started with pid 7
2024-02-03 14:40:03,809 INFO spawned: 'beservice' with pid 525
2024-02-03 14:40:03,809 INFO spawned: 'broker' with pid 526
2024-02-03 14:40:03,810 INFO spawned: 'director' with pid 527
2024-02-03 14:40:03,811 INFO spawned: 'feproxy' with pid 528
2024-02-03 14:40:03,812 INFO spawned: 'feservice' with pid 529
2024-02-03 14:40:03+00:00 INFO checking if need to perform auto registring Backend and Broker ...
2024-02-03 14:40:03+00:00 INFO checking if FE service query port:9030 alive or not ...
2024-02-03 14:40:04+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:06+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:08+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:09,021 INFO success: beservice entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: broker entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: director entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: feproxy entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: feservice entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:10+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:12+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:14+00:00 INFO FE service query port:9030 is alive!
2024-02-03 14:40:14+00:00 INFO generate my.cnf file ...
2024-02-03 14:40:14+00:00 INFO check if need to add BE into FE service ...
2024-02-03 14:40:14+00:00 INFO Add BE(127.0.0.1:9050) into FE service ...
2024-02-03 14:40:14+00:00 INFO BE service already added into FE service ...
2024-02-03 14:40:14+00:00 INFO check if need to add BROKER into FE service ...
2024-02-03 14:40:14+00:00 INFO Add BROKER(127.0.0.1:8000) into FE service ...
2024-02-03 14:40:14+00:00 INFO broker service already added into FE service ...
2024-02-03 14:40:14+00:00 INFO cluster initialization DONE!
2024-02-03 14:40:14+00:00 INFO wait a few seconds for BE and Broker's heartbeat ...
2024-02-03 14:40:24+00:00 INFO StarRocks Cluster information details:
*************************** 1. row ***************************
Name: 127.0.0.1_9010_1706971205423
IP: 127.0.0.1
EditLogPort: 9010
HttpPort: 8030
QueryPort: 9030
RpcPort: 9020
Role: LEADER
ClusterId: 1851562636
Join: true
Alive: true
ReplayedJournalId: 16
LastHeartbeat: 2024-02-03 14:40:26
IsHelper: true
ErrMsg:
StartTime: 2024-02-03 14:40:12
Version: 3.2.2-269e832
*************************** 1. row ***************************
BackendId: 10004
IP: 127.0.0.1
HeartbeatPort: 9050
BePort: 9060
HttpPort: 8040
BrpcPort: 8060
LastStartTime: 2024-02-03 14:40:16
LastHeartbeat: 2024-02-03 14:40:26
Alive: true
SystemDecommissioned: false
ClusterDecommissioned: false
TabletNum: 0
DataUsedCapacity: 0.000 B
AvailCapacity: 57.645 GB
TotalCapacity: 117.028 GB
UsedPct: 50.74 %
MaxDiskUsedPct: 50.74 %
ErrMsg:
Version: 3.2.2-269e832
Status: {"lastSuccessReportTabletsTime":"2024-02-03 14:40:16"}
DataTotalCapacity: 57.645 GB
DataUsedPct: 0.00 %
CpuCores: 10
NumRunningQueries: 0
MemUsedPct: 1.85 %
CpuUsedPct: 0.0 %
*************************** 1. row ***************************
Name: allin1broker
IP: 127.0.0.1
Port: 8000
Alive: true
LastStartTime: 2024-02-03 14:40:16
LastUpdateTime: 2024-02-03 14:40:26
ErrMsg:
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO * IMPORTANT NOTICE!
If FE/BE state needs to be persisted, please be sure the following directories are mounted:
* FE service meta: /data/deploy/starrocks/fe/meta
* BE service storage: /data/deploy/starrocks/be/storage
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO FE mysql query port: 9030
2024-02-03 14:40:26+00:00 INFO FE http service port: 8080
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO Enjoy the journal to StarRocks blazing-fast lake-house engine!
I got it to work on my other server, thanks.
Could you build the binary yourself and test with this branch: https://github.com/slingdata-io/sling-cli/pull/137
Here is the bulk of the new code: https://github.com/slingdata-io/sling-cli/blob/v1.1.0/core/dbio/database/database_starrocks.go
Per the docs, using `INSERT VALUES` is not recommended for large datasets, so sling will accept a `fe_url` property and attempt to load there via Stream Load. See here for details. If no `fe_url` is provided, it will fall back to `INSERT VALUES`.
See sling docs here: https://docs.slingdata.io/connections/database-connections/starrocks
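For context, the Stream Load endpoint that `fe_url` points at can also be exercised by hand with curl. This is a sketch, not sling's implementation; the database, table, file, label, and credentials are placeholders, and it requires a running StarRocks FE on port 8030:

```shell
# Manual StarRocks Stream Load of a CSV into albert.call_center2 (placeholders).
# --location-trusted re-sends credentials when the FE redirects the PUT to a BE.
curl --location-trusted -u root: \
  -H "Expect:100-continue" \
  -H "label:call_center2_load_1" \
  -H "column_separator:," \
  -T example1.csv \
  -XPUT http://localhost:8030/api/albert/call_center2/_stream_load
```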
atwong@Albert-CelerData Downloads % sling conns set STARROCKSLOCAL url=starrocks://root:@localhost:9030/albert
11:55AM INF connection `STARROCKSLOCAL` has been set in /Users/atwong/.sling/env.yaml. Please test with `sling conns test STARROCKSLOCAL`
atwong@Albert-CelerData Downloads % sling conns test STARROCKSLOCAL
11:55AM INF success!
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.callcenter' --mode full-refresh
11:55AM INF connecting to target database (starrocks)
11:55AM INF reading from source file system (file)
11:55AM INF writing to target database [mode: full-refresh]
11:55AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`callcenter_tmp`
~ Could not generate DDL for `albert`.`callcenter_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.callcenter' --mode full-refresh
12:01PM INF connecting to target database (starrocks)
12:01PM INF reading from source file system (file)
12:01PM INF writing to target database [mode: full-refresh]
12:01PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`callcenter_tmp`
~ Could not generate DDL for `albert`.`callcenter_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData Downloads % cat ~/.sling/env.yaml
# Environment Credentials for Sling CLI
# See https://docs.slingdata.io/sling-cli/environment

connections:
  MYSQL:
    type: mysql
    url: mysql://admin:admin@hw2aq4jfy.cloud-app.celerdata.com:9030/albert
  MYSQLLOCAL:
    type: mysql
    url: mysql://root:@localhost:9030/albert
  STARROCKSLOCAL:
    type: starrocks
    url: starrocks://root:@localhost:9030/albert
    fe_url: http://localhost:8030/api/albert/albert/_stream_load

variables: {}
Yes, the error says `did not provide primary key for creating StarRocks table`.
Did you provide a PK? :)
:-) How did you test? Also, in the docs I didn't see instructions on how to configure a PK.
StarRocks > CREATE TABLE call_center(cc_call_center_id VARCHAR, cc_rec_start_date DATE, cc_rec_end_date DATE, cc_closed_date_sk INTEGER, cc_open_date_sk INTEGER, cc_name VARCHAR, cc_class VARCHAR, cc_employees INTEGER, cc_sq_ft INTEGER, cc_hours VARCHAR, cc_manager VARCHAR, cc_mkt_id INTEGER, cc_mkt_class VARCHAR, cc_mkt_desc VARCHAR, cc_market_manager VARCHAR, cc_division INTEGER, cc_division_name VARCHAR, cc_company INTEGER, cc_company_name VARCHAR, cc_street_number VARCHAR, cc_street_name VARCHAR, cc_street_type VARCHAR, cc_suite_number VARCHAR, cc_city VARCHAR, cc_county VARCHAR, cc_state VARCHAR, cc_zip VARCHAR, cc_country VARCHAR, cc_gmt_offset DECIMAL(5,2), cc_tax_percentage DECIMAL(5,2)) PRIMARY KEY (cc_call_center_id) DISTRIBUTED BY HASH(cc_call_center_id);
Query OK, 0 rows affected (0.01 sec)
StarRocks > CREATE TABLE call_center_tmp(cc_call_center_id VARCHAR, cc_rec_start_date DATE, cc_rec_end_date DATE, cc_closed_date_sk INTEGER, cc_open_date_sk INTEGER, cc_name VARCHAR, cc_class VARCHAR, cc_employees INTEGER, cc_sq_ft INTEGER, cc_hours VARCHAR, cc_manager VARCHAR, cc_mkt_id INTEGER, cc_mkt_class VARCHAR, cc_mkt_desc VARCHAR, cc_market_manager VARCHAR, cc_division INTEGER, cc_division_name VARCHAR, cc_company INTEGER, cc_company_name VARCHAR, cc_street_number VARCHAR, cc_street_name VARCHAR, cc_street_type VARCHAR, cc_suite_number VARCHAR, cc_city VARCHAR, cc_county VARCHAR, cc_state VARCHAR, cc_zip VARCHAR, cc_country VARCHAR, cc_gmt_offset DECIMAL(5,2), cc_tax_percentage DECIMAL(5,2)) PRIMARY KEY (cc_call_center_id) DISTRIBUTED BY HASH(cc_call_center_id);
Query OK, 0 rows affected (0.01 sec)
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center' --mode full-refresh
12:53PM INF connecting to target database (starrocks)
12:53PM INF reading from source file system (file)
12:53PM INF writing to target database [mode: full-refresh]
12:53PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center_tmp`
~ Could not generate DDL for `albert`.`call_center_tmp`
did not provide primary key for creating StarRocks table
You have to provide it in your flags, like this:
`--mode incremental --primary-key 'col1,col2' --update-key 'last_modified_dt'`
See examples: https://docs.slingdata.io/sling-cli/run/configuration#cli-flags
How do you select a primary key with this kind of CSV?
atwong@Albert-CelerData auxjars % cat ~/Downloads/example1.csv
1,Lily,23
2,Rose,23
3,Alice,24
4,Julia,25%
How do you select a primary key with this kind of JSON?
atwong@Albert-CelerData auxjars % cat ~/Downloads/example2.json
{"name": "Beijing", "code": 2}
I also got news that we are in the process of refactoring the curl stream load. I would only support SQL operations for now.
> How do you select a primary key with this kind of CSV?

Sling will add dummy names (`col_01`, `col_02`, ...), so use those. You need to load with `source.options.header = false`.
> How do you select a primary key with this kind of JSON?

Use `source.options.flatten = true`, and specify one of the keys as the PK.
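As a sanity check on the JSON side, the candidate key columns can be previewed with jq (illustrative only; with `flatten = true` each top-level key becomes a column, so any of them can serve as the PK):

```shell
# List the top-level keys of the JSON record, in document order
echo '{"name": "Beijing", "code": 2}' | jq -r 'keys_unsorted[]'
# → name
# → code
```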
> I also got news that we are in the process of refactoring the curl stream load. I would only support SQL operations for now.

Oh, will there be an alternative to bulk load via HTTP (FE)? Or will `INSERT VALUES` performance be improved for large datasets?
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"source.option.header":false}' --primary-key col_01
10:02AM INF connecting to target database (starrocks)
10:02AM INF reading from source file system (file)
10:02AM INF writing to target database [mode: full-refresh]
10:02AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center2_tmp`
~ Could not generate DDL for `albert`.`call_center2_tmp`
did not provide primary key for creating StarRocks table
Try: `--src-options '{"header": false}'`. Also, maybe using the YAML config would be clearer than CLI flags?
Also, it's `col_001`, `col_002`... You can also just test with `--stdout`.
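The zero-padding is what tripped up `col_01` above. A quick shell check of the three-digit naming scheme (the padding width is taken from the `col_001` names sling generated, not from sling's source):

```shell
# printf repeats its format once per argument, zero-padding to 3 digits
printf 'col_%03d\n' 1 2 3
# → col_001
# → col_002
# → col_003
```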
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"header":false}' --primary-key col_001
11:24AM INF connecting to target database (starrocks)
11:24AM INF reading from source file system (file)
11:24AM INF writing to target database [mode: full-refresh]
11:24AM INF streaming data
11:24AM WRN Using INSERT mode which is meant for small datasets. Please set the `fe_url` for loading large datasets via Stream Load mode. See https://docs.slingdata.io/connections/database-connections/starrocks
11:24AM INF dropped table `albert`.`call_center2`
11:24AM INF created table `albert`.`call_center2`
11:24AM INF inserted 4 rows into `albert`.`call_center2` in 0 secs [7 r/s]
11:24AM INF execution succeeded
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"header":false}' --primary-key col_01
11:24AM INF connecting to target database (starrocks)
11:24AM INF reading from source file system (file)
11:24AM INF writing to target database [mode: full-refresh]
11:24AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center2_tmp`
~ Could not generate DDL for `albert`.`call_center2_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center3' --mode full-refresh --src-options '{"header":false}' --primary-key col_01
11:25AM INF connecting to target database (starrocks)
11:25AM INF reading from source file system (file)
11:25AM INF writing to target database [mode: full-refresh]
11:25AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center3_tmp`
~ Could not generate DDL for `albert`.`call_center3_tmp`
did not provide primary key for creating StarRocks table
StarRocks > select * from call_center2
-> ;
+---------+---------+---------+------------------+
| col_001 | col_002 | col_003 | _sling_loaded_at |
+---------+---------+---------+------------------+
| 1 | Lily | 23 | 1707247462 |
| 2 | Rose | 23 | 1707247462 |
| 4 | Julia | 25 | 1707247462 |
| 3 | Alice | 24 | 1707247462 |
+---------+---------+---------+------------------+
4 rows in set (0.02 sec)
Interesting... `name` and `code` are in one order in the JSON, but the order changes by the time we get to the table create. Is there a way to reorder the columns in the CREATE TABLE? If not, I can see if I can make changes on the DB side to allow reordering columns in the table create. Is there also a debug mode so I can see the SQL being executed?
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example2.json' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center4' --mode full-refresh --src-options '{"flatten":true}' --primary-key name
11:29AM INF connecting to target database (starrocks)
11:29AM INF reading from source file system (file)
11:29AM INF writing to target database [mode: full-refresh]
11:29AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center4_tmp`
~ Error creating table `albert`.`call_center4_tmp`
~ Could not execute SQL
~ Error executing [tx: false] create table if not exists `***`.`call_center4_tmp` (`code` decimal(24,9),
`name` string,
`_sling_loaded_at` integer) duplicate key(`name`) distributed by hash(`name`)
Error 1064: Getting analyzing error. Detail message: Key columns must be the first few columns of the schema and the order of the key columns must be consistent with the order of the schema.
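The error above is StarRocks enforcing that key columns lead the schema. A DDL with `name` moved to the front, as in this sketch, would satisfy that constraint:

```sql
-- Hypothetical corrected DDL: the duplicate key column must be
-- the first column in the schema, in the same order as the key clause.
create table if not exists `albert`.`call_center4_tmp` (
  `name` string,
  `code` decimal(24,9),
  `_sling_loaded_at` integer
) duplicate key(`name`) distributed by hash(`name`);
```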
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example2.json' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center4' --mode full-refresh --src-options '{"flatten":true}' --primary-key code
11:30AM INF connecting to target database (starrocks)
11:30AM INF reading from source file system (file)
11:30AM INF writing to target database [mode: full-refresh]
11:30AM INF streaming data
11:30AM WRN Using INSERT mode which is meant for small datasets. Please set the `fe_url` for loading large datasets via Stream Load mode. See https://docs.slingdata.io/connections/database-connections/starrocks
11:30AM INF dropped table `albert`.`call_center4`
11:30AM INF created table `albert`.`call_center4`
11:30AM INF inserted 1 rows into `albert`.`call_center4` in 0 secs [2 r/s]
11:30AM INF execution succeeded
Oh, will there be an alternative to bulk loading via HTTP (FE)? Or will INSERT VALUES performance be improved for large datasets?
I think we're going to consolidate to duckdb's way of doing files(), and work on performance after that.
On create table, allow users to pick a key from any of the columns instead of picking the first entry in the list of columns. https://github.com/StarRocks/starrocks/issues/40925
To debug, add the `-d` flag.
.
I tried 2 CSV files, one with a header and one without; they both worked fine for me.
Here are the commands I used:
./sling run --src-stream file://./tests/files/test1.csv --tgt-conn starrocks --tgt-object public.test1 --primary-key id -d
./sling run --src-stream file://./tests/files/test2.csv --tgt-conn starrocks --tgt-object public.test2 --primary-key col_001 --src-options '{header: false}' -d
I didn't hit any issues with CSV. The issue is picking the primary key for JSON import.
Cool. Do you foresee any other issues? If not, I'll merge soon; it looks like it's working.
Is there any way to force the column order on your side while I work on this StarRocks enhancement? Other than that, LGTM.
If the source is a file, then no; it will maintain the file's column order, and JSON flatten order is non-deterministic (it follows the order in which keys are encountered). If the source is a database, you can use custom SQL to reorder the columns; other than that, no.
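As a sketch of the custom-SQL workaround for database sources (the source connection name and query here are hypothetical):

```shell
# Hypothetical: when the source is a database, passing a SQL query as
# the stream controls column order, so the primary-key column can be
# listed first and end up first in the generated StarRocks DDL.
sling run \
  --src-conn SOME_SOURCE_DB \
  --src-stream 'select name, code from some_schema.some_table' \
  --tgt-conn STARROCKSLOCAL \
  --tgt-object 'albert.call_center4' \
  --mode full-refresh \
  --primary-key name
```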
ref: https://docs.dagster.io/integrations/embedded-elt