StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

Support Sling (https://slingdata.io/), the embedded ELT tool for Dagster #40481

Closed · alberttwong closed this 7 months ago

alberttwong commented 8 months ago

ref: https://docs.dagster.io/integrations/embedded-elt

atwong@Albert-CelerData ~ % mysql -P 9030 -h hw2aq4jfy.cloud-app.celerdata.com -u admin -p --prompt="StarRocks > "
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1775
Server version: 5.1.0 3.1.3-ee-438bc37

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

StarRocks > select * from albert;
ERROR 1064 (HY000): Getting analyzing error. Detail message: No database selected.
StarRocks > use albert;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
StarRocks > select * from employees;
+------+--------------+------+------------+
| id   | name         | age  | department |
+------+--------------+------+------------+
|    1 | Peter Parker |   19 | IT         |
|    3 | Tony Stark   |   45 | Finance    |
|    2 | Bruce Wayne  |   29 | Automobile |
|    4 | Clark Kent   |   50 | Media      |
+------+--------------+------+------------+
4 rows in set (0.07 sec)

StarRocks > exit
Bye
atwong@Albert-CelerData ~ % sling conns set MYSQL url=mysql://admin:@hw2aq4jfy.cloud-app.celerdata.com:9030/albert
11:39AM INF connection `MYSQL` has been set in /Users/atwong/.sling/env.yaml. Please test with `sling conns test MYSQL`
atwong@Albert-CelerData ~ % sling conns test MYSQL
11:40AM INF success!
atwong@Albert-CelerData ~ % sling conns discover MYSQL
11:40AM INF Found 0 streams:
atwong@Albert-CelerData ~ % sling run --src-conn MYSQL --src-stream 'albert.employees' --stdout
11:43AM INF connecting to source database (mysql)
11:43AM INF reading from source database
11:43AM INF writing to target stream (stdout)
id,name,age,department
2,Bruce Wayne,29,Automobile
4,Clark Kent,50,Media
1,Peter Parker,19,IT
3,Tony Stark,45,Finance
11:43AM INF wrote 4 rows [15 r/s] to stdout
11:43AM INF execution succeeded
flarco commented 8 months ago

Cool! The discover function uses this query: https://github.com/flarco/dbio/blob/main/database/templates/mysql.yaml#L95-L122. Do you think anything needs adjustment? I will create a new template for StarRocks.

alberttwong commented 8 months ago
atwong@Albert-CelerData tpcds-parquet % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn MYSQL --tgt-object 'albert.callcenter' --mode full-refresh
12:08PM INF connecting to target database (mysql)
12:08PM INF reading from source file system (file)
12:08PM INF writing to target database [mode: full-refresh]
12:08PM INF streaming data
12:08PM INF execution failed

fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
context canceled

~ insertBatch
context canceled

~ could not prepare Tx: INSERT INTO `albert`.`callcenter_tmp` (`cc_call_center_sk`, `cc_call_center_id`, `cc_rec_start_date`, `cc_rec_end_date`, `cc_closed_date_sk`, `cc_open_date_sk`, `cc_name`, `cc_class`, `cc_employees`, `cc_sq_ft`, `cc_hours`, `cc_manager`, `cc_mkt_id`, `cc_mkt_class`, `cc_mkt_desc`, `cc_market_manager`, `cc_division`, `cc_division_name`, `cc_company`, `cc_company_name`, `cc_street_number`, `cc_street_name`, `cc_street_type`, `cc_suite_number`, `cc_city`, `cc_county`, `cc_state`, `cc_zip`, `cc_country`, `cc_gmt_offset`, `cc_tax_percentage`, `_sling_loaded_at`) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Error 1064: Unsupported command(COM_STMT_PREPARE)

~ could not insert into `albert`.`callcenter_tmp`.
context canceled

~ insertBatch
context canceled

~ could not prepare Tx: INSERT INTO `albert`.`callcenter_tmp` (`cc_call_center_sk`, `cc_call_center_id`, `cc_rec_start_date`, `cc_rec_end_date`, `cc_closed_date_sk`, `cc_open_date_sk`, `cc_name`, `cc_class`, `cc_employees`, `cc_sq_ft`, `cc_hours`, `cc_manager`, `cc_mkt_id`, `cc_mkt_class`, `cc_mkt_desc`, `cc_market_manager`, `cc_division`, `cc_division_name`, `cc_company`, `cc_company_name`, `cc_street_number`, `cc_street_name`, `cc_street_type`, `cc_suite_number`, `cc_city`, `cc_county`, `cc_state`, `cc_zip`, `cc_country`, `cc_gmt_offset`, `cc_tax_percentage`, `_sling_loaded_at`) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
Error 1064: Unsupported command(COM_STMT_PREPARE)
flarco commented 8 months ago

Cool, I figured as much. I saw a COPY command in the docs, but I'm not sure how it works yet. What's the best way to bulk insert from a MySQL client?

alberttwong commented 8 months ago

What's the best way to bulk insert from a MySQL client?

INSERT INTO employees (id, name, age, department) VALUES (1, 'Peter Parker', 19, 'IT'), (2, 'Bruce Wayne', 29, 'Automobile'), (3, 'Tony Stark', 45, 'Finance'), (4, 'Clark Kent', 50, 'Media');

alberttwong commented 8 months ago

The issue is that we emulate the MySQL wire protocol. We have three different table types (duplicate key, aggregate, and primary key), so the SELECT is fine, but table creation is going to be a problem. We need something like this:

create database albert;
use albert;
CREATE TABLE employees (
  id INT NOT NULL,
  name VARCHAR(50),
  age INT,
  department VARCHAR(50)
) PRIMARY KEY (id) DISTRIBUTED BY HASH(id);
alberttwong commented 8 months ago

The discover function uses this query: https://github.com/flarco/dbio/blob/main/database/templates/mysql.yaml#L95-L122

I don't know if it's expected, but select table_catalog from information_schema.columns; returns NULL on my end, so I had to remove it from your query. I logged another GitHub issue on our side: https://github.com/StarRocks/starrocks/issues/40482

      with tables as (
      select
        table_catalog,
        table_schema,
        table_name,
        case table_type
          when 'VIEW' then true
          else false
        end as is_view
      from information_schema.tables
       where table_schema = 'tpcds' and table_name = 'call_center'
    )
    select
      cols.table_schema as schema_name,
      cols.table_name as table_name,
      tables.is_view as is_view,
      cols.column_name as column_name,
      cols.data_type as data_type,
      cols.ordinal_position as position
    from information_schema.columns cols
    join tables
      on  tables.table_schema = cols.table_schema
      and tables.table_name = cols.table_name
    order by cols.table_catalog, cols.table_schema, cols.table_name, cols.ordinal_position;
alberttwong commented 8 months ago

I think this should work. Please confirm that discover is supposed to output this.

StarRocks >       with tables as (
    ->       select
    ->         table_catalog,
    ->         table_schema,
    ->         table_name,
    ->         case table_type
    ->           when 'VIEW' then true
    ->           else false
    ->         end as is_view
    ->       from information_schema.tables
    ->        where table_schema = 'tpcds' and table_name = 'call_center'
    ->     )
    ->     select
    ->       cols.table_schema as schema_name,
    ->       cols.table_name as table_name,
    ->       tables.is_view as is_view,
    ->       cols.column_name as column_name,
    ->       cols.data_type as data_type,
    ->       cols.ordinal_position as position
    ->     from information_schema.columns cols
    ->     join tables
    ->       on  tables.table_schema = cols.table_schema
    ->       and tables.table_name = cols.table_name
    ->     order by cols.table_catalog, cols.table_schema, cols.table_name, cols.ordinal_position;
+-------------+-------------+---------+-------------------+-----------+----------+
| schema_name | table_name  | is_view | column_name       | data_type | position |
+-------------+-------------+---------+-------------------+-----------+----------+
| tpcds       | call_center |       0 | cc_call_center_sk | int       |        1 |
| tpcds       | call_center |       0 | cc_call_center_id | varchar   |        2 |
| tpcds       | call_center |       0 | cc_rec_start_date | date      |        3 |
| tpcds       | call_center |       0 | cc_rec_end_date   | date      |        4 |
| tpcds       | call_center |       0 | cc_closed_date_sk | int       |        5 |
| tpcds       | call_center |       0 | cc_open_date_sk   | int       |        6 |
| tpcds       | call_center |       0 | cc_name           | varchar   |        7 |
| tpcds       | call_center |       0 | cc_class          | varchar   |        8 |
| tpcds       | call_center |       0 | cc_employees      | int       |        9 |
| tpcds       | call_center |       0 | cc_sq_ft          | int       |       10 |
| tpcds       | call_center |       0 | cc_hours          | varchar   |       11 |
| tpcds       | call_center |       0 | cc_manager        | varchar   |       12 |
| tpcds       | call_center |       0 | cc_mkt_id         | int       |       13 |
| tpcds       | call_center |       0 | cc_mkt_class      | varchar   |       14 |
| tpcds       | call_center |       0 | cc_mkt_desc       | varchar   |       15 |
| tpcds       | call_center |       0 | cc_market_manager | varchar   |       16 |
| tpcds       | call_center |       0 | cc_division       | int       |       17 |
| tpcds       | call_center |       0 | cc_division_name  | varchar   |       18 |
| tpcds       | call_center |       0 | cc_company        | int       |       19 |
| tpcds       | call_center |       0 | cc_company_name   | varchar   |       20 |
| tpcds       | call_center |       0 | cc_street_number  | varchar   |       21 |
| tpcds       | call_center |       0 | cc_street_name    | varchar   |       22 |
| tpcds       | call_center |       0 | cc_street_type    | varchar   |       23 |
| tpcds       | call_center |       0 | cc_suite_number   | varchar   |       24 |
| tpcds       | call_center |       0 | cc_city           | varchar   |       25 |
| tpcds       | call_center |       0 | cc_county         | varchar   |       26 |
| tpcds       | call_center |       0 | cc_state          | varchar   |       27 |
| tpcds       | call_center |       0 | cc_zip            | varchar   |       28 |
| tpcds       | call_center |       0 | cc_country        | varchar   |       29 |
| tpcds       | call_center |       0 | cc_gmt_offset     | decimal   |       30 |
| tpcds       | call_center |       0 | cc_tax_percentage | decimal   |       31 |
+-------------+-------------+---------+-------------------+-----------+----------+
31 rows in set (0.08 sec)
flarco commented 8 months ago

I think this should work. Please confirm that discover is supposed to output this.

Yes looks good, thanks.

I'll take it from here. The DDL should be fine; I just need to figure out how to push the primary key through. A question about that: is the PRIMARY KEY (id) DISTRIBUTED BY HASH(id) part mandatory? I'm not sure what the best practice is.

alberttwong commented 8 months ago

A question about that: is the PRIMARY KEY (id) DISTRIBUTED BY HASH(id) part mandatory? I'm not sure what the best practice is.

The table can be one of several types. For loading data into a temp table, I would recommend a primary key table for row uniqueness; the rows can then be inserted into the final table of whichever type. Also, I happened to pick id as the primary key here, but it can be a combination of columns. (A primary key table sketch follows the two examples below.)

CREATE TABLE IF NOT EXISTS detail (
    event_time DATETIME NOT NULL COMMENT "datetime of event",
    event_type INT NOT NULL COMMENT "type of event",
    user_id INT COMMENT "id of user",
    device_code INT COMMENT "device code",
    channel INT COMMENT ""
)
DUPLICATE KEY(event_time, event_type)
DISTRIBUTED BY HASH(user_id);
CREATE TABLE IF NOT EXISTS example_db.aggregate_tbl (
    site_id LARGEINT NOT NULL COMMENT "id of site",
    date DATE NOT NULL COMMENT "time of event",
    city_code VARCHAR(20) COMMENT "city_code of user",
    pv BIGINT SUM DEFAULT "0" COMMENT "total page views"
)
AGGREGATE KEY(site_id, date, city_code)
DISTRIBUTED BY HASH(site_id)
PROPERTIES (
"replication_num" = "3"
);
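
For comparison, here is a minimal primary key table in the same style. This is a sketch for illustration only: the table and column names are made up, and note that StarRocks requires primary key columns to come first in the schema and be declared NOT NULL.

CREATE TABLE IF NOT EXISTS example_db.primary_tbl (
    id BIGINT NOT NULL COMMENT "row id",
    name VARCHAR(50) COMMENT "name of user",
    pv BIGINT DEFAULT "0" COMMENT "total page views"
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);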
flarco commented 8 months ago

@alberttwong trying to stand up a docker instance, following the steps here: https://docs.starrocks.io/docs/quick_start/deploy_with_docker/

It doesn't seem to be working... any advice? See the log output below. It keeps printing WARN FE service query port:9030 is NOT alive yet!

Here is my config in my docker-compose.yaml:

  starrocks:
    image: starrocks/allin1-ubuntu
    restart: unless-stopped
    hostname: devbox-starrocks
    container_name: devbox-starrocks
    environment: {}
    ports:
      - "9030:9030"
      - "8030:8030"
      - "8040:8040"

I also tried with docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu, same result.

larco@mpc:/__/devbox$ dcp up starrocks
Recreating devbox-starrocks ... done
Attaching to devbox-starrocks
devbox-starrocks | ****************************************************************************************************
devbox-starrocks | *   ________  _________  ________  ________  ________  ________  ________  ___  __    ________      
devbox-starrocks | *  |\   ____\|\___   ___\\   __  \|\   __  \|\   __  \|\   __  \|\   ____\|\  \|\  \ |\   ____\     
devbox-starrocks | *  \ \  \___|\|___ \  \_\ \  \|\  \ \  \|\  \ \  \|\  \ \  \|\  \ \  \___|\ \  \/  /|\ \  \___|_    
devbox-starrocks | *   \ \_____  \   \ \  \ \ \   __  \ \   _  _\ \   _  _\ \  \\\  \ \  \    \ \   ___  \ \_____  \   
devbox-starrocks | *    \|____|\  \   \ \  \ \ \  \ \  \ \  \\  \\ \  \\  \\ \  \\\  \ \  \____\ \  \\ \  \|____|\  \  
devbox-starrocks | *      ____\_\  \   \ \__\ \ \__\ \__\ \__\\ _\\ \__\\ _\\ \_______\ \_______\ \__\\ \__\____\_\  \ 
devbox-starrocks | *     |\_________\   \|__|  \|__|\|__|\|__|\|__|\|__|\|__|\|_______|\|_______|\|__| \|__|\_________\
devbox-starrocks | *     \|_________|                                                                      \|_________|
devbox-starrocks | *
devbox-starrocks | *                      ________  ___       ___       ___  ________     _____
devbox-starrocks | *                     |\   __  \|\  \     |\  \     |\  \|\   ___  \  / __  \
devbox-starrocks | *                     \ \  \|\  \ \  \    \ \  \    \ \  \ \  \\ \  \|\/_|\  \
devbox-starrocks | *                      \ \   __  \ \  \    \ \  \    \ \  \ \  \\ \  \|/ \ \  \
devbox-starrocks | *                       \ \  \ \  \ \  \____\ \  \____\ \  \ \  \\ \  \   \ \  \
devbox-starrocks | *                        \ \__\ \__\ \_______\ \_______\ \__\ \__\\ \__\   \ \__\
devbox-starrocks | *                         \|__|\|__|\|_______|\|_______|\|__|\|__| \|__|    \|__|
devbox-starrocks | *
devbox-starrocks | ****************************************************************************************************
devbox-starrocks | 
devbox-starrocks | StarRocks [(Blazing Fast)]> _
devbox-starrocks | 
devbox-starrocks | 
devbox-starrocks | 2024-02-03 12:15:36,052 INFO Set uid to user 0 succeeded
devbox-starrocks | 2024-02-03 12:15:36,055 INFO RPC interface 'supervisor' initialized
devbox-starrocks | 2024-02-03 12:15:36,055 CRIT Server 'unix_http_server' running without any HTTP authentication checking
devbox-starrocks | 2024-02-03 12:15:36,055 INFO supervisord started with pid 6
devbox-starrocks | 2024-02-03 12:15:37,058 INFO spawned: 'beservice' with pid 524
devbox-starrocks | 2024-02-03 12:15:37,059 INFO spawned: 'broker' with pid 525
devbox-starrocks | 2024-02-03 12:15:37,061 INFO spawned: 'director' with pid 526
devbox-starrocks | 2024-02-03 12:15:37,062 INFO spawned: 'feproxy' with pid 527
devbox-starrocks | 2024-02-03 12:15:37,064 INFO spawned: 'feservice' with pid 530
devbox-starrocks | 2024-02-03 12:15:37+00:00 INFO checking if need to perform auto registring Backend and Broker ...
devbox-starrocks | 2024-02-03 12:15:37,205 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:37+00:00 INFO checking if FE service query port:9030 alive or not ...
devbox-starrocks | 2024-02-03 12:15:37+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:37,759 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:38,764 INFO spawned: 'broker' with pid 2186
devbox-starrocks | 2024-02-03 12:15:38,766 INFO spawned: 'feservice' with pid 2187
devbox-starrocks | 2024-02-03 12:15:38,767 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:38,871 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:39+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:39,343 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:40,349 INFO spawned: 'beservice' with pid 2860
devbox-starrocks | 2024-02-03 12:15:41+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:41,349 INFO spawned: 'broker' with pid 3324
devbox-starrocks | 2024-02-03 12:15:41,350 INFO spawned: 'feservice' with pid 3325
devbox-starrocks | 2024-02-03 12:15:41,351 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:41,470 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO success: director entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO success: feproxy entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
devbox-starrocks | 2024-02-03 12:15:42,705 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:43+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:43,361 INFO spawned: 'beservice' with pid 3998
devbox-starrocks | 2024-02-03 12:15:44,605 INFO spawned: 'broker' with pid 4459
devbox-starrocks | 2024-02-03 12:15:44,606 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:44,707 INFO exited: broker (exit status 0; not expected)
devbox-starrocks | 2024-02-03 12:15:45+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:45,368 INFO gave up: broker entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:46,373 INFO spawned: 'feservice' with pid 4603
devbox-starrocks | 2024-02-03 12:15:47+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:47,373 INFO exited: feservice (exit status 1; not expected)
devbox-starrocks | 2024-02-03 12:15:48,378 INFO spawned: 'beservice' with pid 5136
devbox-starrocks | 2024-02-03 12:15:48,379 INFO gave up: feservice entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:49+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:49,383 INFO exited: beservice (exit status 127; not expected)
devbox-starrocks | 2024-02-03 12:15:50,385 INFO gave up: beservice entered FATAL state, too many start retries too quickly
devbox-starrocks | 2024-02-03 12:15:51+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:53+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:55+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:57+00:00 WARN FE service query port:9030 is NOT alive yet!
devbox-starrocks | 2024-02-03 12:15:59+00:00 WARN FE service query port:9030 is NOT alive yet!
^CGracefully stopping... (press Ctrl+C again to force)
Stopping devbox-starrocks   ... done
alberttwong commented 8 months ago

@flarco did you try just docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 -itd starrocks/allin1-ubuntu?

atwong@Albert-CelerData ~ % docker pull starrocks/allin1-ubuntu
Using default tag: latest
latest: Pulling from starrocks/allin1-ubuntu
Digest: sha256:cc3ef33b6a812ae256027f15fb56764ecdc2eb378b986cf2cf367292504442e8
Status: Image is up to date for starrocks/allin1-ubuntu:latest
docker.io/starrocks/allin1-ubuntu:latest

What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview starrocks/allin1-ubuntu
atwong@Albert-CelerData ~ % docker run --rm -p 9030:9030 -p 8030:8030 -p 8040:8040 -it starrocks/allin1-ubuntu
****************************************************************************************************
*   ________  _________  ________  ________  ________  ________  ________  ___  __    ________
*  |\   ____\|\___   ___\\   __  \|\   __  \|\   __  \|\   __  \|\   ____\|\  \|\  \ |\   ____\
*  \ \  \___|\|___ \  \_\ \  \|\  \ \  \|\  \ \  \|\  \ \  \|\  \ \  \___|\ \  \/  /|\ \  \___|_
*   \ \_____  \   \ \  \ \ \   __  \ \   _  _\ \   _  _\ \  \\\  \ \  \    \ \   ___  \ \_____  \
*    \|____|\  \   \ \  \ \ \  \ \  \ \  \\  \\ \  \\  \\ \  \\\  \ \  \____\ \  \\ \  \|____|\  \
*      ____\_\  \   \ \__\ \ \__\ \__\ \__\\ _\\ \__\\ _\\ \_______\ \_______\ \__\\ \__\____\_\  \
*     |\_________\   \|__|  \|__|\|__|\|__|\|__|\|__|\|__|\|_______|\|_______|\|__| \|__|\_________\
*     \|_________|                                                                      \|_________|
*
*                      ________  ___       ___       ___  ________     _____
*                     |\   __  \|\  \     |\  \     |\  \|\   ___  \  / __  \
*                     \ \  \|\  \ \  \    \ \  \    \ \  \ \  \\ \  \|\/_|\  \
*                      \ \   __  \ \  \    \ \  \    \ \  \ \  \\ \  \|/ \ \  \
*                       \ \  \ \  \ \  \____\ \  \____\ \  \ \  \\ \  \   \ \  \
*                        \ \__\ \__\ \_______\ \_______\ \__\ \__\\ \__\   \ \__\
*                         \|__|\|__|\|_______|\|_______|\|__|\|__| \|__|    \|__|
*
****************************************************************************************************

StarRocks [(Blazing Fast)]> _

2024-02-03 14:40:02,805 INFO Set uid to user 0 succeeded
2024-02-03 14:40:02,806 INFO RPC interface 'supervisor' initialized
2024-02-03 14:40:02,806 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-02-03 14:40:02,806 INFO supervisord started with pid 7
2024-02-03 14:40:03,809 INFO spawned: 'beservice' with pid 525
2024-02-03 14:40:03,809 INFO spawned: 'broker' with pid 526
2024-02-03 14:40:03,810 INFO spawned: 'director' with pid 527
2024-02-03 14:40:03,811 INFO spawned: 'feproxy' with pid 528
2024-02-03 14:40:03,812 INFO spawned: 'feservice' with pid 529
2024-02-03 14:40:03+00:00 INFO checking if need to perform auto registring Backend and Broker ...
2024-02-03 14:40:03+00:00 INFO checking if FE service query port:9030 alive or not ...
2024-02-03 14:40:04+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:06+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:08+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:09,021 INFO success: beservice entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: broker entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: director entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: feproxy entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:09,022 INFO success: feservice entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2024-02-03 14:40:10+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:12+00:00 WARN FE service query port:9030 is NOT alive yet!
2024-02-03 14:40:14+00:00 INFO FE service query port:9030 is alive!
2024-02-03 14:40:14+00:00 INFO generate my.cnf file ...
2024-02-03 14:40:14+00:00 INFO check if need to add BE into FE service ...
2024-02-03 14:40:14+00:00 INFO Add BE(127.0.0.1:9050) into FE service ...

2024-02-03 14:40:14+00:00 INFO BE service already added into FE service ...
2024-02-03 14:40:14+00:00 INFO check if need to add BROKER into FE service ...
2024-02-03 14:40:14+00:00 INFO Add BROKER(127.0.0.1:8000) into FE service ...

2024-02-03 14:40:14+00:00 INFO broker service already added into FE service ...
2024-02-03 14:40:14+00:00 INFO cluster initialization DONE!
2024-02-03 14:40:14+00:00 INFO wait a few seconds for BE and Broker's heartbeat ...
2024-02-03 14:40:24+00:00 INFO StarRocks Cluster information details:
*************************** 1. row ***************************
             Name: 127.0.0.1_9010_1706971205423
               IP: 127.0.0.1
      EditLogPort: 9010
         HttpPort: 8030
        QueryPort: 9030
          RpcPort: 9020
             Role: LEADER
        ClusterId: 1851562636
             Join: true
            Alive: true
ReplayedJournalId: 16
    LastHeartbeat: 2024-02-03 14:40:26
         IsHelper: true
           ErrMsg:
        StartTime: 2024-02-03 14:40:12
          Version: 3.2.2-269e832
*************************** 1. row ***************************
            BackendId: 10004
                   IP: 127.0.0.1
        HeartbeatPort: 9050
               BePort: 9060
             HttpPort: 8040
             BrpcPort: 8060
        LastStartTime: 2024-02-03 14:40:16
        LastHeartbeat: 2024-02-03 14:40:26
                Alive: true
 SystemDecommissioned: false
ClusterDecommissioned: false
            TabletNum: 0
     DataUsedCapacity: 0.000 B
        AvailCapacity: 57.645 GB
        TotalCapacity: 117.028 GB
              UsedPct: 50.74 %
       MaxDiskUsedPct: 50.74 %
               ErrMsg:
              Version: 3.2.2-269e832
               Status: {"lastSuccessReportTabletsTime":"2024-02-03 14:40:16"}
    DataTotalCapacity: 57.645 GB
          DataUsedPct: 0.00 %
             CpuCores: 10
    NumRunningQueries: 0
           MemUsedPct: 1.85 %
           CpuUsedPct: 0.0 %
*************************** 1. row ***************************
          Name: allin1broker
            IP: 127.0.0.1
          Port: 8000
         Alive: true
 LastStartTime: 2024-02-03 14:40:16
LastUpdateTime: 2024-02-03 14:40:26
        ErrMsg:
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO  * IMPORTANT NOTICE!

If FE/BE state needs to be persisted, please be sure the following directories are mounted:
* FE service meta: /data/deploy/starrocks/fe/meta
* BE service storage: /data/deploy/starrocks/be/storage

2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO FE mysql query port: 9030
2024-02-03 14:40:26+00:00 INFO FE http service port: 8080
2024-02-03 14:40:26+00:00 INFO
2024-02-03 14:40:26+00:00 INFO Enjoy the journal to StarRocks blazing-fast lake-house engine!
flarco commented 8 months ago

I got it to work on my other server, thanks. Could you build the binary yourself and test with this branch: https://github.com/slingdata-io/sling-cli/pull/137? Here is the bulk of the new code: https://github.com/slingdata-io/sling-cli/blob/v1.1.0/core/dbio/database/database_starrocks.go. Per the docs, using INSERT VALUES is not recommended for large datasets, so sling will accept a fe_url property and will attempt to load through that instead (see here for details). If no fe_url is provided, it will fall back to INSERT VALUES. See the sling docs here: https://docs.slingdata.io/connections/database-connections/starrocks

alberttwong commented 7 months ago
atwong@Albert-CelerData Downloads % sling conns set STARROCKSLOCAL url=starrocks://root:@localhost:9030/albert
11:55AM INF connection `STARROCKSLOCAL` has been set in /Users/atwong/.sling/env.yaml. Please test with `sling conns test STARROCKSLOCAL`
atwong@Albert-CelerData Downloads % sling conns test STARROCKSLOCAL
11:55AM INF success!
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.callcenter' --mode full-refresh
11:55AM INF connecting to target database (starrocks)
11:55AM INF reading from source file system (file)
11:55AM INF writing to target database [mode: full-refresh]
11:55AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`callcenter_tmp`
~ Could not generate DDL for `albert`.`callcenter_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.callcenter' --mode full-refresh
12:01PM INF connecting to target database (starrocks)
12:01PM INF reading from source file system (file)
12:01PM INF writing to target database [mode: full-refresh]
12:01PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`callcenter_tmp`
~ Could not generate DDL for `albert`.`callcenter_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData Downloads % cat ~/.sling/env.yaml
# Environment Credentials for Sling CLI

# See https://docs.slingdata.io/sling-cli/environment

connections:
  MYSQL:
    type: mysql
    url: mysql://admin:admin@hw2aq4jfy.cloud-app.celerdata.com:9030/albert

  MYSQLLOCAL:
    type: mysql
    url: mysql://root:@localhost:9030/albert

  STARROCKSLOCAL:
    type: starrocks
    url: starrocks://root:@localhost:9030/albert
    fe_url: http://localhost:8030/api/albert/albert/_stream_load

variables: {}
flarco commented 7 months ago

Yes, the error says did not provide primary key for creating StarRocks table. Did you provide a PK? :)

alberttwong commented 7 months ago

:-). How did you test? Also, I didn't see instructions in the docs on how to configure a PK.

StarRocks > CREATE TABLE call_center(cc_call_center_id VARCHAR, cc_rec_start_date DATE, cc_rec_end_date DATE, cc_closed_date_sk INTEGER, cc_open_date_sk INTEGER, cc_name VARCHAR, cc_class VARCHAR, cc_employees INTEGER, cc_sq_ft INTEGER, cc_hours VARCHAR, cc_manager VARCHAR, cc_mkt_id INTEGER, cc_mkt_class VARCHAR, cc_mkt_desc VARCHAR, cc_market_manager VARCHAR, cc_division INTEGER, cc_division_name VARCHAR, cc_company INTEGER, cc_company_name VARCHAR, cc_street_number VARCHAR, cc_street_name VARCHAR, cc_street_type VARCHAR, cc_suite_number VARCHAR, cc_city VARCHAR, cc_county VARCHAR, cc_state VARCHAR, cc_zip VARCHAR, cc_country VARCHAR, cc_gmt_offset DECIMAL(5,2), cc_tax_percentage DECIMAL(5,2)) PRIMARY KEY (cc_call_center_id) DISTRIBUTED BY HASH(cc_call_center_id);
Query OK, 0 rows affected (0.01 sec)

StarRocks > CREATE TABLE call_center_tmp(cc_call_center_id VARCHAR, cc_rec_start_date DATE, cc_rec_end_date DATE, cc_closed_date_sk INTEGER, cc_open_date_sk INTEGER, cc_name VARCHAR, cc_class VARCHAR, cc_employees INTEGER, cc_sq_ft INTEGER, cc_hours VARCHAR, cc_manager VARCHAR, cc_mkt_id INTEGER, cc_mkt_class VARCHAR, cc_mkt_desc VARCHAR, cc_market_manager VARCHAR, cc_division INTEGER, cc_division_name VARCHAR, cc_company INTEGER, cc_company_name VARCHAR, cc_street_number VARCHAR, cc_street_name VARCHAR, cc_street_type VARCHAR, cc_suite_number VARCHAR, cc_city VARCHAR, cc_county VARCHAR, cc_state VARCHAR, cc_zip VARCHAR, cc_country VARCHAR, cc_gmt_offset DECIMAL(5,2), cc_tax_percentage DECIMAL(5,2)) PRIMARY KEY (cc_call_center_id) DISTRIBUTED BY HASH(cc_call_center_id);
Query OK, 0 rows affected (0.01 sec)
atwong@Albert-CelerData Downloads % sling run --src-stream 'file:///Users/atwong/sandbox/tpcds-parquet/call_center.parquet' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center' --mode full-refresh
12:53PM INF connecting to target database (starrocks)
12:53PM INF reading from source file system (file)
12:53PM INF writing to target database [mode: full-refresh]
12:53PM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center_tmp`
~ Could not generate DDL for `albert`.`call_center_tmp`
did not provide primary key for creating StarRocks table
flarco commented 7 months ago

You have to provide it in your flags, like this: --mode incremental --primary-key 'col1,col2' --update-key 'last_modified_dt'. See examples: https://docs.slingdata.io/sling-cli/run/configuration#cli-flags

alberttwong commented 7 months ago

How do you select a primary key with this kind of CSV?

atwong@Albert-CelerData auxjars % cat ~/Downloads/example1.csv
1,Lily,23
2,Rose,23
3,Alice,24
4,Julia,25%

How do you select a primary key with this kind of JSON?

atwong@Albert-CelerData auxjars % cat ~/Downloads/example2.json
{"name": "Beijing", "code": 2}
alberttwong commented 7 months ago

I also got news that we are in the process of refactoring the curl Stream Load. I would only support SQL operations for now.

flarco commented 7 months ago

How do you select a primary key with this kind of CSV?

Sling will add dummy names (col_01, col_02), so use those. You need to load with source.options.header = false.

How do you select a primary key with this kind of JSON?

Use source.options.flatten = true, and specify one of the keys as PK.

flarco commented 7 months ago

I also got news that we are in the process of refactoring the curl Stream Load. I would only support SQL operations for now.

Oh, will there be an alternative for bulk loading via HTTP (FE)? Or will INSERT VALUES performance be improved for large datasets?

alberttwong commented 7 months ago
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"source.option.header":false}' --primary-key col_01
10:02AM INF connecting to target database (starrocks)
10:02AM INF reading from source file system (file)
10:02AM INF writing to target database [mode: full-refresh]
10:02AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center2_tmp`
~ Could not generate DDL for `albert`.`call_center2_tmp`
did not provide primary key for creating StarRocks table
flarco commented 7 months ago

Try: --src-options '{"header": false}'. Also, maybe using the YAML config would be clearer than CLI flags?

flarco commented 7 months ago

Also, it's col_001, col_002... You can also just test with --stdout.

alberttwong commented 7 months ago
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"header":false}' --primary-key col_001
11:24AM INF connecting to target database (starrocks)
11:24AM INF reading from source file system (file)
11:24AM INF writing to target database [mode: full-refresh]
11:24AM INF streaming data
11:24AM WRN Using INSERT mode which is meant for small datasets. Please set the `fe_url` for loading large datasets via Stream Load mode. See https://docs.slingdata.io/connections/database-connections/starrocks
11:24AM INF dropped table `albert`.`call_center2`
11:24AM INF created table `albert`.`call_center2`
11:24AM INF inserted 4 rows into `albert`.`call_center2` in 0 secs [7 r/s]
11:24AM INF execution succeeded
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center2' --mode full-refresh --src-options '{"header":false}' --primary-key col_01
11:24AM INF connecting to target database (starrocks)
11:24AM INF reading from source file system (file)
11:24AM INF writing to target database [mode: full-refresh]
11:24AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center2_tmp`
~ Could not generate DDL for `albert`.`call_center2_tmp`
did not provide primary key for creating StarRocks table
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example1.csv' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center3' --mode full-refresh --src-options '{"header":false}' --primary-key col_01
11:25AM INF connecting to target database (starrocks)
11:25AM INF reading from source file system (file)
11:25AM INF writing to target database [mode: full-refresh]
11:25AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center3_tmp`
~ Could not generate DDL for `albert`.`call_center3_tmp`
did not provide primary key for creating StarRocks table
StarRocks > select * from call_center2
    -> ;
+---------+---------+---------+------------------+
| col_001 | col_002 | col_003 | _sling_loaded_at |
+---------+---------+---------+------------------+
|       1 | Lily    |      23 |       1707247462 |
|       2 | Rose    |      23 |       1707247462 |
|       4 | Julia   |      25 |       1707247462 |
|       3 | Alice   |      24 |       1707247462 |
+---------+---------+---------+------------------+
4 rows in set (0.02 sec)
alberttwong commented 7 months ago

Interesting... name and code appear in one order in the JSON, but the order is different by the time we get to table creation. Is there a way to control the column order in the CREATE TABLE? If not, I can see whether I can make changes on the DB side to allow reordering columns at table creation. Is there also a debug mode so I can see the SQL being executed?

atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example2.json' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center4' --mode full-refresh --src-options '{"flatten":true}' --primary-key name
11:29AM INF connecting to target database (starrocks)
11:29AM INF reading from source file system (file)
11:29AM INF writing to target database [mode: full-refresh]
11:29AM INF execution failed
fatal:
~
~ failure running task (see docs @ https://docs.slingdata.io/sling-cli)
~
~ execution failed
~ could not write to database
~ could not create temp table `albert`.`call_center4_tmp`
~ Error creating table `albert`.`call_center4_tmp`
~ Could not execute SQL
~ Error executing [tx: false] create table if not exists `***`.`call_center4_tmp` (`code` decimal(24,9),
`name` string,
`_sling_loaded_at` integer) duplicate key(`name`) distributed by hash(`name`)
Error 1064: Getting analyzing error. Detail message: Key columns must be the first few columns of the schema and the order  of the key columns must be consistent with the order of the schema.
atwong@Albert-CelerData auxjars % sling run --src-stream 'file:///Users/atwong/Downloads/example2.json' --tgt-conn STARROCKSLOCAL --tgt-object 'albert.call_center4' --mode full-refresh --src-options '{"flatten":true}' --primary-key code
11:30AM INF connecting to target database (starrocks)
11:30AM INF reading from source file system (file)
11:30AM INF writing to target database [mode: full-refresh]
11:30AM INF streaming data
11:30AM WRN Using INSERT mode which is meant for small datasets. Please set the `fe_url` for loading large datasets via Stream Load mode. See https://docs.slingdata.io/connections/database-connections/starrocks
11:30AM INF dropped table `albert`.`call_center4`
11:30AM INF created table `albert`.`call_center4`
11:30AM INF inserted 1 rows into `albert`.`call_center4` in 0 secs [2 r/s]
11:30AM INF execution succeeded
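
For reference, the analyzing error in the first run above is about column order: the key column has to be the first column in the schema. A hand-written sketch of what the failing temp-table DDL would need to look like, with the column names taken from the log above; this is illustrative only, not sling's actual output, and the name column is written as VARCHAR here since there may be type restrictions on key columns:

create table if not exists `albert`.`call_center4_tmp` (
  `name` varchar(65533),
  `code` decimal(24,9),
  `_sling_loaded_at` int
) duplicate key(`name`) distributed by hash(`name`);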
alberttwong commented 7 months ago

Oh, will there be an alternative for bulk loading via HTTP (FE)? Or will INSERT VALUES performance be improved for large datasets?

I think we're going to consolidate on DuckDB's way of doing files() and then work on performance after that.

alberttwong commented 7 months ago

On CREATE TABLE, allow users to pick a key from any of the columns instead of only the first entry in the column list: https://github.com/StarRocks/starrocks/issues/40925

flarco commented 7 months ago

To debug, add the -d flag.

I tried two CSV files, one with a header and one without; they both worked fine for me.

Here are the commands I used:

./sling run --src-stream file://./tests/files/test1.csv --tgt-conn starrocks --tgt-object public.test1 --primary-key id -d

./sling run --src-stream file://./tests/files/test2.csv --tgt-conn starrocks --tgt-object public.test2 --primary-key col_001 --src-options '{header: false}' -d

alberttwong commented 7 months ago

I tried two CSV files, one with a header and one without; they both worked fine for me.

I didn't get any issues with CSV. The remaining issue is picking the primary key for the JSON import.

flarco commented 7 months ago

Cool. Do you foresee any other issues? If not, I think I'll merge soon; it looks like it's working.

alberttwong commented 7 months ago

Is there any way to force the column order on your side while I work on this StarRocks enhancement? Other than that, LGTM.

flarco commented 7 months ago

If the source is a file, then no; it keeps the column order as read, and the JSON flatten order is non-deterministic (it follows the order in which the keys are encountered). If the source is a database, you can use custom SQL to reorder the columns. Other than that, no.