datafuselabs / databend

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com
Other
7.29k stars 701 forks source link

bug: faild to load gz tsv files into databend #15383

Open sundy-li opened 2 weeks ago

sundy-li commented 2 weeks ago

Summary

CREATE OR REPLACE TABLE hackernews_1m
(
    `id` BIGINT,
    `deleted` TINYINT,
    `type` String,
    `author` String,
    `timestamp` DateTime,
    `comment` String,
    `dead` TINYINT,
    `parent` BIGINT,
    `poll` BIGINT,
    `children` Array(BIGINT),
    `url` String,
    `score` INT,
    `title` String,
    `parts` Array(INT),
    `descendants` BIGINT
);

wget https://doris-build-1308700295.cos.ap-beijing.myqcloud.com/regression/index/hacknernews_1m.csv.gz

❯  curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"error":{"code":"400","message":"execute fail: Invalid value '\u001f�\b\b�\u000b�c\u0000\u0003hacknernews_1m.csv\u0000��K��:�%\b�o�\"\u0006p�L|�>��j�U3+;9K\u0002�\u0003[\u000b�GE�f�\u001d�$�\u0004A<S�OO�\u0013�������������' for column 0 (id Int64 NULL): invalid text for number\nat file 'hacknernews_1m.csv.gz', line 0"}}%

# works
❯  curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"00d5bae1-a3f8-4ce4-8dc2-c72998456002","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061025},"error":null,"files":["hacknernews_1m.csv"]}%
youngsofun commented 2 weeks ago

use (type = TSV compression=gzip) like in copy

 curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"id":"ed17ccdc-2ea2-4fac-9ff9-97b0f61fd487","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}% 
sundy-li commented 2 weeks ago

There are two issues.

  1. wrong stats stats":{"rows":2000000,"bytes":642061025}
  2. let compression = input_context
                    .get_compression_alg(&filename)
                    .map_err(BadRequest)?;

compression is wrong.

youngsofun commented 2 weeks ago

default compression is none not auto

youngsofun commented 2 weeks ago

stats is right on my mac.

the bytes you get is diff too, may be you have a diff hacknernews_1m.csv?


(venv) ➜  test git:(stage2) ls -lh hacknernews_1m.csv*
-rw-r--r--  1 yangxiufeng  staff   342M Sep  3  2023 hacknernews_1m.csv
-rw-r--r--  1 yangxiufeng  staff   125M Apr 30 16:42 hacknernews_1m.csv.gz

(venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=gzip) ' -F 'upload=@"./hacknernews_1m.csv.gz"'

{"id":"c0766b8a-77c5-477c-9a48-01e09a895ad1","state":"SUCCESS","stats":{"rows":1000000,"bytes":130618406},"error":null,"files":["hacknernews_1m.csv.gz"]}%                                                                         
(venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'

{"id":"08b3d058-13ab-4303-8925-e40ac971a2dd","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%                                                                            (venv) ➜  test git:(stage2) curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV  compression=none) ' -F 'upload=@"./hacknernews_1m.csv"'

{"id":"6b5a826b-ec49-47e0-a2ba-75bf372dd36f","state":"SUCCESS","stats":{"rows":1000000,"bytes":358551613},"error":null,"files":["hacknernews_1m.csv"]}%    
sundy-li commented 2 weeks ago
❯ curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H 'insert_sql: insert into hackernews_1m  FILE_FORMAT = (type = TSV) ' -F 'upload=@"./hacknernews_1m.csv"'
{"id":"546943ab-f100-4a39-aee8-adca6ca27596","state":"SUCCESS","stats":{"rows":2000000,"bytes":642061026},"error":null,"files":["hacknernews_1m.csv"]}%
❯ wc -l ./hacknernews_1m.csv
1000000 ./hacknernews_1m.csv
❯ ls -lh hacknernews_1m.csv*
-rw-r--r-- 1 sundy sundy 342M Sep  3  2023 hacknernews_1m.csv