ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
36.62k stars 6.77k forks source link

Support for `_path` and `_file` virtual columns for table function `url`. #49231

Open alexey-milovidov opened 1 year ago

alexey-milovidov commented 1 year ago
play-eu :) SELECT _file, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString) GROUP BY _file ORDER BY _file

SELECT
    _file,
    count()
FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString)
GROUP BY _file
ORDER BY _file ASC

Query id: a208a18c-02f9-47cc-8848-a6895bba3b49

0 rows in set. Elapsed: 0.023 sec. 

Received exception from server (version 23.4.1):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Missing columns: '_file' while processing query: 'SELECT _file, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', 'LineAsString') GROUP BY _file ORDER BY _file ASC', required columns: '_file'. (UNKNOWN_IDENTIFIER)

play-eu :) SELECT _path, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString) GROUP BY _path ORDER BY _path

SELECT
    _path,
    count()
FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString)
GROUP BY _path
ORDER BY _path ASC

Query id: 495fdc97-0069-4da7-b2ac-af015fb45f25

0 rows in set. Elapsed: 0.025 sec. 

Received exception from server (version 23.4.1):
Code: 47. DB::Exception: Received from localhost:9000. DB::Exception: Missing columns: '_path' while processing query: 'SELECT _path, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', 'LineAsString') GROUP BY _path ORDER BY _path ASC', required columns: '_path'. (UNKNOWN_IDENTIFIER)
Ziy1-Tan commented 1 year ago

Is this two separate tasks from https://github.com/ClickHouse/ClickHouse/issues/49232 ?

alexey-milovidov commented 1 year ago

Yes, but they can be solved at once.

alexey-milovidov commented 1 year ago

@Ziy1-Tan, it does not work at all.

danthegoodman1 commented 1 year ago

It would also be preferred if these pruned files before reading them with glob pattern

Ziy1-Tan commented 1 year ago

@Ziy1-Tan, it does not work at all.

Oh, which case it does work? Let me check it.

alexey-milovidov commented 1 year ago

@Ziy1-Tan the examples from the description:

$ clickhouse-local 
ClickHouse local version 23.8.1.1.

milovidov-desktop :) SELECT _file, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString) GROUP BY _file ORDER BY _file

SELECT
    _file,
    count()
FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString)
GROUP BY _file
ORDER BY _file ASC

Query id: 46765e3a-73fa-4da4-b032-669fc5b0a07b

↘ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)  0%
0 rows in set. Elapsed: 0.774 sec. 

Received exception:
Code: 354. DB::Exception: inflate failed: buffer error: While executing LineAsStringRowInputFormat: While executing URL. (ZLIB_INFLATE_FAILED)

milovidov-desktop :) SELECT _path, count() FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString) GROUP BY _path ORDER BY _path

SELECT
    _path,
    count()
FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/wikistat/original/pageviews-20200101-00{00..59}00.gz', LineAsString)
GROUP BY _path
ORDER BY _path ASC

Query id: c40d5230-4921-465b-80a5-0a4a7cb2686b

← Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)  0%
0 rows in set. Elapsed: 0.784 sec. 

Received exception:
Code: 354. DB::Exception: inflate failed: buffer error: While executing LineAsStringRowInputFormat: While executing URL. (ZLIB_INFLATE_FAILED)
alexey-milovidov commented 12 months ago

@Ziy1-Tan, are you still working on this?

Ziy1-Tan commented 12 months ago

@Ziy1-Tan, are you still working on this?

sorry, I'm busy recently. Contributions welcome:).

alexey-milovidov commented 12 months ago

No problem - we will open it for other contributors.