bug: SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer. #14856

Open zhicwu opened 6 months ago

zhicwu commented 6 months ago

Search before asking

Version

nightly

What's Wrong?

select * from 'https://domain.name/test.parquet' ended up with the error below. The same query works well on both DuckDB and ClickHouse.

SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.

How to Reproduce?

Issue the query select * from 'https://domain.name/test.parquet' using the latest JDBC driver against a nightly build. Make sure the web server only responds with 200 (without headers such as Content-Length) to HEAD requests:

curl -I -v 'https://domain.name/test.parquet'
...
> HEAD /test.parquet HTTP/1.1
> User-Agent: curl/7.29.0
> Host: domain.name
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 06 Mar 2024 07:48:06 GMT
Date: Wed, 06 Mar 2024 07:48:06 GMT

FYI, https://domain.name/test.parquet is NOT a static file. Its content is generated for each GET request and backed by a short-lived cache. It would be great if Databend could still query the Parquet file without knowing its size in advance.

Are you willing to submit PR?

sundy-li commented 6 months ago

It would be great if Databend could still query the Parquet file without knowing its size in advance.

Currently, select from a URI depends on the Content-Length of the response.
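
As a side note, a minimal sketch of that dependency (not Databend's actual code path; it assumes the reqwest and tokio crates and reuses the reporter's placeholder URL). The size has to be learned up front, typically from Content-Length, before any footer-relative read can be planned:

```rust
// Sketch only: why a missing Content-Length blocks the query.
// Assumes the `reqwest` and `tokio` crates; the URL is the reporter's placeholder.
use reqwest::header::CONTENT_LENGTH;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::Client::new()
        .head("https://domain.name/test.parquet")
        .send()
        .await?;

    // Read the Content-Length header explicitly; the server in this issue omits it.
    let size = resp
        .headers()
        .get(CONTENT_LENGTH)
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse::<u64>().ok());

    match size {
        // With a known size, a reader can plan a ranged GET for the last 8 bytes
        // (metadata length + "PAR1" magic) and then fetch the footer metadata.
        Some(len) if len >= 8 => println!("size {len}, footer tail at offset {}", len - 8),
        Some(len) => println!("size {len} is smaller than the 8-byte footer"),
        None => println!("no Content-Length; cannot locate the Parquet footer"),
    }
    Ok(())
}
```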

youngsofun commented 3 months ago

2 choices:

  1. treat HTTP specially and read it as a stream
  2. report an error when there is no Content-Length header, but we are not sure whether the opendal interface lets us detect that @Xuanwo
    // opendal's Metadata::content_length(): if the backend never reported a
    // Content-Length, the inner Option is None and this falls back to 0, so the
    // caller cannot tell a missing header apart from a zero-byte file.
    pub fn content_length(&self) -> u64 {
        debug_assert!(
            self.metakey.contains(Metakey::ContentLength)
                || self.metakey.contains(Metakey::Complete),
            "visiting not set metadata: content_length, maybe a bug"
        );

        self.content_length.unwrap_or_default()
    }
Xuanwo commented 3 months ago

We can't support reading Parquet without knowing its length, since we have to read from the end of the file to get its metadata.
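
To make the constraint concrete: a Parquet file ends with [metadata][4-byte metadata length]["PAR1" magic], so a reader has to seek relative to the end of the file, which is only possible once the total size is known. A minimal sketch using just the standard library against a local placeholder file (test.parquet is hypothetical):

```rust
// Sketch: locating the Parquet footer requires the total file size.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut f = File::open("test.parquet")?; // placeholder path
    let size = f.metadata()?.len();
    // This is the check behind "Size is smaller than footer" in the reported error.
    assert!(size >= 8, "size is smaller than the 8-byte footer");

    // Seek relative to the end; over HTTP this becomes a suffix range request
    // (Range: bytes=-8), which the server can only honor if it knows the size.
    f.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    f.read_exact(&mut tail)?;

    let meta_len = u32::from_le_bytes(tail[0..4].try_into().unwrap());
    assert_eq!(&tail[4..8], b"PAR1", "not a Parquet file");
    println!("footer metadata is {meta_len} bytes, ending at offset {}", size - 8);
    Ok(())
}
```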

youngsofun commented 3 months ago

We can't support reading Parquet without knowing its length, since we have to read from the end of the file to get its metadata.

Yes, although we could read the whole file into memory first.

But maybe we do not want that for COPY, and it would require some changes.

And for querying a stage, we currently need to read the schema for binding.
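
For reference, the "read the whole file into memory first" approach could look roughly like the sketch below. It assumes the reqwest (blocking), bytes, and parquet crates and is only an illustration, not how Databend's COPY or stage binding is wired; once the body is fully buffered, the length is known and footer-relative reads work again:

```rust
// Sketch: buffer the whole HTTP body in memory, then read Parquet from the buffer.
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder URL from the report; the server may omit Content-Length.
    let body: Bytes = reqwest::blocking::get("https://domain.name/test.parquet")?.bytes()?;

    // `Bytes` implements `ChunkReader`, so the reader can seek within the buffer,
    // including the end-of-file footer that otherwise needs a known length.
    let builder = ParquetRecordBatchReaderBuilder::try_new(body)?;
    println!("schema for binding: {:?}", builder.schema());

    for batch in builder.build()? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```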