bug: SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer. #14856

Open zhicwu opened 6 months ago

zhicwu commented 6 months ago

Search before asking

Version

nightly

What's Wrong?

select * from 'https://domain.name/test.parquet' ended up with the error below. The same query works well on both DuckDB and ClickHouse.

SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.

How to Reproduce?

Issue the query select * from 'https://domain.name/test.parquet' using the latest JDBC driver against a nightly build. Make sure the web server only responds with 200 (without headers such as Content-Length) to HEAD requests:

curl -I -v 'https://domain.name/test.parquet'
...
> HEAD /test.parquet HTTP/1.1
> User-Agent: curl/7.29.0
> Host: domain.name
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 06 Mar 2024 07:48:06 GMT
Date: Wed, 06 Mar 2024 07:48:06 GMT

FYI, https://domain.name/test.parquet is NOT a static file. Its content is generated for each GET request and backed by a short-lived cache. It would be great if Databend could still query the Parquet file without knowing its size in advance.

Are you willing to submit PR?

sundy-li commented 6 months ago

It would be great if Databend could still query the Parquet file without knowing its size in advance.

Currently, select from a URI depends on the Content-Length of the response.
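
As a side note, a minimal sketch of that dependency (not Databend's actual code path; it assumes the reqwest and tokio crates and reuses the reporter's placeholder URL). The size has to be learned up front, typically from Content-Length, before any footer-relative read can be planned:

```rust
// Sketch only: why a missing Content-Length blocks the query.
// Assumes the `reqwest` and `tokio` crates; the URL is the reporter's placeholder.
use reqwest::header::CONTENT_LENGTH;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let resp = reqwest::Client::new()
        .head("https://domain.name/test.parquet")
        .send()
        .await?;

    // Read the Content-Length header explicitly; the server in this issue omits it.
    let size = resp
        .headers()
        .get(CONTENT_LENGTH)
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse::<u64>().ok());

    match size {
        // With a known size, a reader can plan a ranged GET for the last 8 bytes
        // (metadata length + "PAR1" magic) and then fetch the footer metadata.
        Some(len) if len >= 8 => println!("size {len}, footer tail at offset {}", len - 8),
        Some(len) => println!("size {len} is smaller than the 8-byte footer"),
        None => println!("no Content-Length; cannot locate the Parquet footer"),
    }
    Ok(())
}
```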

youngsofun commented 3 months ago

2 choices:

  1. treat HTTP specially and read it as a stream
  2. report an error when there is no Content-Length header, but we are not sure whether the opendal interface lets us detect that @Xuanwo
    // opendal's Metadata::content_length(): if the backend never reported a
    // Content-Length, the inner Option is None and this falls back to 0, so the
    // caller cannot tell a missing header apart from a zero-byte file.
    pub fn content_length(&self) -> u64 {
        debug_assert!(
            self.metakey.contains(Metakey::ContentLength)
                || self.metakey.contains(Metakey::Complete),
            "visiting not set metadata: content_length, maybe a bug"
        );

        self.content_length.unwrap_or_default()
    }
Xuanwo commented 3 months ago

We can't support reading Parquet without knowing its length, since we have to read from the end of the file to get its metadata.
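
To make the constraint concrete: a Parquet file ends with [metadata][4-byte metadata length]["PAR1" magic], so a reader has to seek relative to the end of the file, which is only possible once the total size is known. A minimal sketch using just the standard library against a local placeholder file (test.parquet is hypothetical):

```rust
// Sketch: locating the Parquet footer requires the total file size.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

fn main() -> std::io::Result<()> {
    let mut f = File::open("test.parquet")?; // placeholder path
    let size = f.metadata()?.len();
    // This is the check behind "Size is smaller than footer" in the reported error.
    assert!(size >= 8, "size is smaller than the 8-byte footer");

    // Seek relative to the end; over HTTP this becomes a suffix range request
    // (Range: bytes=-8), which the server can only honor if it knows the size.
    f.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    f.read_exact(&mut tail)?;

    let meta_len = u32::from_le_bytes(tail[0..4].try_into().unwrap());
    assert_eq!(&tail[4..8], b"PAR1", "not a Parquet file");
    println!("footer metadata is {meta_len} bytes, ending at offset {}", size - 8);
    Ok(())
}
```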

youngsofun commented 3 months ago

We can't support reading Parquet without knowing its length, since we have to read from the end of the file to get its metadata.

Yes, although we could read the whole file into memory first.

But maybe we do not want that for COPY, and it would require some changes.

And for querying a stage, we currently need to read the schema for binding.
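
For reference, the "read the whole file into memory first" approach could look roughly like the sketch below. It assumes the reqwest (blocking), bytes, and parquet crates and is only an illustration, not how Databend's COPY or stage binding is wired; once the body is fully buffered, the length is known and footer-relative reads work again:

```rust
// Sketch: buffer the whole HTTP body in memory, then read Parquet from the buffer.
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder URL from the report; the server may omit Content-Length.
    let body: Bytes = reqwest::blocking::get("https://domain.name/test.parquet")?.bytes()?;

    // `Bytes` implements `ChunkReader`, so the reader can seek within the buffer,
    // including the end-of-file footer that otherwise needs a known length.
    let builder = ParquetRecordBatchReaderBuilder::try_new(body)?;
    println!("schema for binding: {:?}", builder.schema());

    for batch in builder.build()? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```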