duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.14k stars 121 forks source link

file.csv.gz does not work in duckdb-wasm #1172

Open peterboncz opened 1 year ago

peterboncz commented 1 year ago

These queries:

select * from FROM 'https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/GaiaSource_000000-003111.csv.gz' limit 100
select * from FROM 'https://www.cwi.nl/~boncz/gaia99.csv.gz' limit 100

work in normal duckdb (with httpfs loaded -- if not, they work with http:// URLs), but not in duckdb-wasm:

select * from FROM 'https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/GaiaSource_000000-003111.csv.gz' limit 100

produces:

Invalid Input Error: Error in file "https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/GaiaSource_000000-003111.csv.gz": CSV options could not be auto-detected. Consider setting parser options manually.

while

select * from FROM 'https://www.cwi.nl/~boncz/gaia99.csv.gz' limit 100

produces:

Invalid Error: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://www.cwi.nl/~boncz/gaia99.csv.gz'.

which is remarkable, but may have to do with the different behaviour of the latter website.

This bug was first raised as a MotherDuck issue, but the above failures simply happen (in Chrome on an M1 MacBook) on shell.duckdb.org -- note that the same problems happen when wrapping the URL in read_csv_auto()

The main theory is that there is something awry with the duckdb-wasm-specific implementation of the httpfs extension. It seems to fail to read data when triggered in SniffCSV (duckdb/src/execution/operator/persistent/buffered_csv_reader.cpp )

Specifically regarding the .gz unzipping. Because the following query works fine in duckdb-wasm:

SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb/master/data/csv/customer.csv' LIMIT 100;

Note that there might be a relation with: https://github.com/duckdb/duckdb-wasm/issues/1166

kimmolinna commented 1 year ago

Have you tried to add the third parameter for registerFileURL?

await db.registerFileURL('remote.parquet', 'https://origin/remote.parquet', DuckDBDataProtocol.HTTP, false);

Because for example in ObservableHQ this is the issue. Clip from stdlib/duckdb.js

 await database.registerFileURL(file.name, url);

The third parameter is missing. @domoritz's example (#1166) is working when you add the third parameter for registerFileURL.

peterboncz commented 1 year ago

Hi @kimmolinna: this bug report its not about registerFileURL. Apologies for causing confusion by mentioning #1166

kimmolinna commented 1 year ago

@peterboncz The error message looks similar. So It should be checked that the duckdb-wasm internally remembers to use also the third parameter. ;)

peterboncz commented 1 year ago

ah, exactly, that was indeed why I mentioned #1166. I did not venture to look at the httpfs implementation of duckdb-wasm yet.

ankoh commented 1 year ago
Screenshot 2023-03-02 at 22 06 35

There are a few things happening here. The request to cwi.nl is blocked due to CORS. The sniffing is failing as .gz is apparently not working with duckdb-wasm. Your gaia99 file works for me unzipped.

ankoh commented 1 year ago

You can bypass the CORS troubles for testing using:

#!/usr/bin/env python3
from http.server import HTTPServer, SimpleHTTPRequestHandler, test
import sys

class CORSRequestHandler (SimpleHTTPRequestHandler):
    def end_headers (self):
        self.send_header('Access-Control-Allow-Origin', '*')
        SimpleHTTPRequestHandler.end_headers(self)

if __name__ == '__main__':
    test(CORSRequestHandler, HTTPServer, port=int(sys.argv[1]) if len(sys.argv) > 1 else 8000)
kimmolinna commented 1 year ago

@ankoh I don't know does this affect anything but DataProtocols in web_file.rsare

pub enum DataProtocol {
    Buffer = 0,
    Native = 1,
    Http = 2,
}

in DuckDB are:

enum DataProtocol : uint8_t {
        BUFFER = 0,
        NODE_FS = 1,
        BROWSER_FILEREADER = 2,
        BROWSER_FSACCESS = 3,
        HTTP = 4,
        S3 = 5
peterboncz commented 1 year ago

@ankoh thanks for clarifying the CORS issue, but that is a bit of a distraction really. The main point being .gz not working in duckdb-wasm. Do I understand correctly that this is just not supported?

In that case a better error message would be a fix, if it is not possible to support .gz decompression in the httpfs implementation. I suppose that the original filer of the bug report would be most satisfied if that would be supported.

ankoh commented 1 year ago

@ankoh thanks for clarifying the CORS issue, but that is a bit of a distraction really. The main point being .gz not working in duckdb-wasm. Do I understand correctly that this is just not supported? Yes, explicitly gzipped files are currently not supported (unless it goes through http content-encoding, i assume).

@peterboncz Right now, we're running a dedicated file_system here https://github.com/duckdb/duckdb-wasm/blob/master/lib/src/webdb.cc#L805 ... and will therefore never use the gzip filesystem through the VirtualFilesystem of native DuckDB. https://github.com/duckdb/duckdb/blob/master/src/common/virtual_file_system.cpp#L19 But this should be easily fixable by either wrapping the buffered filesystem into DuckDB's virtual filesystem or teaching one of the wasm filesystems about the existing Gzip filesystem.

In that case a better error message would be a fix, if it is not possible to support .gz decompression in the httpfs implementation. I suppose that the original filer of the bug report would be most satisfied if that would be supported.

Totally agree, duckdb-wasm should get better at error reporting.

ankoh commented 1 year ago

@ankoh I don't know does this affect anything but DataProtocols in web_file.rsare

Good catch. that enum needs to be updated, but it only affects the info text printed in the .files command in the shell

ericemc3 commented 8 months ago

is opening csv.gz files on the roadmap?

Silvan-Fischer commented 6 months ago

Any news, workaround on support for .gz in wasm? I trying to load data from the eurostat api and even if i turn compression off it is still not working in wasm SELECT * FROM read_json('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false', auto_detect=true, format='auto');

ericemc3 commented 6 months ago

Any news, workaround on support for .gz in wasm? I trying to load data from the eurostat api and even if i turn compression off it is still not working in wasm SELECT * FROM read_json('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false', auto_detect=true, format='auto');

This URL returns csv, you might try: FROM read_csv_auto('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false')

wldevries commented 1 month ago

This problem applies to the json loader as well. Output below is from the online demo app.

duckdb> select * from read_json(mileusage.jsonl.gz, compression=gzip, format=newline_delimited);
Invalid Input Error: Malformed JSON in file "mileusage.jsonl.gz", at byte 1 in line 2: unexpected character.