Open peterboncz opened 1 year ago
Have you tried to add the third parameter for registerFileURL?
await db.registerFileURL('remote.parquet', 'https://origin/remote.parquet', DuckDBDataProtocol.HTTP, false);
For example, this is the issue in ObservableHQ. Clip from stdlib/duckdb.js:
await database.registerFileURL(file.name, url);
The third parameter is missing.
@domoritz's example (#1166) works when you add the third parameter to registerFileURL.
Hi @kimmolinna: this bug report is not about registerFileURL. Apologies for causing confusion by mentioning #1166.
@peterboncz The error message looks similar, so it should be checked that duckdb-wasm internally remembers to pass the third parameter as well. ;)
ah, exactly, that was indeed why I mentioned #1166. I did not venture to look at the httpfs implementation of duckdb-wasm yet.
There are a few things happening here. The request to cwi.nl is blocked due to CORS. The sniffing is failing as .gz is apparently not working with duckdb-wasm. Your gaia99 file works for me unzipped.
You can bypass the CORS troubles for testing using:
#!/usr/bin/env python3
from http.server import HTTPServer, SimpleHTTPRequestHandler, test
import sys

class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header('Access-Control-Allow-Origin', '*')
        SimpleHTTPRequestHandler.end_headers(self)

if __name__ == '__main__':
    test(CORSRequestHandler, HTTPServer, port=int(sys.argv[1]) if len(sys.argv) > 1 else 8000)
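The helper above only needs to add one response header. A quick self-check (a sketch, using a hypothetical throwaway server on a free port) confirms the wildcard origin is actually sent:

```python
# Sketch: start the CORS handler above on an ephemeral port and verify that
# the Access-Control-Allow-Origin header reaches the client.
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header('Access-Control-Allow-Origin', '*')
        SimpleHTTPRequestHandler.end_headers(self)

# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), CORSRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = 'http://127.0.0.1:%d/' % server.server_address[1]
with urllib.request.urlopen(url) as resp:
    cors = resp.headers.get('Access-Control-Allow-Origin')
server.shutdown()
print(cors)  # the wildcard origin the browser needs
```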
@ankoh I don't know whether this affects anything, but the DataProtocol enum in web_file.rs is

pub enum DataProtocol {
    Buffer = 0,
    Native = 1,
    Http = 2,
}
while in DuckDB it is:

enum DataProtocol : uint8_t {
    BUFFER = 0,
    NODE_FS = 1,
    BROWSER_FILEREADER = 2,
    BROWSER_FSACCESS = 3,
    HTTP = 4,
    S3 = 5
};
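Taken at face value, the two definitions disagree on the numeric value of HTTP, which is easy to check (values transcribed from the enums quoted above):

```python
# The two protocol enums quoted above, transcribed as plain dicts.
wasm_protocols = {'Buffer': 0, 'Native': 1, 'Http': 2}
duckdb_protocols = {'BUFFER': 0, 'NODE_FS': 1, 'BROWSER_FILEREADER': 2,
                    'BROWSER_FSACCESS': 3, 'HTTP': 4, 'S3': 5}

# HTTP is 2 on the wasm-shell side but 4 on the DuckDB side, so raw numeric
# protocol values are not interchangeable between the two definitions.
print(wasm_protocols['Http'], duckdb_protocols['HTTP'])  # → 2 4
```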
@ankoh thanks for clarifying the CORS issue, but that is a bit of a distraction really. The main point being .gz not working in duckdb-wasm. Do I understand correctly that this is just not supported?
In that case a better error message would be a fix, if .gz decompression cannot be supported in the httpfs implementation. I suppose the original filer of the bug report would be most satisfied if it were supported.
> @ankoh thanks for clarifying the CORS issue, but that is a bit of a distraction really. The main point being .gz not working in duckdb-wasm. Do I understand correctly that this is just not supported?

Yes, explicitly gzipped files are currently not supported (unless it goes through HTTP content-encoding, I assume).
@peterboncz Right now, we're running a dedicated filesystem here: https://github.com/duckdb/duckdb-wasm/blob/master/lib/src/webdb.cc#L805 ... and will therefore never use the gzip filesystem through the VirtualFileSystem of native DuckDB: https://github.com/duckdb/duckdb/blob/master/src/common/virtual_file_system.cpp#L19

But this should be easily fixable by either wrapping the buffered filesystem into DuckDB's virtual filesystem or teaching one of the wasm filesystems about the existing gzip filesystem.
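The "wrap it in a virtual filesystem" idea can be sketched in a few lines. This is only an illustration of the dispatch pattern in Python, not DuckDB's actual C++ API; PlainFS and VirtualFS are hypothetical names:

```python
# Sketch of the proposed fix: a virtual filesystem that sniffs the gzip
# magic bytes and transparently decompresses, analogous to how native
# DuckDB's VirtualFileSystem routes reads through its gzip filesystem.
import gzip
import io

class PlainFS:
    """Stands in for the wasm buffered filesystem: returns raw bytes."""
    def __init__(self, files):
        self.files = files  # name -> raw bytes

    def open(self, name):
        return io.BytesIO(self.files[name])

class VirtualFS:
    """Dispatcher: sniff the gzip magic number and wrap when needed."""
    def __init__(self, inner):
        self.inner = inner

    def open(self, name):
        handle = self.inner.open(name)
        magic = handle.read(2)
        handle.seek(0)
        if magic == b'\x1f\x8b':  # gzip magic number
            return gzip.GzipFile(fileobj=handle)
        return handle

fs = VirtualFS(PlainFS({'gaia99.csv.gz': gzip.compress(b'a,b\n1,2\n')}))
data = fs.open('gaia99.csv.gz').read()
print(data)  # decompressed CSV bytes, ready for the sniffer
```

The point of the sketch: the reader code never changes, only the filesystem handed to it does.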
> In that case a better error message would be a fix, if .gz decompression cannot be supported in the httpfs implementation. I suppose the original filer of the bug report would be most satisfied if it were supported.
Totally agree, duckdb-wasm should get better at error reporting.
> @ankoh I don't know whether this affects anything, but the DataProtocol enum in web_file.rs is

Good catch. That enum needs to be updated, but it only affects the info text printed by the .files command in the shell.
is opening csv.gz files on the roadmap?
Any news or workaround on support for .gz in wasm? I'm trying to load data from the Eurostat API, and even if I turn compression off it still doesn't work in wasm:
SELECT * FROM read_json('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false', auto_detect=true, format='auto');
> Any news or workaround on support for .gz in wasm? I'm trying to load data from the Eurostat API, and even if I turn compression off it still doesn't work in wasm:
>
> SELECT * FROM read_json('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false', auto_detect=true, format='auto');
This URL returns CSV, so you might try:
FROM read_csv_auto('https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/ISOC_CI_ID_H/1.0?format=csvdata&compress=false')
This problem applies to the json loader as well. Output below is from the online demo app.
duckdb> select * from read_json(mileusage.jsonl.gz, compression=gzip, format=newline_delimited);
Invalid Input Error: Malformed JSON in file "mileusage.jsonl.gz", at byte 1 in line 2: unexpected character.
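That "unexpected character" error is consistent with the reader parsing raw gzip bytes as JSON: every gzip stream starts with the magic bytes 0x1f 0x8b, which no JSON (or CSV) parser will accept. A quick check (the sample records here are made up):

```python
# A gzip stream always starts with the magic bytes 0x1f 0x8b, so a JSON
# parser handed the raw compressed bytes fails on the very first character.
import gzip

raw = gzip.compress(b'{"miles": 12}\n{"miles": 7}\n')  # stand-in for mileusage.jsonl.gz
print(raw[:2].hex())  # → '1f8b', not valid JSON
```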
These queries:
work in normal duckdb (with httpfs loaded -- if not, they work with http:// URLs), but not in duckdb-wasm:
produces:
Invalid Input Error: Error in file "https://cdn.gea.esac.esa.int/Gaia/gdr3/gaia_source/GaiaSource_000000-003111.csv.gz": CSV options could not be auto-detected. Consider setting parser options manually.
while
produces:
Invalid Error: NetworkError: Failed to execute 'send' on 'XMLHttpRequest': Failed to load 'https://www.cwi.nl/~boncz/gaia99.csv.gz'.
which is remarkable, but may have to do with the different behaviour of the latter website.
This bug was first raised as a MotherDuck issue, but the above failures simply happen (in Chrome on an M1 MacBook) on shell.duckdb.org -- note that the same problems happen when wrapping the URL in read_csv_auto()
The main theory is that there is something awry with the duckdb-wasm-specific implementation of the httpfs extension. It seems to fail to read data when triggered in SniffCSV (duckdb/src/execution/operator/persistent/buffered_csv_reader.cpp).
This is specifically about the .gz unzipping, because the following query works fine in duckdb-wasm:
Note that there might be a relation with: https://github.com/duckdb/duckdb-wasm/issues/1166