GlareDB / glaredb

GlareDB: An analytics DBMS for distributed data
https://glaredb.com
GNU Affero General Public License v3.0

Read over http(s)? #3134

Open jgranduel opened 2 months ago

jgranduel commented 2 months ago

Description

With DuckDB, one can read CSV, JSON, or Parquet files that are accessed through a web server, e.g.:

SELECT *
FROM read_parquet('https://some.url/some_file.parquet');

See the DuckDB Parquet documentation.

Will you implement such a feature in GlareDB? Authentication might be tricky in some cases (I thought it was present in DataFusion, but I must be wrong, as I couldn't find any reference).

tychoish commented 2 months ago

This should work in many cases today!

As you say, there are a bunch of edge cases around authentication, but it definitely works:

I just ran:

select count(*) from read_parquet('https://huggingface.co/api/datasets/RLHFlow/Orca-distibalel-standard/parquet/default/train/0.parquet');

And it worked just fine.

Our read_json() function also supports reading top-level JSON arrays (as you might expect from some public APIs), as well as streams of newline-delimited and/or multi-line JSON objects.
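
For example (illustrative placeholder URLs and data, not real endpoints), both of these shapes should be readable with the same call:

-- top-level JSON array: [{"a": 1}, {"a": 2}]
select * from read_json('https://example.com/array.json');
-- newline-delimited objects: {"a": 1} and {"a": 2} on separate lines
select * from read_json('https://example.com/objects.ndjson');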

Let me know if you have specific edge cases that you'd like to address or other concerns!

jgranduel commented 2 months ago

Thanks for your response.

Your example (select count(*) from read_parquet('https://huggingface.co/api/da[..]')) indeed worked on my side as well. I tried another example that we can share if needed (https://reqres.in/api/users) and it also worked well, except that I don't know how to unnest a JSON array as in DuckDB:

with t(data) as (select unnest(data) from read_json('https://reqres.in/api/users'))
select data.id, data.email, data.first_name, data.last_name, data.avatar from t;

but that's another question 😉.

So I tried to reproduce what I did yesterday. After trying with a local file, I wanted to check whether I could load data over HTTP using simple HTTP file servers. I tried with miniserve, and it didn't work. I also tried today with another very basic web server (jwebserver), with the same result, and finally with hfs, which worked. The difference is that hfs serves on port 80 by default 💡.

Here are my tests:

glaredb-101> cat .\data.csv
a,b
1,2
glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

Then a simple test with miniserve (its default port is 8080):

glaredb-101> miniserve.exe .
miniserve v0.22.0
Bound to [::]:8080, 0.0.0.0:8080
Serving path ...glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:8080
    http://<my ip>:8080

glaredb-101> glaredb.exe
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost:8080/data.csv');
Error: External error: External error: invalid port number
> select * from read_csv('http://127.0.0.1:8080/data.csv');
Error: External error: External error: invalid IPv4 address
> select * from read_csv('http://<my ip>:8080/data.csv');
Error: External error: External error: invalid IPv4 address

Then miniserve running on port 80:

glaredb-101> miniserve.exe -p 80 .
miniserve v0.22.0
Bound to [::]:80, 0.0.0.0:80
Serving path ...\glaredb-101
Available at (non-exhaustive list):
    http://127.0.0.1:80
    http://<my ip>:80

glaredb-101> glaredb
GlareDB (v0.9.4)
Type \help for help.
> select * from read_csv('http://localhost/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://127.0.0.1/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://<my ip>/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘
> select * from read_csv('http://localhost:80/data.csv');
┌───────┬───────┐
│     a │     b │
│    ── │    ── │
│ Int64 │ Int64 │
╞═══════╪═══════╡
│     1 │     2 │
└───────┴───────┘

Using localhost:80 works well. Is there perhaps a parsing error with non-default ports, i.e. an issue with URL parsing?

tychoish commented 2 months ago

Using localhost:80 works well. Is there perhaps a parsing error with non-default ports, i.e. an issue with URL parsing?

That seems like a quite reasonable deduction. It's probably the case that the port is getting dropped somewhere. I suspect that even when you explicitly reference :80 the port is being stripped out, but it's "fine" in that case because for HTTP requests the spec says to assume port 80 when none is given.

I've done some poking around and nothing is jumping out at me in either the DataFusion code or the GlareDB code; I'll continue to look.

except that I don't know how to unnest a JSON array as in DuckDB

GlareDB will unwind top-level JSON arrays in JSON source data automatically, on the theory that a top-level array of JSON objects can mean nothing else (i.e. it would be an error to interpret it any other way).
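
As an illustration (hypothetical URL and fields): if an endpoint returns a top-level array such as [{"id": 1, "email": "a@example.com"}, ...], presumably each object becomes a row and its fields can be selected directly, without an explicit unnest:

select id, email from read_json('https://example.com/users.json');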

Very shortly, GlareDB will allow you to filter JSON data through jaq in read_json and in the configuration of external JSON tables. This will let you restructure JSON objects as they're ingested, if you need something else.