apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Implement `hf://` / "hugging face" integration in datafusion-cli #10720

Open alamb opened 3 months ago

alamb commented 3 months ago

Is your feature request related to a problem or challenge?

The DuckDB blog shows off a really cool new feature (access remote datasets from hugging face):

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb

I think doing this with DataFusion would be quite cool and fairly simple to implement. Adding such support quickly would be a good example of how DataFusion's extensibility allows rapid feature development, as well as being a cool project.

Describe the solution you'd like

I would like to support this type of query from datafusion-cli:

SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';

Describe alternatives you've considered

I think we can follow the same model as the existing object store integration in datafusion-cli:

https://github.com/apache/datafusion/blob/088ad010a6ceaa6a2e810d418a2370e45acf3d54/datafusion-cli/src/object_storage.rs#L419-L496

and register the `hf://` URL with a specially created HTTP object store instance.
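As a rough illustration of the URL handling such an HTTP-backed store would need, here is a minimal sketch in plain Rust (standard library only). The helper name `hf_to_https` is hypothetical and not part of DataFusion. The `resolve/<revision>/` target layout, the optional `@<revision>` suffix, and the `main` default are assumptions based on how Hugging Face serves dataset files, not something this issue specifies.

```rust
/// Hypothetical helper (not a DataFusion API): translate an `hf://` dataset
/// URL into the HTTPS URL the file is assumed to be served from.
///
/// Assumed input:  hf://datasets/<owner>/<dataset>[@<revision>]/<path>
/// Assumed output: https://huggingface.co/datasets/<owner>/<dataset>/resolve/<revision>/<path>
fn hf_to_https(url: &str) -> Result<String, String> {
    let rest = url
        .strip_prefix("hf://datasets/")
        .ok_or_else(|| format!("not an hf://datasets/ URL: {url}"))?;

    // Split into owner, dataset (optionally carrying @revision), and file path.
    let mut parts = rest.splitn(3, '/');
    let owner = parts.next().filter(|s| !s.is_empty());
    let dataset = parts.next().filter(|s| !s.is_empty());
    let path = parts.next().filter(|s| !s.is_empty());

    match (owner, dataset, path) {
        (Some(owner), Some(dataset), Some(path)) => {
            // Assumption: fall back to the `main` revision when none is given.
            let (name, revision) = dataset.split_once('@').unwrap_or((dataset, "main"));
            Ok(format!(
                "https://huggingface.co/datasets/{owner}/{name}/resolve/{revision}/{path}"
            ))
        }
        _ => Err(format!(
            "expected hf://datasets/<owner>/<dataset>/<path>, got {url}"
        )),
    }
}
```

Under these assumptions, the example path from this issue, `hf://datasets/datasets-examples/doc-formats-csv-1/data.csv`, would map to `https://huggingface.co/datasets/datasets-examples/doc-formats-csv-1/resolve/main/data.csv`, which an HTTP object store could then fetch.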

Additional context

No response

xinlifoobar commented 3 months ago

I am pretty interested in this idea. I saw that DuckDB implements Hugging Face authentication via

 CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    TOKEN 'your_hf_token'
 );

or

 CREATE SECRET hf_token (
    TYPE HUGGINGFACE,
    PROVIDER CREDENTIAL_CHAIN
 );

Is there an equivalent API in datafusion?

alamb commented 3 months ago

> Is there an equivalent API in datafusion?

The equivalent can be specified as part of each external table definition. For example, see https://datafusion.apache.org/user-guide/cli/datasources.html#s3

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';

This isn't quite as good as a reusable secret, but it should work.

xinlifoobar commented 3 months ago

take

xinlifoobar commented 3 months ago

PR to reference: https://github.com/duckdb/duckdb/pull/11831/files#diff-e0d4fb8749355dd063169c27338bd119b7814546a06720ee1cd18caf83ad5106

xinlifoobar commented 3 months ago

Hey @alamb, I need some help while implementing the wildcard functions. Does DataFusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example.

alamb commented 3 months ago

> Hey @alamb, I need some help while implementing the wildcard functions. Does DataFusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example.

I don't think it supports wildcards. Instead, you can use a listing table to read all the files in a directory (that have the correct extension):

https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true');
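One way to bridge the two approaches would be to rewrite a simple trailing wildcard into the directory-plus-extension form a listing table expects. The sketch below is plain Rust with a hypothetical helper name, `split_simple_glob`. It is not an existing DataFusion function and only handles the assumed `<dir>/*.<ext>` shape.

```rust
/// Hypothetical helper (not a DataFusion API): split a pattern of the form
/// `<dir>/*.<ext>` into the directory to list and the extension to filter on.
/// Returns `None` for anything other than a single trailing `*.<ext>`.
fn split_simple_glob(pattern: &str) -> Option<(&str, &str)> {
    // Separate the final path segment from the directory part.
    let (dir, file) = pattern.rsplit_once('/')?;
    // The final segment must be exactly `*.<ext>`.
    let ext = file.strip_prefix("*.")?;
    // Reject empty extensions and wildcards anywhere else in the pattern.
    if ext.is_empty() || ext.contains('*') || dir.contains('*') {
        return None;
    }
    Some((dir, ext))
}
```

For a pattern such as `s3://some-bucket/test/*.parquet`, this yields the directory `s3://some-bucket/test` and the extension `parquet`, which could then be used as the `LOCATION` and file type of an external table like the one above.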

findepi commented 4 weeks ago

https://github.com/apache/datafusion/issues/11979 is probably related