Open alamb opened 3 months ago
I am pretty interested in this idea. I saw that duckdb just implemented Hugging Face authentication via
CREATE SECRET hf_token (
TYPE HUGGINGFACE,
TOKEN 'your_hf_token'
);
or
CREATE SECRET hf_token (
TYPE HUGGINGFACE,
PROVIDER CREDENTIAL_CHAIN
);
Is there an equivalent API in datafusion?
The equivalent can be specified as part of each external table definition. For example, see https://datafusion.apache.org/user-guide/cli/datasources.html#s3:
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
'aws.access_key_id' '******',
'aws.secret_access_key' '******',
'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';
This isn't quite as good as a secret that can be reused but it should work
take
Hey @alamb, I need some help while implementing the wildcard functions. Does datafusion support selecting from a wildcard of external files, e.g., select * from 's3://some-bucket/test/*.parquet'? I read through the docs but failed to find a proper example.
I don't think it supports wildcards. Instead, you can use a listing table to read all the files in a directory (that have the correct extension):
https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/directory/of/files'
OPTIONS ('has_header' 'true');
https://github.com/apache/datafusion/issues/11979 is probably related
Is your feature request related to a problem or challenge?
The DuckDB blog shows off a really cool new feature (access remote datasets from hugging face):
https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb
I think doing this with DataFusion would be quite cool and quite simple to implement. Being able to add such support quickly would be a good example of how datafusion's extensibility allows rapid feature development as well as being a cool project.
Describe the solution you'd like
I would like to support this type of query from datafusion-cli:
Describe alternatives you've considered
I think we can just follow the same model as the existing object store integration in datafusion-cli
https://github.com/apache/datafusion/blob/088ad010a6ceaa6a2e810d418a2370e45acf3d54/datafusion-cli/src/object_storage.rs#L419-L496
And register the hf url with a specially created Http object store instance.
Additional context
No response
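To make the idea of registering the hf url against an Http object store more concrete, here is a minimal sketch of the URL translation such an integration would probably need. It is not datafusion-cli code. The function name `hf_to_https` is hypothetical, the `hf://datasets/<user>/<dataset>/<path>` shape is assumed from the DuckDB blog post, and the `resolve/<revision>/` layout on huggingface.co is an assumption about how raw files are served.

```python
from urllib.parse import quote, urlsplit

# Assumption: raw dataset files are served under
#   https://huggingface.co/datasets/<user>/<dataset>/resolve/<revision>/<path>
HF_ENDPOINT = "https://huggingface.co"


def hf_to_https(url: str, revision: str = "main") -> str:
    """Translate hf://datasets/<user>/<dataset>/<path> into an HTTPS URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "hf":
        raise ValueError(f"not an hf:// URL: {url!r}")
    # urlsplit places the first segment ("datasets") in netloc and the rest in path.
    repo_type = parts.netloc
    segments = [s for s in parts.path.split("/") if s]
    if repo_type != "datasets" or len(segments) < 3:
        raise ValueError(f"expected hf://datasets/<user>/<dataset>/<path>, got {url!r}")
    user, dataset, *file_path = segments
    return "/".join([
        HF_ENDPOINT, repo_type, user, dataset, "resolve",
        quote(revision, safe=""),           # a revision may itself contain slashes
        *(quote(s) for s in file_path),     # percent-encode each path segment
    ])
```

For example, `hf_to_https("hf://datasets/user/ds/data/file.parquet")` yields `https://huggingface.co/datasets/user/ds/resolve/main/data/file.parquet`, which a plain HTTP object store could then fetch.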