apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.95k stars 1.13k forks source link

Files without `.parquet`, `.csv` extension inferred as having no schema #1736

Closed tustvold closed 1 year ago

tustvold commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I wrote a test approximating

let file = tempfile::tempfile();

// ... write parquet data ...

let mut context = ExecutionContext::new();
context.register_parquet("t", file.path().as_str())
context.sql("select column from t");

This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused :laughing:

Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet", this feels unexpected

Describe the solution you'd like

Either register_parquet should return an error if the extension is missing, or FileFormat::infer_schema should be more agnostic to file extensions.

alamb commented 2 years ago

Change this description to be a bug, which I think better reflects what is going on

Ted-Jiang commented 2 years ago

@tustvold may i ask you how to write a tmp parquet file, IMHP i think all parquet data file should end .parquet.

tustvold commented 2 years ago

@Ted-Jiang Remove this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 and that was what caused issues.

I don't feel strongly whether the .parquet file should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX :sweat_smile:

alamb commented 2 years ago

but I do think silently inferring an empty schema makes for a pretty poor UX 😅

I agree -- an error about "can not infer schema" seems much better than silently ignoring

andygrove commented 1 year ago

I plan on fixing this for 14.0.0

andygrove commented 1 year ago

Some notes from investigating this.

This is where the EmptyExec gets introduced. I ran into this issue in Ballista as well in https://github.com/apache/arrow-ballista/pull/414

        // if no files need to be read, return an `EmptyExec`
        if partitioned_file_lists.is_empty() {
            let schema = self.schema();
            let projected_schema = project_schema(&schema, projection.as_ref())?;
            return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
        }

The problem is I am registering a CSV data source and the filename does not end with .csv and we have hard-coded assumptions that CSV files must end with ".csv", so we cannot support .tsv or .tbl (without the user explicitly specifying this, and it is not possible to specify this in SQL as far as I know).

I think we just need to add an error check for this case and prompt the user to specify the file extension they want rather than use the default.

andygrove commented 1 year ago

I propose that the error check we add is that at least one file exists with the specified extension.

alamb commented 1 year ago

I propose that the error check we add is that at least one file exists with the specified extension.

Makes sense to me

liningpan commented 1 year ago

I also ran into this issue. The problem seems to be even if only a single file is provided, we still try to match by extension.

https://github.com/apache/arrow-datafusion/blob/0e5f6df2c4fb2d647874102a19c74eaaf7f34d98/datafusion/core/src/datasource/listing/url.rs#L143-L175

If my use case always uses a single csv file for each table, would a reasonable workaround be setting file extension to an empty string in CsvReadOptions? Are there any problems?

ctx.register_csv(
    name,
    path,
    CsvReadOptions::new().file_extension("")
).await?
andygrove commented 1 year ago

Maybe file_extension could be made an Option<String>, defaulting to None?

alamb commented 1 year ago

I tried this usecase with the fix in https://github.com/apache/arrow-datafusion/issues/6147 from @aprimadi and it works great :

(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ head /tmp/test/customer
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.29s
     Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE customer STORED AS CSV DELIMITER '|' LOCATION  '/tmp/test/customer';
0 rows in set. Query took 0.290 seconds.
❯ select * from customer limit 1;
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| column_1 | column_2           | column_3          | column_4 | column_5        | column_6 | column_7 | column_8                                                       | column_9 |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| 1        | Customer#000000001 | IVhzIApeRb ot,c,E | 15       | 25-989-741-2988 | 711.56   | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |          |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
1 row in set. Query took 0.065 seconds.