Closed tustvold closed 1 year ago
Change this description to be a bug, which I think better reflects what is going on
@tustvold may i ask you how to write a tmp parquet file, IMHP i think all parquet data file should end .parquet.
@Ted-Jiang Remove this line in the SQL benchmark https://github.com/apache/arrow-datafusion/pull/1738/files#diff-d1dbff8af63c3a3fe4d918432f982181b40fa4b7e1641522a6a48904f521fc89R143 and that was what caused issues.
I don't feel strongly whether the .parquet
file should be mandatory or not, but I do think silently inferring an empty schema makes for a pretty poor UX :sweat_smile:
but I do think silently inferring an empty schema makes for a pretty poor UX 😅
I agree -- an error about "can not infer schema" seems much better than silently ignoring
I plan on fixing this for 14.0.0
Some notes from investigating this.
This is where the EmptyExec
gets introduced. I ran into this issue in Ballista as well in https://github.com/apache/arrow-ballista/pull/414
// if no files need to be read, return an `EmptyExec`
if partitioned_file_lists.is_empty() {
let schema = self.schema();
let projected_schema = project_schema(&schema, projection.as_ref())?;
return Ok(Arc::new(EmptyExec::new(false, projected_schema)));
}
The problem is I am registering a CSV data source and the filename does not end with .csv
and we have hard-coded assumptions that CSV files must end with ".csv"
, so we cannot support .tsv
or .tbl
(without the user explicitly specifying this, and it is not possible to specify this in SQL as far as I know).
I think we just need to add an error check for this case and prompt the user to specify the file extension they want rather than use the default.
I propose that the error check we add is that at least one file exists with the specified extension.
I propose that the error check we add is that at least one file exists with the specified extension.
Makes sense to me
I also ran into this issue. The problem seems to be even if only a single file is provided, we still try to match by extension.
If my use case always uses a single csv file for each table, would a reasonable workaround be setting file extension to an empty string in CsvReadOptions
? Are there any problems?
ctx.register_csv(
name,
path,
CsvReadOptions::new().file_extension("")
).await?
Maybe file_extension
could be made an Option<String>
, defaulting to None
?
I tried this usecase with the fix in https://github.com/apache/arrow-datafusion/issues/6147 from @aprimadi and it works great :
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ head /tmp/test/customer
1|Customer#000000001|IVhzIApeRb ot,c,E|15|25-989-741-2988|711.56|BUILDING|to the even, regular platelets. regular, ironic epitaphs nag e|
2|Customer#000000002|XSTf4,NCwDVaWNe6tEgvwfmRchLXak|13|23-768-687-3665|121.65|AUTOMOBILE|l accounts. blithely ironic theodolites integrate boldly: caref|
3|Customer#000000003|MG9kdTD2WBHm|1|11-719-748-3364|7498.12|AUTOMOBILE| deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov|
4|Customer#000000004|XxVSJsLAGtn|4|14-128-190-5944|2866.83|MACHINERY| requests. final, regular ideas sleep final accou|
5|Customer#000000005|KvpyuHCplrB84WgAiGV6sYpZq7Tj|3|13-750-942-6364|794.47|HOUSEHOLD|n accounts will have to unwind. foxes cajole accor|
6|Customer#000000006|sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn|20|30-114-968-4951|7638.57|AUTOMOBILE|tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious|
7|Customer#000000007|TcGe5gaZNgVePxU5kRrvXBfkasDTea|18|28-190-982-9759|9561.95|AUTOMOBILE|ainst the ironic, express theodolites. express, even pinto beans among the exp|
8|Customer#000000008|I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5|17|27-147-574-9335|6819.74|BUILDING|among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide|
9|Customer#000000009|xKiAFTjUsCuxfeleNqefumTrjS|8|18-338-906-3675|8324.07|FURNITURE|r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl|
10|Customer#000000010|6LrEaV6KR6PLVcgl2ArL Q3rqzLzcT1 v2|5|15-741-346-9870|2753.54|HOUSEHOLD|es regular deposits haggle. fur|
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion/datafusion-cli$ CARGO_TARGET_DIR=/Users/alamb/Software/target-df cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.29s
Running `/Users/alamb/Software/target-df/debug/datafusion-cli`
DataFusion CLI v23.0.0
❯ CREATE EXTERNAL TABLE customer STORED AS CSV DELIMITER '|' LOCATION '/tmp/test/customer';
0 rows in set. Query took 0.290 seconds.
❯ select * from customer limit 1;
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 711.56 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e | |
+----------+--------------------+-------------------+----------+-----------------+----------+----------+----------------------------------------------------------------+----------+
1 row in set. Query took 0.065 seconds.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I wrote a test approximating
This would result in "Invalid identifier" errors, effectively claiming the column didn't exist. I verified the file existed, had the correct columns, etc... I was very confused :laughing:
Eventually I tracked this down to the schema being inferred as empty if the extension is not ".parquet", this feels unexpected
Describe the solution you'd like
Either
register_parquet
should return an error if the extension is missing, orFileFormat::infer_schema
should be more agnostic to file extensions.