Open randerzander opened 2 years ago
Today, when the file format is not explicitly provided, it is inferred only by checking the file extension, using the following: https://github.com/dask-contrib/dask-sql/blob/c3ad6a9f6b01ce02127fde7501eaf322c8160f7e/dask_sql/input_utils/location.py#L41-L44
I believe some work can be done to improve the error message when the file path has no extension, and to encourage users to provide the format explicitly.
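To make the suggestion concrete, here is a minimal sketch of extension-based inference that fails with an actionable message. The helper name `infer_format_from_extension` and the wording of the error are hypothetical, not what dask-sql does today; the sketch only assumes the extension is taken with `os.path.splitext`.

```python
import os


def infer_format_from_extension(path: str) -> str:
    """Hypothetical helper: return the format implied by the file extension.

    Raises a ValueError that tells the user how to fix the problem when
    the path has no extension (for example, when it points at a directory).
    """
    _, extension = os.path.splitext(path)
    file_format = extension.lstrip(".").lower()
    if not file_format:
        raise ValueError(
            f"Could not infer a file format from '{path}' because it has no "
            "file extension. Please pass the format explicitly, "
            "for example format='parquet'."
        )
    return file_format
```

With this shape, a path such as `data/table.parquet` resolves to `parquet`, while a bare directory path raises an error that points the user at the `format` argument.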
We can also explore adding more checks around if the input is a directory and if so trying to infer the format from one of the files within that directory. That said, an initial search suggests that other frameworks usually also expect users to provide the file format during dataset creation.
Since we're talking about improving this feature, I might be missing where the logic to handle it would actually be applied, but it doesn't look like these checks include handling Dask's `_metadata` or Spark's `_SUCCESS` file if they're in the directory?
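As a rough sketch of what directory-based inference could look like while ignoring marker files such as `_metadata` and `_SUCCESS`: the function name `infer_format_from_directory` is hypothetical, and it only handles a local directory via `os.listdir`. Remote filesystems would need a different listing mechanism, which is not covered here.

```python
import os
from typing import Optional


def infer_format_from_directory(directory: str) -> Optional[str]:
    """Hypothetical helper: infer a format from the first data file found.

    Entries whose names start with "_" or "." are skipped, which covers
    marker files such as _metadata and _SUCCESS as well as hidden files.
    Returns None if no file with an extension is found.
    """
    for name in sorted(os.listdir(directory)):
        if name.startswith(("_", ".")):
            continue  # marker or hidden file, not a data file
        if not os.path.isfile(os.path.join(directory, name)):
            continue  # ignore nested directories in this sketch
        _, extension = os.path.splitext(name)
        if extension:
            return extension.lstrip(".").lower()
    return None
```

Returning `None` rather than raising leaves the caller free to fall back to an explicit error asking for the `format` argument.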
> We can also explore adding more checks around if the input is a directory and if so trying to infer the format from one of the files within that directory

This sounds like a nicer user experience than asking people to type `.../*.parquet`, which is about as much typing as including the `format = 'parquet'` arg.
Trace: