Closed: u8sand closed this issue 3 years ago
Hi @u8sand
If no format is explicitly passed, then there is no inferring; the file is parsed as CSV:
Tabulator, the underlying stream-processing library, would need to be passed `None` as the format in order to infer it:
So the bug is not that inference should be better, but that all files are parsed as CSV unless explicitly declared otherwise. You should be able to run a quick test to confirm that this is what is happening: set `format` to `None` and see whether Tabulator correctly detects TSV.
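Tabulator's real inference is more involved, but the idea of "only infer when no format is passed" can be illustrated with a small stdlib sketch (the helper name and delimiter mapping here are hypothetical, not Tabulator's actual API):

```python
from pathlib import Path

# Illustrative mapping from file extension to delimiter; Tabulator's
# own format registry is richer than this.
DELIMITERS = {'.csv': ',', '.tsv': '\t'}

def infer_delimiter(path, format=None):
    """Return the delimiter for an explicitly passed format, or
    infer it from the file extension when format is None."""
    if format is not None:
        return DELIMITERS['.' + format]
    return DELIMITERS.get(Path(path).suffix, ',')
```

Under this sketch, `infer_delimiter('table.tsv')` yields a tab, while `infer_delimiter('table.tsv', format='csv')` forces a comma, which mirrors the behavior described above.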
@pwalsh Ah I see, so `format` was missing all along. Thank you for this; sorry for the invalid issue.
It should be noted, with this new information, that specifying `format: null` also fixes it, which suggests you're right about passing `None` along to Tabulator. Perhaps that would be a more intuitive default.
The reason CSV is the default is that CSV is the required data format for a tabular data package according to the spec. Passing a format gives you a loophole around this.
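As a sketch of that loophole, a resource descriptor can declare the format explicitly (the `name` and file name here are illustrative, not from the thread):

```json
{
  "name": "example",
  "resources": [
    {
      "path": "table.tsv",
      "format": "tsv"
    }
  ]
}
```

Per the discussion above, `"format": null` reportedly also works, by letting Tabulator infer the format itself.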
Overview
Dialect inference should do better; in particular, the file extension and the first line are very good indications of the dialect. However, datapackage gets confused when using arrays in `tsv` files (presumably because of the quotes/commas).
Minimum "broken" example
pkg.json
table.tsv
test.py
Traceback
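The contents of the files above were elided here, but the failure mode can be sketched with the stdlib alone (the row below is an assumed reconstruction of a `table.tsv` line with an `array`-typed column, not the original file):

```python
import csv
import io

# Assumed row: an `id` column and an `array`-typed column
# serialized as JSON, separated by a tab.
row = '1\t["a", "b"]'

# Parsed with CSV defaults (comma delimiter), the comma inside the
# JSON array splits the cell, so the row no longer lines up with a
# two-column header.
as_csv = next(csv.reader(io.StringIO(row)))

# Parsed with a tab delimiter, the row stays intact.
as_tsv = next(csv.reader(io.StringIO(row), delimiter='\t'))

print(as_csv)
print(as_tsv)
```

This is consistent with the symptom above: the file parses cleanly as TSV, but an assumed comma dialect mangles exactly the rows that contain array values.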
Note that this can be corrected by adding a dialect to the resource (`#/resources/0`):
But since we're using `.tsv` and don't have commas in the header, it really seems like this should be inferred. Instead, a CSV is assumed and you get errors. The reported error makes this very hard to debug in code that worked perfectly fine before trying to use the `array` type.
Please preserve this line to notify @pwalsh (lead of this repository)
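For reference, the dialect workaround mentioned above can be sketched as a resource entry like the following (a sketch against the CSV Dialect spec; the surrounding fields are illustrative):

```json
{
  "path": "table.tsv",
  "dialect": {
    "delimiter": "\t"
  }
}
```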