Open fjuniorr opened 4 months ago
Hi, thanks for the report and reproducible example.
In your example, I get indeed dialect.delimiter
to |
for the first test (with data.csv
), but not with the sample or with explicitely specified delimiter (I guess that was intended).
I am not very familiar with the frictionless codebase, so sorry if I'm off base, I am guessing a bit. My understanding is that when it is not explicitly set, frictionless makes a guess about the delimiter being |
(not the best, I agree), and infer
requires to do so, as without delimiter there are e.g. no possible stats about the number of fields. I observe the same delimiter if I resource.read_rows()
instead of infer
(with a TableResource
)
So do I understand correctly that the issue is not that it changes any behaviour, but that it explicitely sets the delimiter in the resource? If yes, why is it an issue?
I did not notice that resource.read_rows()
was also inferring the delimiter of data.csv
to be |
.
In general[^1] I would expect frictionless-py
to use ,
as delimiter for operations such as read_rows
because ,
is the default value per the data package standard. The wording in CSV Dialect is even more explicit:
delimiter
- specifies the character sequence which should separate fields (aka columns). Default =,
. Example\t
. If not present, consumers should assume that it’s,
.
For my particular use case the inference behaviour is problematic because I have several descriptors in which I do not explicitly set delimiter
when the proper value is ,
. Mostly it works just fine except when it breaks.
[^1]: The exception would be in functionality that is explicitly created for inference purposes, such as describe
and maybe infer
(which was my original question).
Ok thanks, I understand what bothers you, and I agree that using ,
as default instead of guessing the delimiter would probably a less error-prone behaviour, and more consistent with the documentation.
I also understand why you specifically asked about infer
. Any change on this (infer
or read_rows
) would however be a breaking change.
In any case I'll try to find the time to understand the exact behaviour of the delimiter detection.
Knowing that this is also the behaviour in read_rows
I agree that changing it is not a good idea.
Maybe in the future making the delimiter detection more robust (which I'm not sure is high priority TBH).
I did not find this information in the documentation, but by extrapolating this behavior:
I assumed that
resource.infer(stats=True)
would not changedialect.delimiter
(even if it was not explicitly set).However, this is not the case, and data like:
is causing
resource.infer(stats=True)
to setdialect.delimiter
to|
.There is a reproducible example here.