frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License
710 stars 148 forks source link

`resource.infer(stats=True)` changes value of `dialect.delimiter` depending on resource data #1671

Open fjuniorr opened 4 months ago

fjuniorr commented 4 months ago

I did not find this information in the documentation, but by extrapolating this behavior:

Also, won't resource.infer() reset the whole schema of the resource?

No, it will not change existing properties except for the recalculation of resource.stats.

-- https://github.com/frictionlessdata/frictionless-py/issues/450#issuecomment-704747085

I assumed that resource.infer(stats=True) would not change dialect.delimiter (even if it was not explicitly set).

However, this is not the case, and data like:

pkey,year,code,desc
2008|1001,2008,1001,PROGRAMA LARES HABITACAO POPULAR
2008|1003,2008,1003,DIVULGACAO DE MINAS GERAIS COMO ESTADO DESCOMPLICADO

is causing resource.infer(stats=True) to set dialect.delimiter to |.

There is a reproducible example here.

pierrecamilleri commented 2 months ago

Hi, thanks for the report and reproducible example.

In your example, I get indeed dialect.delimiter to | for the first test (with data.csv), but not with the sample or with explicitely specified delimiter (I guess that was intended).

I am not very familiar with the frictionless codebase, so sorry if I'm off base, I am guessing a bit. My understanding is that when it is not explicitly set, frictionless makes a guess about the delimiter being | (not the best, I agree), and infer requires to do so, as without delimiter there are e.g. no possible stats about the number of fields. I observe the same delimiter if I resource.read_rows() instead of infer (with a TableResource)

So do I understand correctly that the issue is not that it changes any behaviour, but that it explicitely sets the delimiter in the resource? If yes, why is it an issue?

fjuniorr commented 2 months ago

I did not notice that resource.read_rows() was also inferring the delimiter of data.csv to be |.

In general[^1] I would expect frictionless-py to use , as delimiter for operations such as read_rows because , is the default value per the data package standard. The wording in CSV Dialect is even more explicit:

delimiter - specifies the character sequence which should separate fields (aka columns). Default = ,. Example \t. If not present, consumers should assume that it’s ,.

For my particular use case the inference behaviour is problematic because I have several descriptors in which I do not explicitly set delimiter when the proper value is ,. Mostly it works just fine except when it breaks.

[^1]: The exception would be in functionality that is explicitly created for inference purposes, such as describe and maybe infer (which was my original question).

pierrecamilleri commented 2 months ago

Ok thanks, I understand what bothers you, and I agree that using , as default instead of guessing the delimiter would probably a less error-prone behaviour, and more consistent with the documentation.

I also understand why you specifically asked about infer. Any change on this (infer or read_rows) would however be a breaking change.

In any case I'll try to find the time to understand the exact behaviour of the delimiter detection.

fjuniorr commented 2 months ago

Knowing that this is also the behaviour in read_rows I agree that changing it is not a good idea.

Maybe in the future making the delimiter detection more robust (which I'm not sure is high priority TBH).