frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

Infer resource format using tabulator.infer #185

Closed mcarans closed 4 years ago

mcarans commented 7 years ago

Overview

For now, resource.infer does only basic format inference (see the _inspect_source function in resource.py). We'd like to expand this behaviour by reusing a tabulator.infer function (which we need to write first).

Plan


From @mcarans

https://docs.google.com/spreadsheets/d/1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/export?format=csv&id=1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc&gid=215078761 does not get recognised as a csv and hence inferring does not pull out the schema.

mcarans commented 7 years ago

I'm making this more general. datapackage format recognition needs fixing for other URLs, e.g. http://proxy.hxlstandard.org/data.csv?url=http%3A//popstats.unhcr.org/en/persons_of_concern.hxl&filter01=select&select-query01-01=%23country%2Bresidence=British%20Virgin%20Islands

https://docs.google.com/spreadsheets/d/e/2PACX-1vQERb8RPefeENezNf79_ZVLIhaiS17CFEoeUKvmsXOwpEbPIcR-Egmip9bfSwBaMKqSofuZkrQXHfyE/pub?gid=366319992&single=true&output=csv

It should be able to find the "csv" in the above and in the Google spreadsheet cases.

mcarans commented 7 years ago

In resource.py line 468 it does this:

        inspection['format'] = os.path.splitext(filename)[1][1:]

which will only handle simple cases. It could, for example, do os.path.splitext(filename)[1][1:] and check whether the extension is 3 characters long. If it is not, it could also look for "=csv" and ".csv" in the URL (and likewise for the other tabular extensions: ['csv', 'tsv', 'xls', 'xlsx']). If that fails too, it could go back to using os.path.splitext(filename)[1][1:].
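The fallback described above can be sketched roughly as follows. This is only an illustration, not the code in resource.py: guess_format is a hypothetical helper, and it is a variant of the suggestion in that it parses the URL and checks against the list of known tabular formats instead of testing for a 3-character extension or searching for raw substrings.

```python
import os
from urllib.parse import parse_qsl, urlparse

# Tabular formats named in the comment above.
TABULAR_FORMATS = ('csv', 'tsv', 'xls', 'xlsx')


def guess_format(source):
    """Hypothetical helper: guess a format name from a path or URL."""
    parsed = urlparse(source)

    # Step 1: the existing approach, applied to the path component only,
    # so that a query string cannot hide or distort the extension.
    extension = os.path.splitext(parsed.path)[1][1:].lower()
    if extension in TABULAR_FORMATS:
        return extension

    # Step 2: look for a query value that names a known format,
    # e.g. "format=csv" or "output=csv".
    for _, value in parse_qsl(parsed.query):
        if value.lower() in TABULAR_FORMATS:
            return value.lower()

    # Step 3: fall back to whatever the plain extension was (possibly '').
    return extension
```

Under this sketch, the two Google Sheets URLs and the HXL proxy URL quoted earlier in the thread all resolve to csv, while a source with an unrecognised extension keeps that extension.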

roll commented 6 years ago

@mcarans I think we should mark it as WONTFIX for now. I suppose there are countless cases where guessing is non-trivial or impossible. That's why the Data Resource specification has the resource.format property. After #189 this property should work as expected.

Yes, tabulator does a little more guessing on format, but that's the purpose of tabulator: to be a Swiss Army knife for tabular data. On the other hand, at the datapackage level we encourage users to provide metadata like resource.format so they don't rely on guessing too much.
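For illustration, a resource descriptor that states its format explicitly might look like the sketch below. The name and the example.com URL are made up; name, path and format are properties from the Data Resource specification.

```python
import json

# Hypothetical descriptor: the URL has no file extension, so the
# format is declared explicitly instead of being guessed.
descriptor = {
    "name": "example-sheet",  # made-up resource name
    "path": "https://example.com/pub?gid=1&single=true&output=csv",
    "format": "csv",
}

print(json.dumps(descriptor, indent=2))
```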

WDYT?

mcarans commented 6 years ago

https://github.com/frictionlessdata/datapackage-py/pull/189 will fix the issue I am having. However, I think the format-guessing part of Tabulator should be shared with datapackage. Having two implementations that produce different results is confusing, particularly as Tabulator is a dependency.

roll commented 6 years ago

@mcarans I agree. I've turned this issue into an enhancement request based on writing a tabulator.infer function.

roll commented 4 years ago

FIXED in Frictionless Framework