Closed mcarans closed 4 years ago
I'm making this more general. datapackage format recognition needs fixing for other URLs, e.g. http://proxy.hxlstandard.org/data.csv?url=http%3A//popstats.unhcr.org/en/persons_of_concern.hxl&filter01=select&select-query01-01=%23country%2Bresidence=British%20Virgin%20Islands
It should be able to find the "csv" in the above and in the Google spreadsheet cases.
In resource.py line 468 it does this:
inspection['format'] = os.path.splitext(filename)[1][1:]
which will only handle simple cases. It could, for example, run `os.path.splitext(filename)[1][1:]`, check whether the extension is 3 characters and, if it is not, also try looking for "=csv" and ".csv" (and likewise for the other tabular extensions `['csv', 'tsv', 'xls', 'xlsx']`) in the URL. If that fails, it would go back to using `os.path.splitext(filename)[1][1:]`.
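A minimal sketch of the fallback proposed above, not the code in resource.py. The helper name `guess_format` and the regular expression are assumptions made for illustration, and it checks membership in the tabular list rather than the "3 characters" length test:

```python
import os
import re

TABULAR_FORMATS = ['csv', 'tsv', 'xls', 'xlsx']


def guess_format(source):
    """Guess a format from a path or URL (hypothetical helper)."""
    # Step 1: the current behaviour, a plain extension lookup.
    ext = os.path.splitext(source)[1][1:]
    if ext.lower() in TABULAR_FORMATS:
        return ext.lower()
    # Step 2: look for "=csv" or ".csv" (and the other tabular
    # extensions) anywhere in the source. The lookahead stops "xls"
    # from matching inside "xlsx".
    lowered = source.lower()
    for fmt in TABULAR_FORMATS:
        if re.search(r'[=.]' + fmt + r'(?![a-z0-9])', lowered):
            return fmt
    # Step 3: nothing tabular found, keep the original result.
    return ext


url = ('http://proxy.hxlstandard.org/data.csv?url=http%3A//popstats.unhcr.org'
       '/en/persons_of_concern.hxl&filter01=select&select-query01-01='
       '%23country%2Bresidence=British%20Virgin%20Islands')
print(guess_format(url))  # csv
```

Non-tabular sources keep whatever `os.path.splitext` returned, so existing behaviour for simple file names is unchanged.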
@mcarans
I think we should mark it as WONTFIX for now. I suppose there are countless cases where guessing is non-trivial or impossible. That's the reason why the Data Resource specification has the `resource.format` property. After #189 this property should work as expected.
Yes, `tabulator` does a little more guessing on format, but that's the purpose of `tabulator`: to be an army knife for tabular data. On the other hand, at the `datapackage` level we encourage users to provide metadata like `resource.format` so as not to rely on guessing too much.
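To illustrate the point about explicit metadata, here is a small sketch in plain Python. The descriptor is an ordinary dict with a placeholder URL, and `resolve_format` is a hypothetical helper, not part of datapackage-py:

```python
import os

# A resource descriptor that declares its format explicitly.
# The URL is a placeholder chosen for this example.
descriptor = {
    'name': 'persons-of-concern',
    'path': 'http://example.com/export?id=123',
    'format': 'csv',
}


def resolve_format(descriptor):
    """Prefer the declared format; only guess when it is missing."""
    declared = descriptor.get('format')
    if declared:
        return declared.lower()
    # Naive guess from the path, as a last resort.
    return os.path.splitext(descriptor.get('path', ''))[1][1:]


print(resolve_format(descriptor))  # csv
```

With `format` declared, the shape of the URL no longer matters.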
WDYT?
https://github.com/frictionlessdata/datapackage-py/pull/189 will fix the issue I am having, but I think the format guessing part of Tabulator should be shared with datapackage rather than having two implementations that produce different results. Otherwise it is confusing (particularly as Tabulator is a dependency).
@mcarans
I agree. I've turned this issue into an enhancement request based on writing a `tabulator.infer` function.
FIXED in Frictionless Framework
Overview
For now `resource.infer` does just basic format inferring (see the `resource.py._inspect_source` function). But we'd like to expand this behaviour by re-using a `tabulator.infer` function (which we should write first).
Plan
- write the `tabulator.infer(source) -> {tabular, scheme, format}` function
- deprecate `tabulator.validate` in favor of this function
- base the `Resource` class format infer logic on this function
From @mcarans
https://docs.google.com/spreadsheets/d/1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/export?format=csv&id=1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc&gid=215078761 does not get recognised as a csv and hence inferring does not pull out the schema.
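The planned `tabulator.infer(source) -> {tabular, scheme, format}` could look roughly like the sketch below. This is an assumption about its shape based only on the signature in the plan, not Tabulator's actual implementation. The `'file'` default scheme and the query-string lookup are choices made for this example:

```python
import os
from urllib.parse import urlparse, parse_qsl

TABULAR_FORMATS = ('csv', 'tsv', 'xls', 'xlsx')


def infer(source):
    """Return {tabular, scheme, format} for a path or URL (sketch)."""
    parsed = urlparse(source)
    # Local paths have no scheme; one-letter schemes are treated as
    # Windows drive letters. Both fall back to 'file'.
    scheme = parsed.scheme.lower() if len(parsed.scheme) > 1 else 'file'
    # Use the path component only, so the query string cannot leak
    # into the extension.
    fmt = os.path.splitext(parsed.path)[1][1:].lower()
    if fmt not in TABULAR_FORMATS:
        # Look for query values such as "format=csv".
        for _key, value in parse_qsl(parsed.query):
            if value.lower() in TABULAR_FORMATS:
                fmt = value.lower()
                break
    return {'tabular': fmt in TABULAR_FORMATS, 'scheme': scheme, 'format': fmt}


google_url = ('https://docs.google.com/spreadsheets/d/'
              '1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc/export'
              '?format=csv&id=1paoIpHiYo7dy_dnf_luUSfowWDwNAWwS3z4GHL2J7Rc'
              '&gid=215078761')
print(infer(google_url))
# {'tabular': True, 'scheme': 'https', 'format': 'csv'}
```

If both `datapackage` and `tabulator` called one function like this, the Google spreadsheet export URL would get the same answer in both places.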