frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License
191 stars 43 forks source link

Passed in format not used in format recognition #188

Closed mcarans closed 6 years ago

mcarans commented 7 years ago

The format is passed as csv in the below in add_resource but in resource.py line 468 it does this:

        inspection['format'] = os.path.splitext(filename)[1][1:]

ie. it ignores the passed in format and effectively overwrites it.

eg.

from datapackage import Package

package = Package({'title': "UNHCR's populations of concern residing in the British Virgin Islands", 'name': 'refugees-residing-vgb', 'id': '4d801156-3b5c-4867-97b8-da5a584ad0dc', 'description': "Data"})
package.add_resource({'encoding': 'utf-8', 'path': 'http://proxy.hxlstandard.org/data.csv?url=http%3A//popstats.unhcr.org/en/persons_of_concern.hxl&filter01=select&select-query01-01=%23country%2Bresidence=British%20Virgin%20Islands', 'title': 'UNHCR persons of concern residing in the British Virgin Islands.', 'format': 'csv', 'name': 'unhcr_persons_of_concern_residence_vgb.csv'})
package.infer()
roll commented 6 years ago

Released as datapackage@1.1.4

mcarans commented 6 years ago

@roll This is fixed, but stepping through the code is confusing as the inferring part is still called but ignored when the format is passed in. IMHO, it would be better to avoid calling the inferring of format altogether for clarity. Perhaps this is something you could look into as you introduce Tabulator's inferring?

roll commented 6 years ago

@mcarans It's intentionally for now. Concept is that there are two information sources:

It's independent things and descriptor's metadata have a precedence over source inspection.

mcarans commented 6 years ago

@roll If the format is passed in, is there any circumstance in which the source inspection format would be used?

roll commented 6 years ago

@mcarans No. Output from _inspect_source it's kind a fallback metadata (we get it on every resource build fully but just for connivence and because it's a cheap operation) and resource.descriptor always has precedence.