datopian / datahub-qa

:package: Bugs, issues and suggestions for datahub.io
https://datahub.io/

DataHub tabular previews appear to re-cast column types incorrectly #241

Closed: zaneselvans closed this issue 5 years ago

zaneselvans commented 6 years ago

When pushing a tabular data resource to DataHub, fields that are specified as strings but contain numeric information appear to be re-cast as numbers by the process that generates the tabular previews. Sometimes a column containing numeric information needs to be treated as a string, for instance because it's a code with meaningful leading zeroes.

For an example, see the tabular previews and field definitions in my pudl-msha data package, and look at the following fields: PRIMARY_SIC_CD_1 and PRIMARY_SIC_CD_SFX.

Those two fields are supposed to be strings (that happen to contain only numbers) which, when concatenated, yield the value in the PRIMARY_SIC_CD field (which can be stored as a number, but maybe shouldn't be).

In the preview, these two fields have been interpreted as decimal numeric values. This means that instead of '00' for the suffix you get 0.0, instead of '02' you get 2.0, etc. See the attached screenshots.

Interestingly, the ASSESS_CTRL_NO field does not suffer this fate. I suspect this is because it contains some values that are impossible to represent as numbers. This suggests to me that there's an infer()-like operation taking place somewhere in the processing pipeline that doesn't respect the column datatype explicitly stated in the field definitions.

[Screenshots: numeric-string-field-defs, numeric-string-to-float-cast]

zelima commented 6 years ago

@zaneselvans Thanks for submitting the issue. It seems like the whole dataset has been changed. We will investigate and get back to you soon.

rufuspollock commented 5 years ago

@zelima what happened here?

zelima commented 5 years ago

My apologies for such a late response; somehow this slipped out of my sight.

I can't find the analysis, but datapackage-py, which processes the data, is both inferring and casting those kinds of values as integers:

$ cat data.csv
test1
0001
5
112
000

>>> from datapackage import Package
>>> package = Package()
>>> package.infer('data.csv')
{'profile': 'tabular-data-package', 'resources': [{'path': 'data.csv', 'profile': 'tabular-data-resource', 'name': 'data', 'format': 'csv', 'mediatype': 'text/csv', 'encoding': 'utf-8', 'schema': {'fields': [{'name': 'test1', 'type': 'integer', 'format': 'default'}], 'missingValues': ['']}}]}
>>> package.get_resource('data').read(keyed=True)
[{'test1': 1}, {'test1': 5}, {'test1': 112}, {'test1': 0}]

I'm not sure whether this is unexpected behavior or not.
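For contrast, a minimal sketch (assuming the same datapackage-py API shown above) suggesting that an explicit string type in the resource schema should keep the leading zeros intact:

>>> from datapackage import Package
>>> descriptor = {
...     'resources': [{
...         'name': 'data',
...         'path': 'data.csv',
...         # Declare the column as a string instead of letting infer() guess
...         'schema': {'fields': [{'name': 'test1', 'type': 'string'}]},
...     }]
... }
>>> package = Package(descriptor)
>>> package.get_resource('data').read(keyed=True)
[{'test1': '0001'}, {'test1': '5'}, {'test1': '112'}, {'test1': '000'}]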

zelima commented 5 years ago

Also, before the data goes to datapackage-py and to the backend generally for processing, I think we are using the tableschema-js infer method from the CLI. I think this is the line responsible for creating the schema: https://github.com/datahq/datahub-client/blob/1d65873aa54279ab624eb7dbdbdb29bc15a9b875/lib/utils/datahub.js#L443

But again, I'm not sure whether this is wrong behavior or not.

zelima commented 5 years ago

The workaround here would be to package the data first, explicitly set the field type(s) to string in datapackage.json, and publish it as a data package (not just a single file).
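For illustration, the relevant part of such a datapackage.json might look like this (field names taken from the report above; the resource path is hypothetical and the rest of the descriptor is abbreviated):

{
  "name": "pudl-msha",
  "resources": [
    {
      "name": "data",
      "path": "data.csv",
      "schema": {
        "fields": [
          {"name": "PRIMARY_SIC_CD_1", "type": "string"},
          {"name": "PRIMARY_SIC_CD_SFX", "type": "string"}
        ]
      }
    }
  ]
}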

zaneselvans commented 5 years ago

The data package I prepared did explicitly specify that PRIMARY_SIC_CD_1 and PRIMARY_SIC_CD_SFX were string fields, and it seems that if the type is explicitly set by the package author, it should never be overridden, no?

zelima commented 5 years ago

@zaneselvans yes, they should not be overridden; that's exactly why I suggested explicitly setting the field type(s).

Can I take a look at the original data (or a small part of it)?


zaneselvans commented 5 years ago

Oh jeez, I'm sorry. Comparing a sample of the post-processed vs. pre-processed data on my end (which obviously I should have done before), I now see that it's my initial pd.read_csv() call that is type-naive and casts these number-like string columns to numbers (00 -> 0.0). After that point the floating-point number is presumably being stored as a "string", rather than the original string value.
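For reference, a minimal sketch of the fix on the pandas side (the column names are taken from the package above; the file name is hypothetical): passing dtype to pd.read_csv keeps the codes as strings so the leading zeros survive:

>>> import pandas as pd
>>> # Force the SIC code columns to be read as strings, not inferred as numbers
>>> df = pd.read_csv('msha.csv', dtype={'PRIMARY_SIC_CD_1': str, 'PRIMARY_SIC_CD_SFX': str})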

So this issue can probably just be closed.

zelima commented 5 years ago

Happy we resolved this :)

Closing as INVALID: as mentioned above, the data already contained 0.0 before it got to DataHub.