Closed: zaneselvans closed this issue 5 years ago
@zaneselvans Thanks for submitting the issue. It seems like the whole dataset has been changed. We will investigate and get back to you soon.
@zelima what happened here?
My apologies for such a late response; somehow this disappeared from my sight.
I can't find the analysis, but datapackage-py, which processes the data, is both inferring and casting these kinds of values as integers:
cat data.csv
test1
0001
5
112
000
>>> from datapackage import Package
>>> package = Package()
>>> package.infer('data.csv')
{'profile': 'tabular-data-package', 'resources': [{'path': 'data.csv', 'profile': 'tabular-data-resource', 'name': 'data', 'format': 'csv', 'mediatype': 'text/csv', 'encoding': 'utf-8', 'schema': {'fields': [{'name': 'test1', 'type': 'integer', 'format': 'default'}], 'missingValues': ['']}}]}
>>> package.get_resource('data').read(keyed=True)
[{'test1': 1}, {'test1': 5}, {'test1': 112}, {'test1': 0}]
I'm not sure whether this is unexpected behavior or not.
Also, before this goes to datapackage-py and to the backend generally to process the data, I think we are using the tableschema-js infer method from the CLI. I think this is the line responsible for creating the schema: https://github.com/datahq/datahub-client/blob/1d65873aa54279ab624eb7dbdbdb29bc15a9b875/lib/utils/datahub.js#L443
But again, I'm not sure whether this is wrong behavior or not.
The workaround here would be to package the data first, explicitly set the field type(s) to string in datapackage.json, and publish it as a datapackage (not just a single file).
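For example, here is a minimal sketch of that workaround with datapackage-py, using the toy `data.csv` from above (these are the public `Package` API calls, but treat this as an illustration rather than the exact publishing pipeline):

```python
from datapackage import Package

# Infer a starting descriptor, then override the guessed type before publishing.
package = Package()
package.infer('data.csv')

# Force the numeric-looking column to stay a string so leading zeroes survive.
field = package.descriptor['resources'][0]['schema']['fields'][0]
field['type'] = 'string'
package.commit()  # re-validate the package against the modified descriptor

# Save the descriptor and publish the resulting datapackage, not the bare CSV.
package.save('datapackage.json')
```

After that, `package.get_resource('data').read(keyed=True)` should return `'0001'` and `'000'` as strings rather than `1` and `0`.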
The datapackage which I prepared did explicitly specify that `PRIMARY_SIC_CD_1` and `PRIMARY_SIC_CD_SFX` were string fields, and it seems like if the type is explicitly set by the package author, it probably should never be overridden, no?
@zaneselvans yes, they should not be overridden; that is exactly why I suggested explicitly setting the field type(s).
Can I take a look at the original data (or a small part of it)?
Afterwards, I published the data we've got in the rawstore and I'm getting 0.0 again. The data in the rawstore should be a clone of your data. I've checked a few lines of it and the values are already set to 0.0 there. So either something strange happens when uploading the data (before it gets processed), or the original data already looks like that, which I doubt. It would be great if I had a small chunk of the original data to debug with.
Oh jeez, I'm sorry. Pulling a sample of the post-processed vs. pre-processed data on my end (which obviously I should have done before), I now see that it's my initial `pd.read_csv()`, which is type-naive, that is casting these number-like string columns to numbers (`00` -> `0.0`), after which point I'm sure the floating point number is being stored as a "string" rather than the original string value.
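For reference, the fix on my end is just to tell pandas not to guess types for those columns. A minimal sketch (the file name is a placeholder for my actual input; forcing `dtype` in `pd.read_csv()` is the standard pandas mechanism):

```python
import pandas as pd

# Read the numeric-looking code columns as strings so '00' stays '00'
# instead of becoming 0.0. 'msha_assessments.csv' is a placeholder path.
df = pd.read_csv(
    'msha_assessments.csv',
    dtype={
        'PRIMARY_SIC_CD_1': str,
        'PRIMARY_SIC_CD_SFX': str,
        'ASSESS_CTRL_NO': str,
    },
)
```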
So this issue can probably just be closed.
Happy we resolved this :)
Closing as INVALID: as mentioned above, the data already contained 0.0 before it got to DataHub.
When pushing a tabular data resource to DataHub, fields which are specified to be strings, but which contain numeric information, appear to be getting re-cast as numbers in the process that generates the tabular previews. Sometimes a column which contains numeric information needs to be treated as a string because, for instance, it's a code whose leading zeroes are meaningful.
For an example, see the tabular previews and field definitions in my pudl-msha data package, and look at the following fields: `PRIMARY_SIC_CD_1` and `PRIMARY_SIC_CD_SFX`.
Those two fields are supposed to be strings (that happen to contain only numbers) which, when concatenated, yield the value in the `PRIMARY_SIC_CD` field (which can be stored as a number, but maybe shouldn't be). In the preview, these two fields have been interpreted as decimal numeric values. This means that instead of '00' for the suffix you get 0.0, and instead of '02' you get 2.0, etc. See the attached screenshots.
Interestingly, the `ASSESS_CTRL_NO` field does not suffer this fate. I suspect that this is because it contains some values which are impossible to represent as numbers. This suggests to me that there's an `infer()`-like operation taking place somewhere in the processing pipeline that doesn't respect the explicitly stated column datatype being passed in via the field definitions.
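To illustrate what I mean, here is a minimal sketch using `infer()` from tableschema-py on a made-up two-column CSV that mirrors these fields (the file and values are invented, and the actual pipeline may use tableschema-js instead, but the idea is the same): type inference only looks at the values, so it has no way to respect a schema it never sees.

```python
from tableschema import infer

# toy.csv is an invented stand-in with two columns mirroring the real data, e.g.:
#   PRIMARY_SIC_CD_SFX,ASSESS_CTRL_NO
#   00,000012345
#   02,AB-4417
descriptor = infer('toy.csv')
for field in descriptor['fields']:
    print(field['name'], field['type'])

# Expected: PRIMARY_SIC_CD_SFX comes back as 'integer' (so '00' would later render
# as 0 or 0.0), while ASSESS_CTRL_NO stays 'string' because 'AB-4417' can't be
# parsed as a number.
```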