Closed zelima closed 6 years ago
@zelima it seems like the problem is in difference between JS and Python when "string" to "number" coercion happens. You can see that on the showcase page error says:
Failed to cast row: Field "number" can't cast value "1.0" for type "integer" with format "default"
however, the value is 1
but it is being converted into 1.0
. Can you please take a look how it works in the backend?
Ok so diving deeply into this I've got the following:
Excel treats all numbers as floats. In general, it doesn't care whether your_number % 1 == 0.0 is true or not. https://stackoverflow.com/a/8826770/6056362
Also, similar problem with openpyxl used by tabulator to parse .xlsx
data
iter_worksheet casts all numeric values as floats. https://bitbucket.org/openpyxl/openpyxl/issues/131/incorect-reading-integer-in-use_iterators#comment-10399143
Meaning that any number coming from Excel is float
and python parses it as float
, So when stream_remote_resources
complains can't cast value "1.0" for type "integer"
it actually complains correctly.
On the other hand we have XLSX
library from NPM used in CLI to infer a schema for later use. XLSX
library treats number got from excel as JS number (Number(n)
) https://www.npmjs.com/package/xlsx#data-types
Playing around a bit with JS number I've got the following:
let a = Number(1)
let b = Number(1.0)
console.log('Number(1)='+a, '|', 'Number(1.0)='+b)
>>> Number(1)=1 | Number(1.0)=1
console.log(a === b)
>>> true
console.log(a == b)
>>> true
In any case Number(n)
returns "number". Meaning when data.js reads Excel workbook used in datahub-client it gets list of JS numbers. That list of rows (and not path to the file or file buffer) is later used to infer schema by tableschema-js, Leaving us with type: number
if there is not at least one actual float Eg 1.1
So Think if even the excel file would like this
one,two,three
1.0,2.0,3.0
4.0,5.0,6.0
...
We would infer the same exact schema. (Not sure how CSV parser works when it detects correctly)
As a solution, I suggest forcing type integer
to number
if data is xls(x) somewhere here https://github.com/datahq/datahub-client/blob/master/lib/utils/datahub.js#L436
@anuveyatsu @akariv what do you think?
@zelima it sounds OK to me, I couldn't think any other solution for now.
Tested on testing.datahub.io
ERROR :Failed to cast row: Field "number" can't cast value "1.0" for type "integer" with format "default"
@AcckiyGerman It has not fixed yet. I will make a PR on datahub-client.
should be fixe in data v0.8.5
@zelima I think this does not work :(
user@pc:~/work/frictionlessdata/test-data/files/excel$ data push sample-1-sheet.xls --name excel-test --api http://api-testing.datahub.io
Completed: datapackage.json
Uploading [******************************] 100% (0.0s left) datapackage.json🙌 your data is published!
🔗 https://datahub.io/AcckiyGerman/excel-test/v/1 (copied to clipboard)
user@pc:~/work/frictionlessdata/test-data/files/excel$ data -v
0.8.6
May be the test excel file should have true
and false
instead of 1
and 0
?
The full list of error messages:
- The value 1 in row 2 and column 3 is not type boolean and format default
- The value 0 in row 3 and column 3 is not type boolean and format default
Again the inferred schema is not valid for tableschema-py
hmmm
@AcckiyGerman hm... One more headache... It seems xlrd module (used in tabulator to stream xls https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/xls.py) parses booleans from excel as 1 and 0. https://stackoverflow.com/a/22245269/6056362
Did some more analysis and suggested fix here, let's wait for response https://github.com/frictionlessdata/tabulator-py/issues/230#issuecomment-373613844
This should now be fixed
TESTED & FIXED:
(only one small typo - integer is treated as a number) but it is ok I think
@AcckiyGerman that is expected see this comment https://github.com/datahq/datahub-qa/issues/119#issuecomment-371138172
(only one small typo - integer is treated as a number) but it is ok I think
Tried to push simple excel file, got the validation error on showcase page
How to reproduce
Expected behaviour
Publishes without any errors ...