datopian / datahub-qa

:package: Bugs, issues and suggestions for datahub.io
https://datahub.io/
32 stars 6 forks source link

Publishing excel files fails validation #119

Closed zelima closed 6 years ago

zelima commented 6 years ago

Tried to push simple excel file, got the validation error on showcase page

image

How to reproduce

Expected behaviour

Publishes without any errors ...

anuveyatsu commented 6 years ago

@zelima it seems like the problem is in difference between JS and Python when "string" to "number" coercion happens. You can see that on the showcase page error says:

Failed to cast row: Field "number" can't cast value "1.0" for type "integer" with format "default"

however, the value is 1 but it is being converted into 1.0. Can you please take a look how it works in the backend?

zelima commented 6 years ago

Ok so diving deeply into this I've got the following:

Excel treats all numbers as floats. In general, it doesn't care whether your_number % 1 == 0.0 is true or not. https://stackoverflow.com/a/8826770/6056362

Also, similar problem with openpyxl used by tabulator to parse .xlsx data

iter_worksheet casts all numeric values as floats. https://bitbucket.org/openpyxl/openpyxl/issues/131/incorect-reading-integer-in-use_iterators#comment-10399143

Meaning that any number coming from Excel is float and python parses it as float, So when stream_remote_resources complains can't cast value "1.0" for type "integer" it actually complains correctly.

On the other hand we have XLSX library from NPM used in CLI to infer a schema for later use. XLSX library treats number got from excel as JS number (Number(n)) https://www.npmjs.com/package/xlsx#data-types

Playing around a bit with JS number I've got the following:

let a = Number(1)
let b = Number(1.0)

console.log('Number(1)='+a, '|', 'Number(1.0)='+b)
>>> Number(1)=1 | Number(1.0)=1
console.log(a === b)
>>> true
console.log(a == b)
>>> true

In any case Number(n) returns "number". Meaning when data.js reads Excel workbook used in datahub-client it gets list of JS numbers. That list of rows (and not path to the file or file buffer) is later used to infer schema by tableschema-js, Leaving us with type: number if there is not at least one actual float Eg 1.1

So Think if even the excel file would like this

one,two,three
1.0,2.0,3.0
4.0,5.0,6.0
...

We would infer the same exact schema. (Not sure how CSV parser works when it detects correctly)

As a solution, I suggest forcing type integer to number if data is xls(x) somewhere here https://github.com/datahq/datahub-client/blob/master/lib/utils/datahub.js#L436

@anuveyatsu @akariv what do you think?

anuveyatsu commented 6 years ago

@zelima it sounds OK to me, I couldn't think any other solution for now.

AcckiyGerman commented 6 years ago

Tested on testing.datahub.io

ERROR :Failed to cast row: Field "number" can't cast value "1.0" for type "integer" with format "default"
Mikanebu commented 6 years ago

@AcckiyGerman It has not fixed yet. I will make a PR on datahub-client.

zelima commented 6 years ago

should be fixe in data v0.8.5

AcckiyGerman commented 6 years ago

@zelima I think this does not work :(

user@pc:~/work/frictionlessdata/test-data/files/excel$ data push sample-1-sheet.xls --name excel-test --api http://api-testing.datahub.io
Completed: datapackage.json
  Uploading [******************************] 100% (0.0s left)   datapackage.json🙌  your data is published!
🔗  https://datahub.io/AcckiyGerman/excel-test/v/1 (copied to clipboard)

user@pc:~/work/frictionlessdata/test-data/files/excel$ data -v
0.8.6

image

AcckiyGerman commented 6 years ago

May be the test excel file should have true and false instead of 1 and 0 ?

AcckiyGerman commented 6 years ago
The full list of error messages:
- The value 1 in row 2 and column 3 is not type boolean and format default
- The value 0 in row 3 and column 3 is not type boolean and format default

Again the inferred schema is not valid for tableschema-py hmmm

zelima commented 6 years ago

@AcckiyGerman hm... One more headache... It seems xlrd module (used in tabulator to stream xls https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/xls.py) parses booleans from excel as 1 and 0. https://stackoverflow.com/a/22245269/6056362

zelima commented 6 years ago

Did some more analysis and suggested fix here, let's wait for response https://github.com/frictionlessdata/tabulator-py/issues/230#issuecomment-373613844

zelima commented 6 years ago

This should now be fixed

AcckiyGerman commented 6 years ago

TESTED & FIXED:

(only one small typo - integer is treated as a number) but it is ok I think image

zelima commented 6 years ago

@AcckiyGerman that is expected see this comment https://github.com/datahq/datahub-qa/issues/119#issuecomment-371138172

(only one small typo - integer is treated as a number) but it is ok I think