datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License

Don't convert NoneType to string when cast/infer_strategy is string #116

Closed cschloer closed 4 years ago

cschloer commented 4 years ago

Currently, when cast_strategy and infer_strategy are set to string in the load flow, values of type NoneType are also converted to the string "None". This isn't a problem for CSVs, where empty values are represented by the empty string, but for Excel files with empty cells the openpyxl parser returns None. I added the condition 'and v is not None' to the stringer function in load so that None values stay None.
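
The change described above can be sketched roughly as follows. This is a minimal illustration and not the actual dataflows code: the helper name stringify_row and the sample row are hypothetical, and the real stringer function in load may be structured differently.

```python
def stringify_row(row):
    """Cast every non-string value to str, but leave None untouched.

    Hypothetical helper for illustration only; not part of dataflows.
    """
    return {
        key: str(value)
        if not isinstance(value, str) and value is not None
        else value
        for key, value in row.items()
    }


# Sample row resembling what an Excel parser might return for an empty cell.
row = {"Species": "E. lori", "Age": 0, "Size": 3, "Individual": None}
result = stringify_row(row)
print(result)
```

Without the extra None check, the empty cell would come out as the four-character string "None" instead of a real None.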

cschloer commented 4 years ago

Example file: cast_string_none.xlsx

coveralls commented 4 years ago

Pull Request Test Coverage Report for Build 402


Totals Coverage Status
Change from base Build 397: 0.0%
Covered Lines: 1649
Relevant Lines: 1949

cschloer commented 4 years ago

Added tests

akariv commented 4 years ago

Thanks @cschloer

After #115, I wonder if this would still be a problem (now released in dataflows@0.0.64)

That PR (#115) changes the behavior so that infer_strings also implies force_strings=True.

In tabulator, this causes None values to be interpreted as the empty string:

>>> import tabulator
>>> s=tabulator.Stream('https://github.com/datahq/dataflows/files/3851313/cast_string_none.xlsx', force_strings=True).open()
>>> list(iter(s))
[['Species', 'Age (days post hatch)', 'Size (mm total length or standard length)', 'Individual'],
  ['E. lori', '0', '3', '']]

So dataflows will never get these Nones.
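
To illustrate the effect seen in the transcript above, here is a small stand-in written in plain Python. It does not call tabulator: force_strings_row is a hypothetical helper, and the raw cell values (integers and None) are assumed, mirroring what an Excel parser might return for that row.

```python
def force_strings_row(row):
    """Mimic the observed force_strings=True result: None becomes '',
    and every other non-string value is passed through str().

    Hypothetical helper; an approximation of the behavior shown in the
    transcript, not tabulator's implementation.
    """
    converted = []
    for value in row:
        if value is None:
            converted.append("")
        elif isinstance(value, str):
            converted.append(value)
        else:
            converted.append(str(value))
    return converted


raw_row = ["E. lori", 0, 3, None]  # assumed raw cell values
forced = force_strings_row(raw_row)
print(forced)
```

Because the empty cell is already an empty string by the time the row leaves the stream, a None check further downstream never has anything to act on.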

On a side note, in these cases you could simply use 'cast=nothing' to keep the types as-is with a strings-only schema.

cschloer commented 4 years ago

This is indeed fixed in the most recent version :) Thanks @akariv @roll