Closed artgoldberg closed 6 years ago
I think the desired behavior is to interpret an empty cell as NaN, e.g. Empty Excel value --> python float('nan')
This is what obj_model has been doing. I added another test to confirm this.
Can we close this issue?
I see the argument for that -- an empty cell certainly isn't a number. But the Excel and OO Calc default treats blank numeric cells as 0. I've tested this with sum(), +, and *.
Let me look for another way to express 'None' in Excel
I think we need to be able to represent NaN/None in Excel. For example, we need NaN/None to differentiate between an observed 0 value and a value that was not observed.
Thus far, obj_model has been using an empty Excel value to mean NaN/None. For the most part, this seems to be the meaning of an empty cell. For example, openpyxl converts all empty cells into None. However, in some contexts, such as a sum, Excel interprets an empty cell as 0.
I see two options to resolve the ambiguity:
1. Use #N/A to explicitly represent Python None. This would mean replacing all empty cells that should be interpreted as Python floats with #N/A. The disadvantage of this is that #N/A really means "Argument or function not available".
2. Use __None__ to represent Python None. This would mean replacing all empty cells that should be interpreted as Python floats with __None__. The disadvantage of this is that it's cumbersome.
I prefer sticking with the current approach of empty-->None, but I'm open to these other approaches. Thoughts?
Jonathan
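To make the distinction concrete, here is a minimal Python sketch (standard library only, not obj_model code) of how the two readings of an empty cell change a simple aggregate. The cell values and helper names are made up for illustration; None stands in for an empty cell, as openpyxl would report it.

```python
import math

# Hypothetical column of cell values; None stands in for an empty cell.
cells = [2.0, None, 0.0, 4.0]


def as_missing(value):
    """Read an empty cell as missing data (NaN)."""
    return math.nan if value is None else float(value)


def as_zero(value):
    """Read an empty cell as 0, the way a spreadsheet sum() does."""
    return 0.0 if value is None else float(value)


def mean_of_observed(values):
    """Average only the values that were actually observed (non-NaN)."""
    observed = [v for v in values if not math.isnan(v)]
    return sum(observed) / len(observed)


missing_view = [as_missing(c) for c in cells]
zero_view = [as_zero(c) for c in cells]

mean_missing = mean_of_observed(missing_view)  # averages 2, 0 and 4
mean_zero = mean_of_observed(zero_view)        # averages 2, 0, 0 and 4
```

Under the "missing" reading the observed 0 still counts but the blank does not, so the two means differ. That difference is the information lost when blanks are silently read as 0.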
I see the importance in our work of being able to distinguish between missing data and 0. Virtually all uses of float fields in wc_lang need these semantics.
On the other hand, spreadsheet semantics consider an empty cell to be 0 for many operations, including sum(), unary and binary math, and binary comparisons.
Given these two competing interpretations, any solution we choose should interpret empty cells as missing data by default, but give users an option to automatically interpret them as 0. I'm not comfortable with your options, but see a few alternatives beyond them. I work my way up the stack:
1. Support the alternative interpretations in wc_util.workbook.io or obj_model.io: not feasible, because the code doesn't know a field's obj_model type.
2. Support them in obj_model.io: possible, but awkward; an interpret_blank_numerics_as_zero option to Reader() could be passed to read_sheet() as a keyword and then to attr.deserialize(attr_value) (line 483), which would implement the semantics. This would require adding a kwargs argument to all appropriate subclasses of Attribute(), which wouldn't be bad. And it would provide coarse and complete control over the interpretation of all numeric blanks in a spreadsheet.
3. Support them in obj_model.core: we could add an option to FloatAttribute (or NumericAttribute) called interpret_blank_as_zero which would control FloatAttribute.deserialize().
One wrinkle is that an interpret_blank_as_zero option interferes with reproducible round-tripping of data from spreadsheets to Python objects and back. I wouldn't worry about this.
I prefer 3, which provides fine-grained control, is easy to implement, and might be useful to us and other users of obj_model some day. But I don't think this is very important either way.
What do you think? Arthur
The first two options don't work well because lower down in the stack there's no information about what type is expected. I'll implement the last option. I'll add an attribute to the obj_model.core.Attribute class to control the interpretation of empty Excel values. This seems to be a similar approach to Excel: the lowest level treats empty as Null, but higher-level formulas such as sum take a different interpretation.
I haven't been that concerned about this because formulas such as sum seem to be the only things that interpret empty cells as 0, and obj_model doesn't support formulas (which is now enforced by the update I made yesterday per one of the other issues).
Jonathan
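For reference, the third option from Arthur's list (a per-attribute interpret_blank_as_zero flag that controls deserialize()) could look roughly like the sketch below. The class is a hypothetical stand-in written for this thread, not the real obj_model.core.FloatAttribute, and the serialize() method is an assumption added only to show the round-tripping wrinkle.

```python
import math


class FloatAttribute:
    """Hypothetical stand-in for a float attribute; not the obj_model class."""

    def __init__(self, interpret_blank_as_zero=False):
        # Option name taken from the discussion; default keeps blank == missing.
        self.interpret_blank_as_zero = interpret_blank_as_zero

    def deserialize(self, value):
        """Convert a raw cell value (None or '' when blank) to a float."""
        if value is None or value == '':
            return 0.0 if self.interpret_blank_as_zero else math.nan
        return float(value)

    def serialize(self, value):
        """Convert a float back to a cell value; NaN is written as a blank."""
        return None if math.isnan(value) else value


default_attr = FloatAttribute()
zero_attr = FloatAttribute(interpret_blank_as_zero=True)

# Default: a blank survives the round trip as a blank.
round_trip_default = default_attr.serialize(default_attr.deserialize(None))
# With the option set, the blank comes back as 0.0, so the round trip is lossy.
round_trip_zero = zero_attr.serialize(zero_attr.deserialize(None))
```

The second round trip is the wrinkle mentioned above: once a blank has been read as 0.0, nothing in the Python object records that the cell was originally empty.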
I added an attribute default_cleaned_value to allow you to control how each attribute interprets empty cells. I have set this so that obj_model behaves as before, where empty is converted to None, except NaN for float attributes and False for Boolean attributes.
If you're happy with this solution, let's close the issue.
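A rough sketch of the behavior described above, for anyone reading along. Only the name default_cleaned_value and the three defaults (None, NaN for floats, False for Booleans) come from this thread; the class layout, the clean() and convert() methods, and the constructor override are assumptions and may not match the actual obj_model code.

```python
import math

_UNSET = object()  # sentinel so that None can be passed as an explicit override


class Attribute:
    """Hypothetical base attribute: an empty cell is cleaned to None."""
    default_cleaned_value = None

    def __init__(self, default_cleaned_value=_UNSET):
        # Assumed per-instance override of the class-level default.
        if default_cleaned_value is not _UNSET:
            self.default_cleaned_value = default_cleaned_value

    def convert(self, value):
        """Convert a non-empty cell value; the base class passes it through."""
        return value

    def clean(self, value):
        """Return the default for an empty cell, else the converted value."""
        if value is None or value == '':
            return self.default_cleaned_value
        return self.convert(value)


class FloatAttribute(Attribute):
    default_cleaned_value = math.nan  # empty float cell -> NaN

    def convert(self, value):
        return float(value)


class BooleanAttribute(Attribute):
    default_cleaned_value = False  # empty Boolean cell -> False

    def convert(self, value):
        return bool(value)


blank_generic = Attribute().clean(None)
blank_float = FloatAttribute().clean(None)
blank_bool = BooleanAttribute().clean(None)
# Opting in to the spreadsheet-style reading for one attribute only.
blank_as_zero = FloatAttribute(default_cleaned_value=0.0).clean(None)
```

Keeping the default on the attribute means the lower-level reader can keep returning None for every empty cell, and only the typed attribute decides what that means.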
Thanks! I'll check on it by tomorrow.
my file "Cash transactions 2017" is a counter-example