Ensure that blanks in spreadsheets in numerical columns convert to 0

artgoldberg commented 6 years ago

my file "Cash transactions 2017" is a counter-example

jonrkarr commented 6 years ago

I think the desired behavior is to interpret an empty cell as NaN, e.g. Empty Excel value --> python float('nan')

This is what obj_model has been doing. I added another test to confirm this.

Can we close this issue?

artgoldberg commented 6 years ago

I see the argument for that -- an empty cell certainly isn't a number. But the Excel and OO Calc default treats blank numeric cells as 0. I've tested this with sum(), +, and *.

jonrkarr commented 6 years ago

Let me look for another way to express 'None' in Excel

On Mar 12, 2018 9:30 AM, "artgoldberg" notifications@github.com wrote:

I see the argument for that -- an empty cell certainly isn't a number. But the Excel and OO Calc default treats blank numeric cells as 0. I've tested this with sum(), +, and *.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KarrLab/obj_model/issues/8#issuecomment-372309589, or mute the thread https://github.com/notifications/unsubscribe-auth/ACt2KVQl1BdCNutIMkOGJmRsWShau6z7ks5tdnhcgaJpZM4SlTrM .

jonrkarr commented 6 years ago

I think we need to be able to represent NaN/None in Excel. For example, we need NaN/None to differentiate between an observed 0 value and a value that was not observed.

Thus far, obj_model has been using an empty Excel value to mean NaN/None. For the most part, this seems to be the meaning of an empty cell. For example, openpyxl converts all empty cells into None.

However, in some contexts such as a sum Excel interpret empty is 0.

I see two options to resolve the ambiguity:

Use an error code as a #N/A to explicitly represent Python None. This would mean replacing all empty cells that should be interpreted as Python floats with #N/A. The disadvantage of this is that #N/A really means "Argument or function not available".
Definite a special value such as __None__ to represent Python None. This would mean replacing all empty cells that should be interpreted as Python floats with __None__. The disadvantage of this is that its cumbersome.

I prefer sticking with the current approach of empty-->None, but I'm open to these other approaches. Thoughts?

Jonathan

On Mon, Mar 12, 2018 at 9:53 AM, Jonathan Karr jonrkarr@gmail.com wrote:

Let me look for another way to express 'None' in Excel

On Mar 12, 2018 9:30 AM, "artgoldberg" notifications@github.com wrote:

I see the argument for that -- an empty cell certainly isn't a number. But the Excel and OO Calc default treats blank numeric cells as 0. I've tested this with sum(), +, and *.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KarrLab/obj_model/issues/8#issuecomment-372309589, or mute the thread https://github.com/notifications/unsubscribe-auth/ACt2KVQl1BdCNutIMkOGJmRsWShau6z7ks5tdnhcgaJpZM4SlTrM .

artgoldberg commented 6 years ago

I see the importance in our work of being able to distinguish between missing data and 0. Virtually all uses of float fields in wc_lang need this semantics.

On the other hand, spreadsheet semantics consider an empty cell to be 0 for many operations, including, sum(), unary and binary math, and binary comparisons.

Given these two competing interpretations any solution we choose should interpret empty cells as missing data by default, but give users an option to automatically interpret them as 0. I'm not comfortable with your options, but see a few alternatives beyond them. I work my way up the stack:

Support the alternative interpretations in wc_util.workbook.io or obj_model.io: not feasible, because the code doesn't know a field's obj_model type.
Support them in obj_model.io: possible, but awkward; a interpret_blank_numerics_as_zero option to Reader() could be passed to read_sheet() as a keyword and then to attr.deserialize(attr_value) (line 483) which would implement the semantics. This would require adding a kwargs argument to all appropriate subclasses of Attribute(), which wouldn't be bad. And it would provide coarse and complete control over the interpretation of all numeric blanks in a spreadsheet.
Support them in obj_model.core: we could add an option to FloatAttribute (or NumericAttribute) called interpret_blank_as_zero which would control FloatAttribute.deserialize().

One wrinkle is that an interpret_blank_as_zero option interferes with reproducible round-tripping data from spreadsheet to Python objects and back. I wouldn't worry about this.

I prefer 3, which provides fine-grain control, is easy to implement, and might be useful to us and other users of obj_model some day. But I don't think this is very important either way.

What do you think? Arthur

jonrkarr commented 6 years ago

The first two options don't work well because lower down in the stack there's no information about what type is expected. I'll implement the last option. I'll add an attribute to the obj_model.core.Attribute class to control the interpretation of empty Excel values. This seems to be a similar approach to Excel. The lowest level treats empty as Null, but the higher level formulas such as sum take a different interpretation.

I haven't been that concerned about this because formulas such as sum seem to be the only thing that interpret empty cells as 0, and obj_model doesn't support formulas (which is now enforced by the update I made yesterday per one of the other issues).

Jonathan

On Mon, Mar 12, 2018 at 9:55 PM artgoldberg notifications@github.com wrote:

I see the importance in our work of being able to distinguish between missing data and 0. Virtually all uses of float fields in wc_lang need this semantics.

On the other hand, spreadsheet semantics consider an empty cell to be 0 for many operations, including, sum(), unary and binary math, and binary comparisons.

Given these two competing interpretations any solution we choose should interpret empty cells as missing data by default, but give users an option to automatically interpret them as 0. I'm not comfortable with your options, but see a few alternatives beyond them. I work my way up the stack:

Support the alternative interpretations in wc_util.workbook.io or obj_model.io: not feasible, because the code doesn't know a field's obj_model type.

Support them in obj_model.io: possible, but awkward; a interpret_blank_numerics_as_zero option to Reader() could be passed to read_sheet() as a keyword and then to attr.deserialize(attr_value) (line 483) which would implement the semantics. This would require adding a kwargs argument to all appropriate subclasses of Attribute(), which wouldn't be bad. And it would provide coarse and complete control over the interpretation of all numeric blanks in a spreadsheet.

Support them in obj_model.core: we could add an option to FloatAttribute (or NumericAttribute) called interpret_blank_as_zero which would control FloatAttribute.deserialize().

One wrinkle is that an interpret_blank_as_zero option interferes with reproducible round-tripping data from spreadsheet to Python objects and back. I wouldn't worry about this.

I prefer 3, which provides fine-grain control, is easy to implement, and might be useful to us and other users of obj_model some day. But I don't think this is very important either way.

What do you think? Arthur

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KarrLab/obj_model/issues/8#issuecomment-372520194, or mute the thread https://github.com/notifications/unsubscribe-auth/ACt2KcXhSZPiijRYzHcAw1fiqmzC52Ywks5tdycFgaJpZM4SlTrM .

jonrkarr commented 6 years ago

I added an attribute default_cleaned_value to allow you to control how each attribute interprets empty cells. I have set this so that obj_model because as before where empty is convert to None, except NaN for float attributes and False for Boolean attributes.

If you're happy with this solution, let's close the issue.

On Mon, Mar 12, 2018 at 10:06 PM, Jonathan Karr jonrkarr@gmail.com wrote:

The first two options don't work well because lower down in the stack there's no information about what type is expected. I'll implement the last option. I'll add an attribute to the obj_model.core.Attribute class to control the interpretation of empty Excel values. This seems to be a similar approach to Excel. The lowest level treats empty as Null, but the higher level formulas such as sum take a different interpretation.

I haven't been that concerned about this because formulas such as sum seem to be the only thing that interpret empty cells as 0, and obj_model doesn't support formulas (which is now enforced by the update I made yesterday per one of the other issues).

Jonathan

On Mon, Mar 12, 2018 at 9:55 PM artgoldberg notifications@github.com wrote:

I see the importance in our work of being able to distinguish between missing data and 0. Virtually all uses of float fields in wc_lang need this semantics.

On the other hand, spreadsheet semantics consider an empty cell to be 0 for many operations, including, sum(), unary and binary math, and binary comparisons.

Given these two competing interpretations any solution we choose should interpret empty cells as missing data by default, but give users an option to automatically interpret them as 0. I'm not comfortable with your options, but see a few alternatives beyond them. I work my way up the stack:

Support the alternative interpretations in wc_util.workbook.io or obj_model.io: not feasible, because the code doesn't know a field's obj_model type.

Support them in obj_model.io: possible, but awkward; a interpret_blank_numerics_as_zero option to Reader() could be passed to read_sheet() as a keyword and then to attr.deserialize(attr_value) (line 483) which would implement the semantics. This would require adding a kwargs argument to all appropriate subclasses of Attribute(), which wouldn't be bad. And it would provide coarse and complete control over the interpretation of all numeric blanks in a spreadsheet.

Support them in obj_model.core: we could add an option to FloatAttribute (or NumericAttribute) called interpret_blank_as_zero which would control FloatAttribute.deserialize().

One wrinkle is that an interpret_blank_as_zero option interferes with reproducible round-tripping data from spreadsheet to Python objects and back. I wouldn't worry about this.

I prefer 3, which provides fine-grain control, is easy to implement, and might be useful to us and other users of obj_model some day. But I don't think this is very important either way.

What do you think? Arthur

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KarrLab/obj_model/issues/8#issuecomment-372520194, or mute the thread https://github.com/notifications/unsubscribe-auth/ACt2KcXhSZPiijRYzHcAw1fiqmzC52Ywks5tdycFgaJpZM4SlTrM .

artgoldberg commented 6 years ago

Thanks! I'll check on it by tomorrow.

KarrLab / obj_tables

Ensure that blanks in spreadsheets in numerical columns convert to 0 #8