eclipse-epsilon / epsilon

Epsilon is a family of Java-based scripting languages for automating common model-based software engineering tasks, such as code generation, model-to-model transformation and model validation, that work out of the box with EMF (including Xtext and Sirius), UML (including Cameo/MagicDraw), Simulink, XML and other types of models.
https://eclipse.org/epsilon
Eclipse Public License 2.0
67 stars 10 forks source link

[Excel] Numeric cell value takes precedence over configured column datatype #89

Closed agarciadom closed 7 months ago

agarciadom commented 7 months ago

Kostas and I were today puzzling over some odd behaviour while using the Excel driver. We noticed that a column with numeric IDs (e.g. 123456) was being parsed by the driver as having floating-point numbers, even though we had indicated in the config.xml that its datatype was string. We had to instead use integer as its datatype, so we'd get all digits properly and we could do appropriate ID matching.

Looking at the code in ExcelRow, it seems it checks the type detected by POI first, and then it checks the declared datatype:

https://github.com/eclipse/epsilon/blob/428c241941979819d7fe3d133d387cd306fe2e4e/plugins/org.eclipse.epsilon.emc.spreadsheets.excel/src/org/eclipse/epsilon/emc/spreadsheets/excel/ExcelRow.java#L38-L58

This means that even if someone says that a column should be treated as text, we'll still treat it as a numeric if POI considers it to be a numeric value. Is this the expected behaviour? I think it may confuse users if we do not honour their declared preference. On the other hand, I'm aware that by flipping the order of the checks (first declared datatype then POI type) we may be breaking some scripts, as the default datatype is actually string.

Any thoughts?

kolovos commented 7 months ago

I agree - the declared type should trump the auto-detected one.

agarciadom commented 7 months ago

I've pushed a fix for this. Initially I tried changing the logic to always use the declared type, but that did not handle many-valued columns well, so instead I've made it so cells with NUMERIC values take into account properly the situation where the column is declared as having a string type.

For the sake of completeness, I also handle the case where the cell has a NUMERIC value but the declared datatype is boolean: in this case, zero values are treated as false, and non-zero values are treated as true.

I had to change the expected values for two tests here, based on the updated behaviour. I also added a few tests which all try to load the same minimal spreadsheet, but with a different declared datatype.

I'll update the documentation now.