akvo / akvo-flow

A data collection and monitoring tool that works anywhere.
http://akvo.org/products/akvoflow/
GNU Affero General Public License v3.0
65 stars 31 forks source link

Spreadsheet Import converts data to numbers regardless of the type of data expected by the associated question #923

Closed muloem closed 8 years ago

muloem commented 9 years ago

When importing data in a spreadsheet into the dashboard, cells containing only numbers are imported as numeric data regardless of the expected type of the of responses as defined by the question.

How to reproduce:

  1. create survey containing a free text question
  2. Export an empty raw data report of the survey
  3. enter a mix of numbers some 9+ digits long in the column for the free text question
  4. import the survey to the dashboard
  5. check the values of the imported data

Expected results: strings containing numbers formatted in scientific notation for those with longer digits.

The correct behaviour should be that the type of the data imported is based on the type of data expected by the question associated with that column in the spreadsheet.

muloem commented 8 years ago

@stellanl can you investigate to confirm whether or not this is still an issue and if yes please make appropriate fixes?

stellanl commented 8 years ago

There are three things to consider:

  1. What happens in the export. Answers for Number questions are explicitly stored as numbers in the cells, everything else is stored in string cells. No style attribute is set for the cells, so there is no explicit data formatting.
  2. What happens in the spreadsheet. In Excel/LibreOffice Calc, when you type something into a cell with no explicit formatting, you get automatic typing - a number is stored as a number (in Calc shown in scientific formatting IF the column is too narrow for all the digits). Storing as a number means that there is not unlimited precision; it may be rounded. If the cell has explicitly been formatted as text, it is stored (and shown) as text. Nothing is lost. Setting explicit/direct text formatting when we output a text field should prevent most mistakes, but the user might of course remove or change the cell formatting.
  3. What happens during import. A cell with a number is converted to text, and the rounding from the spreadsheet remains. The only way to improve data quality here is to fail imports if any cells for text questions contain numbers. This is doable, but will probably result in more complicated error messages.
stellanl commented 8 years ago

Proposed solution: set an explicit text format on all non-number answer cells during export. It should be tested by some real users in case their workflow somehow depends on the current state.

It would be trivial to add some formatting style to the number cells. For example, "Never scientific notation". Has anyone expressed a wish in that regard?

stellanl commented 8 years ago

As proposed by Josje, it would also be simple to indicate which fields must not be changed before an import, for example by by making them red-bordered or with pink background or in italics. You can also set a comment for a cell. This should be made as an option so it does not affect the raw data reports that are not used for data cleaning.

janagombitova commented 8 years ago

@stellanl That is a great idea, I did not know it was possible to hard code colour into Flow. We should create a separate issue for it as it is out of the scope of this one. And before we invest your time into making this fix, I think we should know whether users still edit the columns they should not edit, thus still have too many confusions with data cleaning. Let's consult with @Geerts

@Geerts I noticed that the 'rules of data cleaning' are not explicitly stated in Support - Can you add them under the Data cleaning article? Which rows you can not change? What question types you cannot edit?

stellanl commented 8 years ago

Test plan:

  1. Export any form with at least one question that is not Number.
  2. Open in Excel or other spreadsheet application.
  3. Select any cell in an answer column for a non-number question.
  4. Type 20 digits and Enter.
  5. Cell should show all 20 digits, overlaying the next cell (if necessary).
  6. Export any form with at least one question that is Number.
  7. Open in Excel or other spreadsheet application.
  8. Select any cell in an answer column for a number question.
  9. Type 10 or more digits and Enter.
  10. Narrow the column so that slightly less then all digits fit.
  11. Cell should show all ####, not the number in scientific notation.
rumca commented 8 years ago

test plan passes as described

@janagombitova is it an issue that the new format makes integers appear with a decimal point (and no 0 after it)?

screen shot 2016-05-12 at 13 31 34

janagombitova commented 8 years ago

@rumca does this happen when a user has a number question and submits the answer without decimals, signs? So, in your case the user typed in "2", "3".

If yes, then this is an issue. But I feel I am missing the full picture here

rumca commented 8 years ago

@janagombitova correct - I believe the new format assumes that all numerical responses will be decimals so always appends a '.' even if the response is '1'. Note that this is purely a visual thing in Excel though - the actual response doesn't have a decimal point.

@stellanl could you confirm?

stellanl commented 8 years ago

Is that in Excel? I do not see that in LibreOffice Calc.

rumca commented 8 years ago

yes excel

EDIT - I see the following as the custom format for the cells in excel https://github.com/akvo/akvo-flow/blob/release/1.9.7/GAE/src/org/waterforpeople/mapping/dataexport/GraphicalSurveySummaryExporter.java#L338

screen shot 2016-05-12 at 14 15 34

stellanl commented 8 years ago

Then I don't think there is a fix. There is only one way to specify "number, not in scientific format, show up to 3 decimals if necessary". Calc is smart enough to hide the decimal point/comma if all the decimals are 0. The underlying problem here is that our datatype Number is very broad/fuzzy, and other software has its own ideas about how to "do the right thing".

rumca commented 8 years ago

display issue has now been corrected for excel reports

I'll close this one for now until we get the release out, but I guess we'll have to reopen and come up with a different approach for fixing?

stellanl commented 8 years ago

The original problem of this issue, text answers, is fixed. The added goal of improving number answers was rolled back and is likely unfixable. The extra goal of indicating, or even locking, fields that should be untouched for import is totally feasible but I think should be a separate issue.

janagombitova commented 8 years ago

@stellanl thank you for explaining. Mulo and me have chatted about this issue and agree that this issue can be closed for now and whatever comes up regarding spread sheet imports will be tackled on an ad hoc basis.