DillonHammill / DataEditR

An Interactive R Package for Viewing, Entering, Filtering and Editing Data
https://dillonhammill.github.io/DataEditR/

Data validation #14

Open higgi13425 opened 3 years ago

higgi13425 commented 3 years ago

It would be great if you could

  1. fix the data type for each variable before you begin entering data, or repair it later if it was guessed wrong
  2. set valid ranges for each variable, e.g. systolic blood pressure in healthy adults, 70-160. Out-of-range values would be challenged, with a prompt to change the range if the value is actually correct.
  3. set up allowed values for factor variables, with a drop-down that limits entry to only these.

Standard paid data entry expects errors in 1-2% of fields: for every 100 fields, 1-2 errors. It adds up, and it is a big deal. Lots of these could be prevented with fixed data types and value ranges. For race, it is common to get white, White, Caucasian, Black, black, African-American, African-american, aa, etc.

DillonHammill commented 3 years ago

@higgi13425, thanks for the feedback!

  1. rhandsontable does support column types, but unfortunately using them removes the ability to add/remove rows/columns. To get around this, DataEditR does not assign column types at the rhandsontable level but instead sorts them out afterwards using utils::type.convert(). This means that the class of a column depends on the data that is entered, i.e. if any non-numeric character is entered, the column will be converted to class character.

  2. As I mentioned, unfortunately rhandsontable does not and will never support slider inputs for cells. Newer versions of Handsontable come with licence restrictions, so rhandsontable uses a fixed, older version of Handsontable. Implementing this at the level of DataEditR is potentially possible through the col_options argument, where these limits could be supplied. This does, however, mean that the data will need to be checked with every edit, which may be very inefficient. I can certainly play around with it and see whether it is worth implementing or not.

  3. This feature is already implemented! If you have specific factor levels for a column, just pass them to col_options for that column and a dropdown menu will appear. See below:

    data_edit(iris,
              col_options = list(Species = c("setosa", "virginica", "versicolor")))

Now that I think about it, it may be worthwhile supporting this as well (we could grab the factor levels from the data directly):

data_edit(iris,
          col_options = list(Species = "dropdown"))
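
As a rough illustration of point 1 above, the sketch below shows how utils::type.convert() decides a column's class from its contents. The two input vectors are made up for this example, and it assumes the edited values arrive as text; it is not taken from DataEditR's internals.

```r
# Assumption: values typed into the grid arrive as character strings, and
# utils::type.convert() then guesses a class for each column from its contents.
all_numeric <- c("5.1", "4.9", "4.7")
one_typo    <- c("5.1", "4.9", "4,7")   # comma typed instead of a decimal point

cls_numeric <- class(utils::type.convert(all_numeric, as.is = TRUE))
cls_typo    <- class(utils::type.convert(one_typo, as.is = TRUE))

print(cls_numeric)  # "numeric"
print(cls_typo)     # "character": one stray entry changes the whole column
```

A single mistyped cell is enough to turn a numeric column into a character column, which is why fixing the type up front would catch this kind of error.
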

higgi13425 commented 3 years ago

The dropdowns, checkboxes and date selection are great for preventing data entry errors, but it would be awesome to somehow add limits to fields (min and max for weight/height/birthdate/systolicBP) so that an out-of-range entry (e.g. a blood pressure of 1400) is rejected. Error rates for data entry run 3-6% per cell. Proactive error prevention is really important for valuable data.

DillonHammill commented 3 years ago

@higgi13425, I like the idea of validating column entries.

I guess I could do something like this for numeric columns:

data_edit(mtcars,
          col_validate = list(vs = c(0, 1))) # make sure vs values are between 0 and 1

Similarly for character columns:

data_edit(iris,
          col_validate = list(Species = c("setosa", 
                                          "versicolor", 
                                          "virginica"))) # must match exactly

The main question is what do you expect to happen when the entered data does not match these requirements? Do we make the cell empty again?

Also I suspect that this would only be supported for columns that don't use checkboxes or dropdown menus.

Note to self: need to add NA as accepted entry for empty cells.
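
A minimal base R sketch of the check discussed above, under stated assumptions: `check_column()` and `apply_rules()` are hypothetical helpers written for this illustration (they are not part of DataEditR), a length-2 numeric rule is read as c(min, max), a character rule is read as the set of allowed values, and failing cells are simply emptied.

```r
# Hypothetical helper, not part of DataEditR: test one column against a rule.
# Numeric rule  -> c(min, max), inclusive. Character rule -> allowed values.
# NA (an empty cell) is always accepted.
check_column <- function(x, rule) {
  ok <- if (is.numeric(rule)) {
    x >= rule[1] & x <= rule[2]
  } else {
    x %in% rule
  }
  ok[is.na(x)] <- TRUE
  ok
}

# Hypothetical helper: empty every cell that fails its rule. This is only one
# possible answer to "what should happen when an entry does not match?".
apply_rules <- function(data, rules) {
  for (col in names(rules)) {
    bad <- !check_column(data[[col]], rules[[col]])
    data[[col]][bad] <- NA
  }
  data
}

# Made-up example data and rules.
bp <- data.frame(systolic = c(120, 1400, NA, 95),
                 race = c("White", "white", "Black", NA),
                 stringsAsFactors = FALSE)
rules <- list(systolic = c(70, 160), race = c("White", "Black"))

cleaned <- apply_rules(bp, rules)
print(cleaned)
```

Here the 1400 reading and the lowercase "white" are emptied while the existing NA cells pass through untouched. Note that a rule such as c(0, 1) is ambiguous under this scheme (a range, or two allowed values), so a real interface would probably need a way to say which is meant.
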

DillonHammill commented 3 years ago

Looking at my previous comments, I think it would be a better idea to extend this functionality to the col_options argument instead. The reason for this is that if levels are set for a character column then we should use dropdowns (the user can still type and the best match is displayed), but for numeric columns we just check whether the entered value is within range and remove it if it is not.

data_edit(iris,
          col_options = list(Species = c("setosa", "versicolor", "virginica"),   # dropdown
                             Sepal.Length = c(0, 10)))                           # range

The challenge will be in checking the edited data, particularly since the entire dataset is returned with each edit. I will need to look at the internals of rhandsontable to see if I can get information about specific edits and check those against the supplied range.
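
One way to narrow the check when only the full dataset is available is to compare it with the previous copy and validate just the cells that differ. The sketch below is an assumption-laden illustration in base R: `changed_cells()` is a hypothetical helper, not something rhandsontable or DataEditR provides, and it assumes both data frames have the same shape.

```r
# Hypothetical helper: list the cells that differ between two data frames
# with identical dimensions and column order.
changed_cells <- function(old, new) {
  stopifnot(identical(dim(old), dim(new)))
  hits <- lapply(seq_along(old), function(j) {
    a <- as.character(old[[j]])
    b <- as.character(new[[j]])
    # Count NA -> value and value -> NA as changes; NA -> NA is unchanged.
    rows <- which(is.na(a) != is.na(b) | (!is.na(a) & !is.na(b) & a != b))
    data.frame(row = rows,
               col = rep(names(new)[j], length(rows)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

before <- head(iris, 3)
after  <- before
after$Sepal.Length[2] <- 14          # simulated edit, outside c(0, 10)

edits <- changed_cells(before, after)
print(edits)

# Only the edited cell needs to be checked against the supplied range.
value    <- after[edits$row, edits$col]
in_range <- value >= 0 & value <= 10
print(in_range)
```

This still scans the whole table to find the edit, so it trades a full validation pass for a full comparison pass; getting the edited cell directly from the widget would avoid both.
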

I will get to this eventually, but it is unlikely that I will have time to do this in the next couple of months.

DillonHammill commented 3 years ago

Adding a note to take a look at the pointblank package when I get time to address this request.