DillonHammill / DataEditR

An Interactive R Package for Viewing, Entering Filtering and Editing Data
https://dillonhammill.github.io/DataEditR/
384 stars 40 forks source link

Feature Request: "Data Validation" functionality #37

Open palderman opened 2 years ago

palderman commented 2 years ago

Hi @DillonHammill ,

I recently stumbled across your package in my search for an R solution for data entry/data capture. It seems like your package would largely fit the bill.

However, I don't see any capability for restricting the range of numerical input values as part of a "Data Validation" workflow when entering data (see e.g. https://datacarpentry.org/spreadsheet-ecology-lesson/04-quality-control/index.html for what I mean by "Data Validation"). The col_options argument seems mostly to address data validation for known lists of discrete values, but I see no way to restrict column type to integer or min/max values for numeric input.

Did I miss this somewhere in the documentation?

Where would you suggest I begin looking in your code to add this capability? I'm happy to collaborate/contribute code for this capability as it would greatly facilitate my own workflow.

DillonHammill commented 2 years ago

Hi @palderman, this feature has been on my to do list for a while now (#14) but I haven't had time to look at it further. I would be more than happy to review a PR to add improved support for validating data entries.

As discussed in #14 I think the col_options argument would be a good place to add support for this. For example we could do something similar to the code below to restrict entries in vs column of mtcars to range [0,1]:

data_edit(
  mtcars,
  col_options = list(vs = c(0,1))
)

At the moment, the above code would trigger the vs column to use a dropdown menu with options 0 and 1. This is because any vector passed to col_options that doesn't contain either logical values or "date" or "password" character strings is treated as the options for data entry - so we resort to using dropdown columns to prevent typing errors. I do think there will be cases where users will want the above code to work as is, i.e. have numeric options in dropdown menus - so perhaps we should try to keep this functionality.

An alternative would be to add more flexibility to col_options by allowing a list of arguments per column. For example, we could do something like this (vs is the name of the column):

# numeric dropdown
data_edit(
  mtcars,
  col_options = list(
    vs = list(
       type = "dropdown",
       options  = c(0,1)
  )
)

# numeric range
data_edit(
  mtcars,
  col_options = list(
    vs = list(
       range = c(0, 1),
       precision = 0
  )
)

In the above example each list could have options type, options, range and precision (and any other useful options).

Once we have a format for the new col_options we need to make changes to the code in dataEdit.R to make sure these options are appropriately handled. I think most of the changes will need to occur in this chunk of code which is triggered whenever the data changes (note that the column and row headers are handled separately in the chunk below). https://github.com/DillonHammill/DataEditR/blob/d2840dc8a58aa82740a3b3208d4db0fed40aa030/R/dataEdit.R#L260-L286

Basically all you would need to do to compare the columns defined in col_options in the new data (x_new) and make sure all the listed criteria are met - otherwise we replace that column with the previous version defined in x_old. This method is not very efficient as you will be checking all entries not just the ones that have changed (remember this code is triggered with every single change - i.e. each time only a single value will be changed).

I think there is a way to check exactly which cell has been edited by accessing slots in input$x but you will need to look at rhandsontable to see how this works.

Hope this helps, happy to answer any questions you may have along the way.

Dillon

palderman commented 2 years ago

Sorry for missing #14. I should have checked the existing issues more thoroughly.

I was thinking of something along the lines you outlined above for an integer with hard min/max values:

# numeric dropdown
data_edit(
  mtcars,
  col_options = list(
    vs = list(
       type = "integer",
       min  = 0,
       max = 10
  )
)

Errant values in that case would be rejected, but I was also thinking of being able to warn about out-of-range values instead of rejecting them out right. In that case, supposing you have an integer that is usually between 3 and 5, but on rare occasions could possibly have values as low as 0 or high as 10, we could specify it as:

# numeric dropdown
data_edit(
  mtcars,
  col_options = list(
    vs = list(
       type = "integer",
       range  = c(0, 10),
       range_warn = c(3, 5)
  )
)

Values outside range would be rejected and values outside range_warn but inside range could be highlighted in red or yellow. Thoughts?

DillonHammill commented 2 years ago

Looks like a good starting point. I have played around with colouring cells and it is complicated.

For DataEditR to work, we need to set useTypes = FALSE in rhandsontable() which means we cannot make calls to hot_col() or hot_cell() to format the cells - this is because useTypes = FALSE is required for the column/row addition/removal functionality.

Perhaps you will find a way to get this to work.

BaruqueRodrigues commented 2 years ago

Dillon, I have a question.

I'm running DataEditR:: inside shiny as I need to validate multiple databases, however col_options() has a stop(!all(colnames %in% df)) if the columns are not contained in the dataframe. Do you have any suggestions, other than rewriting the function locally, on how to deal with this issue?

DillonHammill commented 2 years ago

@BarqueRodrigues can you poibt me to exactly where that line exists in the code? I can take a look. Given that it is an error I probably put it there for a good reason.

BaruqueRodrigues commented 2 years ago

Dillon in line 47 on DataEditR::dataEditServer if (!is.null(col_readonly)) { if (!all(col_readonly %in% colnames(data_to_edit))) { stop("'col_readonly' must contain valid column names.") } values$col_names <- unique(c(col_names, col_readonly)) }

DillonHammill commented 2 years ago

So you are using col_readonly? In that case it must match the columns in the data. What is your use case? You have some code I can look at?