epinowcast / coerceDT

Ingest-and-check user-provided input to standardized `data.table`s
Other
3 stars 0 forks source link

Prepare 0.1.0 release #3

Open seabbs opened 1 year ago

seabbs commented 1 year ago

We have had a fair few months with no progress on this (for understandable reasons). Shall we setup a meeting with those interested in contributing and hash out some issues that we can PR against to at least get an MVP version that can replace the functionality in epinowcast.

pearsonca commented 11 months ago

@seabbs could use a second set of eyes on https://github.com/epinowcast/coerceDT/blob/main/R/coerceDT.R - thoughts on the interface offered?

Not quite done w/ implement + test for it, but am thinking:

pearsonca commented 11 months ago

@seabbs I think coerceDT + tests is basically done, moving on to checkDT.

pearsonca commented 11 months ago

@seabbs this is now approaching 0.1.0 release readiness.

How do we want to check that it works how we want it to work? Propose an epinowcast & scoringutils branch => introduce this as a dependency => modify those library codebases (likely also this library code to do more of what we want for them) => iterate until we're happy with what this does for those libraries.

Related: there's definitely some documentation issues to deal with - I think that's best dealt with by another set of eyes, at least to identify the WTF bits. Useful candidates for that?

pearsonca commented 11 months ago

Tagging @nikosbosse - I understand you've got an internal helper function like this in scoringutils. If it potentially makes sense to substitute this in, please advise (with the helper function name(s)) and I'll have a look.

nikosbosse commented 11 months ago

The version in scoringutils is merely this:

scoringutils:::ensure_data.table
function(data) {
  if (is.data.table(data)) {
    data <- copy(data)
  } else {
    data <- as.data.table(data)
  }
  return(data)
}

Maybe there is room for more sophistication :)

Regarding the version you linked to: It seems quite complicated to me and the function does a lot of different things (selecting / dropping columns, reading in data, conversion to data.table, selecting whether or not that happens by reference...). For me as a newbie user I think it would be helpful to separate these things a bit more.

Also since we're currently reworking scoringutils in terms of S3 methods I now see them everywhere... But having coerceDT be a generic seems like it could be a good idea.

pearsonca commented 11 months ago

The version in scoringutils is merely this:

scoringutils:::ensure_data.table
function(data) {
  if (is.data.table(data)) {
    data <- copy(data)
  } else {
    data <- as.data.table(data)
  }
  return(data)
}

Maybe there is room for more sophistication :)

Indeed! In terms of just replacing that function, that would literally just be: ensure_data.table(data) => coerceDT(data) (though now data could be a file path or any other string fread just handles).

If you have places internally in scoringutils where you don't need to copy, those would become coerceDT(data, copy = FALSE).

pearsonca commented 11 months ago

I did fiddle a bit with S3 methods - found it easier to just handle dispatch based on type in the internal function logic.

pearsonca commented 11 months ago

@nikosbosse had a quick peek at scoringutils - for example check_forecasts might become something like

check_forecasts <- function(data) {

  # create lists to store results ----------------------------------------------
  out <- list()
  warnings <- list()
  errors <- list()
  messages <- list()

  data <- makeDT(data, require = c("true_value", "prediction"), forbid = available_metrics())

  # ... other more complex checks, starting from the bit looking at models
nikosbosse commented 11 months ago

The helper function, ensure_data.table linked above is currently on the develop branch - the current version does the checks for required columns independently of that, but in principle it could also be a single function (there is currently an internal function check_columns_present())

pearsonca commented 10 months ago

@seabbs per discussion re nomenclature - how about asDT?