NSAPH-Software / GPCERF

https://NSAPH-Software.github.io/GPCERF/
GNU General Public License v3.0
9 stars 3 forks source link

The interface relies on a very restrictive data format (which is then only partially checked) #70

Closed martinmodrak closed 11 months ago

martinmodrak commented 1 year ago

While the interface is usable as is, I think requiring a specific ordering and naming scheme of a data.frame is an anti pattern.

A simple remedy would be to add parameters for the names of response, treatment and confounder columns (and then immediately translate internally to the fixed format) - the defaults for those parameters could then respect the current format. A formula-style syntax would be even nicer (e.g. the README examples could be specified via formula = Y ~ treat, confounders = ~ cf1 + cf2 + cf3 + cf4 + cf5 + cf6

Additionally, the input format is not very thoroughly checked and interpreted, I've encountered two issues - in both cases, I am modifying the GP example from README.md by inserting some code just below line 3 (sim_data <- generate_synthetic_data(sample_size = 500, gps_spec = 1))

Just a single covariate:

Using

sim_data <- sim_data[, 1:3]

results in

Error in compute_w_corr(w = data[[2]], covariate = data[, 3:ncol(data)],  : 
  covariate should be a data.frame, the provided one is:  numeric

(presumably drop = FALSE is missing in some indexing statement)

Renaming outcome

Using

names(sim_data)[1] <- "Y2"
sim_data$Y <- rnorm(500)

Completes, but for some reason provides different results than the original as well as a version with just the new irrelevant covariate without the renaming (i.e. just having sim_data$new_cf <- rnorm(500)). It appears the code sometimes refer to the outcome column Y by name and sometimes by index.

Naeemkh commented 1 year ago

Thanks @martinmodrak, this really improved the code.

martinmodrak commented 11 months ago

There appears to be no change to the docs website or the readme (presumably because you are working on a separate JOSS branch ? But the README is identical there as well ... ) making it a bit hard for me to see and test the changes made... Could you update the docs or show me some examples on how the new interface looks like?

Naeemkh commented 11 months ago

Hi @martinmodrak, Thanks for the comment. Much of the work on this issue centered around addressing the background bug you accurately identified, which occurred when selecting a single covariate in the study. Regarding the interface, we have eliminated the specific ordering of data and now gather it according to each column name. In forthcoming major updates, we plan to incorporate a formula for data handling. We updated the documentation. In the readme file example, the following input values have been added.

 outcome_col = "Y",
 treatment_col = "treat",
 covariates_col = paste0("cf", seq(1,6)),