ModelOriented / forester

Trees are all you need
https://modeloriented.github.io/forester/
GNU General Public License v3.0

[BUG] Data check halts due to unexpected input #109

Closed (madprogramer closed this issue 1 year ago)

madprogramer commented 1 year ago

While performing check_data on a dataset, forester encountered an error before it could finish generating its report.

check <- check_data(databank, 'das28_remission_m0')

Error and Traceback:

Error in str2lang(x): <text>:1:1448: unexpected input
1: crp_m0broad+ega_m0broad+pain_m0broad+haq_m0broad+fatigue_m0broad+boolean_remission_m0broad+sdai_m0broad+sdai_remission_m0broad+booleanremission_3items_m0broad+das28_remission26_m0broad+das28_r

Traceback:

1. check_data(databank, outcome)
2. manage_missing(df, y)
3. mice::mice(df, seed = 123, print = FALSE)
4. make.formulas(data, blocks)
5. lapply(formulas, as.formula)
6. FUN(X[[i]], ...)
7. formula(object, env = baseenv())
8. formula.character(object, env = baseenv())
9. str2lang(x)

This error occurs when manage_missing is called, before the Dimensionality Check step. The report printed up to that point:

 -------------------- CHECK DATA REPORT -------------------- 

The dataset has 1062 observations and 324 columns, which names are: 
patient_id; country; cohort_name; ...

With the target value described by a column das28_remission_m0.

✖ Static columns are: 
country; cohort_name; ... 

✖ With dominating values: 
...

✖ These column pairs are duplicate:
...

✖ 198 Target values are missing. 
✖ 1062 observations have missing fields.

Any idea what might be going on?

madprogramer commented 1 year ago

As a final note, I tried passing the formula text shown in the error message to str2lang, and it parsed without any problem.

str2lang('crp_m0broad+ega_m0broad+pain_m0broad+haq_m0broad+fatigue_m0broad+boolean_remission_m0broad+sdai_m0broad+sdai_remission_m0broad+booleanremission_3items_m0broad+das28_remission26_m0broad+das28_r')
> crp_m0broad + ega_m0broad + pain_m0broad + haq_m0broad + fatigue_m0broad + 
    boolean_remission_m0broad + sdai_m0broad + sdai_remission_m0broad + 
    booleanremission_3items_m0broad + das28_remission26_m0broad + 
    das28_r
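
Note that the error points at column 1448 of the formula string, while the text echoed in the error message is cut off long before that position, so parsing the echoed snippet does not exercise the failing part. A minimal sketch of how one might narrow this down, assuming the formula is built by joining column names with `+` (as the `make.formulas` frame in the traceback suggests) and assuming the cause lies in a column name; the data frame `df` and its column names are made up for illustration, and the exact parser message depends on the offending character:

```r
# Hypothetical stand-in for the real data; the last column name contains a
# space, so it is not a syntactically valid R name.
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
names(df) <- c("crp_m0", "pain_m0", "bad name")

# Join the column names the way a formula right-hand side would be built.
rhs <- paste(names(df), collapse = "+")

# Parsing the full string fails ...
full_result <- tryCatch(str2lang(rhs), error = function(e) e)
print(inherits(full_result, "error"))

# ... while a truncated prefix, like the one echoed in an error message, parses.
prefix_result <- tryCatch(str2lang(substr(rhs, 1, 14)), error = function(e) e)
print(prefix_result)

# Column names that make.names() would alter are not syntactic R names
# and are the first suspects.
suspects <- names(df)[make.names(names(df)) != names(df)]
print(suspects)
```

Running the last check on the real `names(databank)` would list any columns whose names cannot be pasted into a formula as-is.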

HubertR21 commented 1 year ago

Can you provide a dataset or the link to it?

madprogramer commented 1 year ago

Unfortunately, my dataset is confidential, but I suspect it's the abundance of NA values that messes things up.

I can try to submit a reprex using the african-names dataset, which is one of the more popular ones with missing values.

madprogramer commented 1 year ago

Ok, I think I have solved it.

Somehow my column types were mixed up, so I had to manually convert the variables that were misrepresented as strings, using as.factor and as.double.

So the short answer is: "This might happen if your factors are being miscast as strings".
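
For anyone hitting the same thing, the conversion described above can be sketched roughly like this. The column names and values are hypothetical; only `as.factor` and `as.double` come from the fix itself:

```r
# Hypothetical data: a categorical and a numeric column, both read in as strings.
df <- data.frame(
  country = c("PL", "TR", "PL"),
  crp_m0  = c("1.5", "2.0", NA),
  stringsAsFactors = FALSE
)

# Inspect the current classes to spot columns stored as character.
print(sapply(df, class))

# Convert each column to the type it is meant to have.
df$country <- as.factor(df$country)
df$crp_m0  <- as.double(df$crp_m0)   # missing values stay NA

# Check the classes again before calling check_data or train.
print(sapply(df, class))
```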

That mostly solved the issue, but now train fails right after ranking the models.

 -------------------- CHECK DATA REPORT END -------------------- 

✔ Data preprocessed. 
✔ Data split and balanced. 
✔ Correct formats prepared. 
✔ Models successfully trained. 
✔ Predicted successfully. 
✔ Ranked and models list created. 
Error in test_observed_labels[i] <- preprocessed_data$bin_labels[1]: replacement has length zero
Traceback:

1. forester::train(data = databank_sub, y = outcome, bayes_iter = 0, 
 .     random_evals = 0, advanced_preprocessing = FALSE, type = "binary_clf", 
 .     verbose = TRUE)

I suppose this error has a different cause, @HubertR21.
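
As background on the message itself: in base R, "replacement has length zero" is raised when the right-hand side of an indexed assignment has length zero, for example when the value being indexed is `NULL`. The sketch below reproduces that with stand-in objects named after the ones in the error; whether `bin_labels` is actually `NULL` inside forester in this case is an assumption, not something the traceback confirms:

```r
# Hypothetical stand-ins for the objects named in the error message.
test_observed_labels <- character(3)
preprocessed_data <- list()   # deliberately has no `bin_labels` element

# `$` on a missing list element yields NULL, and NULL[1] is still NULL.
label <- preprocessed_data$bin_labels[1]
print(is.null(label))

# Assigning a zero-length value into a single position is an error;
# capture the message instead of stopping the script.
result <- tryCatch(
  {
    test_observed_labels[1] <- label
    "ok"
  },
  error = function(e) conditionMessage(e)
)
print(result)
```

If that is the situation here, checking the target column's class and levels in `databank_sub` before calling train would be a reasonable first step.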