ModelOriented / forester

Trees are all you need
https://modeloriented.github.io/forester/
GNU General Public License v3.0

[ISSUE] Forester cannot train if preprocessed data is unsplittable #110

Closed madprogramer closed 10 months ago

madprogramer commented 1 year ago

Training using forester with these settings

output_1 <- forester::train(data         = databank,
                            y            = outcome,
                            bayes_iter   = 0,
                            random_evals = 0,
                            fractions    = c(0.6, 0.2, 0.2),
                            verbose      = TRUE)

head(output_1$score_test)

gave me the following error:

Error in splitTools::partition(target, p = c(train = fractions[1], test = fractions[2], : (n <- length(y)) >= 2L is not TRUE
Traceback:

1. forester::train(data = databank_sub, y = outcome, bayes_iter = 0, 
 .     random_evals = 0, verbose = TRUE)
2. train_test_balance(preprocessed_data$data, y, balance = TRUE, 
 .     fractions = train_test_split, seed = split_seed)
3. splitTools::partition(target, p = c(train = fractions[1], test = fractions[2], 
 .     valid = fractions[3]), seed = seed)
4. stopifnot(length(p) >= 1L, p > 0, is.atomic(y), (n <- length(y)) >= 
 .     2L)

Output from verbose mode:

✖ Provided dataset is a tibble and not a data.frame or matrix. Casting the dataset to data.frame format. 

✔ Type guessed as:  regression 

 -------------------- CHECK DATA REPORT -------------------- 

The dataset has 1062 observations and 17 columns, which names are: 
anticcp; current_smoker_latest; sjc28_m0; tjc28_m0; pga_m0; crp_m0; ega_m0; pain_m0; sex; erosive_status_baseline; haq_m0; fatigue_m0; age; prednisolone_oral_m0; igm_rf; symptom_duration_months; pga_m6; 

With the target value described by a column pga_m6.

✔ No static columns. 

✔ No duplicate columns.

✖695 Target values are missing. 
✖ 900 observations have missing fields.

✔ No issues with dimensionality. 

✖ Strongly correlated, by Spearman rank, pairs of numerical values are: 

 pga_m0 - pain_m0: 0.87;
 pga_m0 - fatigue_m0: 0.73;

✔ No strongly correlated, by Crammer's V rank, pairs of categorical values. 

✖ There are more than 50 possible outliers in the data set, so we are not printing them. They are returned in the output as a vector. 

✖ Multilabel classification is not supported yet. 

✔ Columns names suggest that none of them are IDs. 

✔ Columns data suggest that none of them are IDs. 

 -------------------- CHECK DATA REPORT END -------------------- 

✔ Data preprocessed. 
**ERROR**

Just to confirm that the splitTools::partition error was caused by preprocessing, I tried calling train_test_balance with the original dataset and that worked just fine.

train_test_balance(databank, outcome, balance = TRUE, fractions = c(0.6, 0.2, 0.2), seed = NULL)
> Outputs Tables

Any clue what might be going on here?
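
One quick check, sketched here on a made-up toy data frame rather than the real databank, is to count how many rows still carry a usable target, since the report flags 695 missing target values and the stopifnot in the traceback requires a target of length at least 2. Whether forester's preprocessing actually drops rows with missing values is an assumption here; the output above does not confirm it.

```r
# Toy stand-in for `databank`; column names mirror the report, values are made up.
toy <- data.frame(
  pga_m0 = c(3, 5, NA, 2, 7, 4),
  pga_m6 = c(2, NA, NA, 1, NA, 3)
)
outcome <- "pga_m6"

# Rows whose target value is present.
has_target    <- !is.na(toy[[outcome]])
n_with_target <- sum(has_target)

# Rows with no missing field at all (target included).
n_complete <- sum(complete.cases(toy))

# splitTools::partition() stops unless the target has at least 2 elements
# (see the stopifnot call in the traceback).
can_split <- n_complete >= 2L

cat("rows with target:", n_with_target, "\n")
cat("complete rows:   ", n_complete, "\n")
cat("splittable:      ", can_split, "\n")
```

If the count of usable rows came out below 2 on the real data after preprocessing, that would match the partition error; if not, the cause lies elsewhere.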

HubertR21 commented 1 year ago

Can you tell me what you provide as the 'outcome' variable? It should be a string containing the column name.

madprogramer commented 1 year ago

The outcome variable is given as a string, as in the vignette examples. It names a tibble column (pga_m6), which is numeric, ranging from 0 to 8 and possibly NA.

HubertR21 commented 1 year ago

Sorry for such a long break, I was swamped with other responsibilities. If the outcome is given as a string although it should be numeric, then the automatic guess for the task is multilabel classification, which the package does not support yet and which might lead to corrupted results.
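
A minimal sketch of checking and fixing the target's type before calling train, using a made-up toy data frame (the values and the character-typed column are assumptions for illustration, not taken from the reported dataset):

```r
# Toy stand-in: the target column was read in as character instead of numeric.
toy <- data.frame(
  pga_m0 = c(3, 5, 2, 7),
  pga_m6 = c("2", "4", NA, "1"),
  stringsAsFactors = FALSE
)
outcome <- "pga_m6"

# Inspect the storage type of the target column.
print(class(toy[[outcome]]))

# Coerce to numeric if needed. Going through as.character() also handles
# factors; strings that are not numbers would become NA with a warning.
if (!is.numeric(toy[[outcome]])) {
  toy[[outcome]] <- as.numeric(as.character(toy[[outcome]]))
}

# The target is now numeric, so a regression task is the natural guess.
stopifnot(is.numeric(toy[[outcome]]))
```

This only addresses the column type; missing target values would still need separate handling.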

In a few days, we will provide a new custom preprocessing module, which might resolve your issue.