SchlossLab / mikropml

User-Friendly R Package for Supervised Machine Learning Pipelines
http://www.schlosslab.org/mikropml

Class strings cannot contain spaces in the outcome columns [JOSS review] #244

Closed JonnyTran closed 3 years ago

JonnyTran commented 3 years ago

I also encountered a bug when running run_ml() on a dataset whose outcome column has categorical class labels containing spaces. The classes that caused errors are "progressive supranuclear palsy" and "pathological aging". I can work around the bug by replacing " " with "_", which turns the labels into "progressive_supranuclear_palsy" and "pathological_aging".
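For anyone hitting the same error, the workaround described above can be sketched in base R roughly as follows. The data frame `toy` and its `gene1` column are hypothetical stand-ins for the real dataset; only the `dx` column name and the two class labels come from this report.

```r
# Hypothetical toy data standing in for the real dataset (assumption).
toy <- data.frame(
  dx = c("progressive supranuclear palsy", "pathological aging", "control"),
  gene1 = c(0.1, 0.5, 0.9),
  stringsAsFactors = FALSE
)

# Replace every space in the outcome labels with an underscore so that
# each label is a syntactically valid R name.
toy$dx <- gsub(" ", "_", toy$dx, fixed = TRUE)

print(unique(toy$dx))
```

After this step the outcome labels pass the make.names() check, so the data frame can be handed to preprocess_data() and run_ml() as usual.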

The code below reproduces the bug.

amp_data_preproc <- preprocess_data(amp_ad_geneexp_dx, 'dx')$dat_transformed

Using 'dx' as the outcome column. Removed 1388/1917 (72.4%) of samples because of missing outcome value (NA).

result_amp <- run_ml(amp_data_preproc, 'glmnet')

Using 'dx' as the outcome column.
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to Alzheimer.Disease, control, pathological.aging, progressive.supranuclear.palsy . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
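The error message points at ?make.names, so a quick way to see which labels trigger it is to compare each label with its make.names() result. This is only a sketch of that check using the labels named in this report; the variable names `labels`, `converted`, and `is_valid` are made up for illustration, and this is not necessarily how the underlying check is implemented.

```r
# Class labels taken from this report.
labels <- c("progressive supranuclear palsy", "pathological aging", "control")

# make.names() replaces characters that are invalid in R names with ".".
converted <- make.names(labels)

# A label is a valid R variable name if make.names() leaves it unchanged.
is_valid <- converted == labels

print(data.frame(label = labels, converted = converted, valid = is_valid))
```

Labels for which `valid` is FALSE are the ones that need renaming before class probabilities can be generated.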

I would expect run_ml() to accept class labels that contain spaces, or preprocess_data() to automatically convert the outcome labels into valid names.

Related to https://github.com/openjournals/joss-reviews/issues/3073

kelly-sovacool commented 3 years ago

Thank you for opening this, @JonnyTran. I'll take a look!

kelly-sovacool commented 3 years ago

Hi @JonnyTran, we went with your suggestion for preprocess_data() to automatically prepare the outcome column. Thanks for spotting this!