Problem when generating synthetic data

cjvanlissa / worcs

Rstudio project template and convenience functions for the Workflow for Open Reproducible Code in Science (WORCS)

https://cjvanlissa.github.io/worcs/

GNU General Public License v3.0

Problem when generating synthetic data #108

Open brandmaier opened 3 years ago

brandmaier commented 3 years ago

Creating synthetic data from Allison Horst's penguin data fails (https://raw.githubusercontent.com/bvreede/worcshop/master/data/penguins.csv). The error occurs because one column contains strings. I assume this could be fixed by turning that column into a factor. However, there is a good chance that people will have strings in their data files, e.g., a comment column. Should we, by default, drop all string columns (while issuing a warning) before generating synthetic data?
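For readers hitting the same error, the factor conversion suggested above could look roughly like the sketch below. It uses a small toy data frame rather than the actual penguins file, and the column names are illustrative assumptions, not taken from the issue.

```r
# Toy data standing in for the penguins file (column names are assumptions).
df <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  bill_length_mm = c(39.1, 46.1, 49.5, 38.6),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor.
chr_cols <- vapply(df, is.character, logical(1))
df_factor <- df
df_factor[chr_cols] <- lapply(df_factor[chr_cols], factor)

str(df_factor)
```

The converted data frame, rather than the original, would then be passed on to the synthesis step.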

cjvanlissa commented 3 years ago

I believe we should not - the modeling is done entirely by external functions; ranger() by default.

Perhaps we can explain the synthesis process better, so that people understand that it might break on certain data types?

brandmaier commented 3 years ago

Alright, then let's not treat this as a bug. I like the approach that we outsource this responsibility. Still, how do we best help people to proceed from this point on? I think it's difficult for beginners to figure out how to fix the problem of "synthetic data could not be generated".

cjvanlissa commented 3 years ago

It's VERY difficult. We're also limited by resources, of course. I know that Marvin Wright would be open to a pull request that makes ranger suitable for a wider range of input/outcome variables, so we could try to make ranger itself more flexible. Another solution would be to wrap ranger in a function that preprocesses variable types using some sensible defaults.
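A minimal sketch of the wrapper idea, in base R only. The names `prepare_for_synthesis()`, `synthesize_with_defaults()`, and `max_levels` are hypothetical, and the defaults (logicals and low-cardinality strings become factors, free-text columns are dropped with a warning) are just one possible choice, not something worcs or ranger implements.

```r
# Hypothetical helper: coerce column types before the data reach the model.
prepare_for_synthesis <- function(data, max_levels = 10) {
  for (nm in names(data)) {
    col <- data[[nm]]
    if (is.logical(col)) {
      # Logical columns become two-level factors.
      data[[nm]] <- factor(col)
    } else if (is.character(col)) {
      if (length(unique(col)) <= max_levels) {
        # Few distinct values: treat the column as categorical.
        data[[nm]] <- factor(col)
      } else {
        # Many distinct values: probably free text, so drop it and warn.
        warning("Dropping free-text column '", nm, "'")
        data[[nm]] <- NULL
      }
    }
  }
  data
}

# Hypothetical wrapper: preprocess, then call whatever synthesis function is given.
synthesize_with_defaults <- function(data, synth_fun, max_levels = 10, ...) {
  synth_fun(prepare_for_synthesis(data, max_levels = max_levels), ...)
}

# Toy usage; identity() stands in for the real synthesis function.
df <- data.frame(
  species = rep(c("Adelie", "Gentoo", "Chinstrap"), 4),
  note = paste("free text", 1:12),
  heavy = rep(c(TRUE, FALSE), 6),
  stringsAsFactors = FALSE
)
out <- suppressWarnings(synthesize_with_defaults(df, identity, max_levels = 5))
```

In practice `synth_fun` might be something like `synthetic()`, assuming it accepts the data frame as its first argument. Keeping the preprocessing in a separate function would let users inspect what was changed before any model is fitted.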

cjvanlissa commented 3 years ago

In the meantime, I removed the error wrapping from the call to synthetic(). This means that if data synthesis fails, the entire function fails, so users are more aware of the problem.
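To illustrate the difference this makes, here is a sketch with a hypothetical stand-in, `synthesize_stub()`, that fails on character columns. It is not the actual worcs code, only a comparison of a wrapped call with an unwrapped one.

```r
# Hypothetical stand-in for a synthesis step that fails on character columns.
synthesize_stub <- function(data) {
  if (any(vapply(data, is.character, logical(1)))) {
    stop("synthetic data could not be generated: character column found")
  }
  data
}

# Wrapped: the error is downgraded to a message and NULL is returned,
# so the calling code carries on and the failure is easy to miss.
wrapped <- function(data) {
  tryCatch(
    synthesize_stub(data),
    error = function(e) {
      message("Synthesis failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Unwrapped: the error propagates and stops the calling function.
unwrapped <- function(data) synthesize_stub(data)

df <- data.frame(comment = c("a", "b"), stringsAsFactors = FALSE)

res_wrapped <- suppressMessages(wrapped(df))
# The outer tryCatch here only demonstrates that the error reaches the caller.
res_unwrapped <- tryCatch(unwrapped(df), error = function(e) "error reached caller")
```

Callers who prefer the old, lenient behaviour can still add their own `tryCatch()` around the call.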