Open brandmaier opened 3 years ago
I believe we should not - the modeling is done entirely by external functions; ranger() by default.
Perhaps we can explain the synthesis process better, so that people understand that it might break on certain data types?
Alright, then let's not treat this as a bug. I like the approach that we outsource this responsibility. Still, how do we best help people to proceed from this point on? I think it's difficult for beginners to figure out how to fix the problem of "synthetic data could not be generated".
It's VERY difficult. We're also limited by resources, of course. I know that Marvin Wright would be open to a pull request that makes ranger suitable for a wider range of input/outcome variables, so we could try to make ranger itself more flexible. Another solution would be to wrap ranger in a function that preprocesses variable types using some sensible defaults.
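A minimal sketch of what such a preprocessing step could look like, assuming the only "sensible default" is to convert character columns to factors before the data reach the modeling function. The name `strings_to_factors()` is hypothetical and not part of ranger or of this package; it uses base R only.

```r
# Hypothetical helper (not an existing ranger or package function):
# convert every character column of a data frame to a factor and
# leave all other columns untouched.
strings_to_factors <- function(data) {
  stopifnot(is.data.frame(data))
  # Flag the columns that are stored as character vectors
  is_chr <- vapply(data, is.character, logical(1))
  for (col in names(data)[is_chr]) {
    data[[col]] <- factor(data[[col]])
  }
  data
}

# Example with made-up values
df <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  mass    = c(3750, 5000, 3800),
  stringsAsFactors = FALSE
)
str(strings_to_factors(df))
```

This default would not suit free-text columns such as comments, where nearly every value is unique; those would need separate handling.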
In the meantime, I removed the error wrapping from the call to synthetic(). This means that if data synthesis fails, the entire function fails, so users are more aware of the problem.
Creating synthetic data from Allison Horst's penguin data (https://raw.githubusercontent.com/bvreede/worcshop/master/data/penguins.csv) fails. The error occurs because the data contain one column of strings. I assume this could be fixed by turning this column into a factor. However, there is a good chance that people will have strings in their data files, e.g., a comment column. Should we, by default, drop all string columns (while issuing a warning) before generating synthetic data?
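To make the proposal concrete, here is a rough sketch of the drop-with-warning behavior in base R. The function name `drop_string_columns()` is hypothetical, and the sketch assumes that "string columns" means columns of type character.

```r
# Hypothetical helper (not an existing function in this package):
# remove character columns and warn about which ones were dropped.
drop_string_columns <- function(data) {
  stopifnot(is.data.frame(data))
  is_chr <- vapply(data, is.character, logical(1))
  if (any(is_chr)) {
    warning(
      "Dropping character column(s) before generating synthetic data: ",
      paste(names(data)[is_chr], collapse = ", "),
      call. = FALSE
    )
  }
  # drop = FALSE keeps a data frame even if only one column remains
  data[, !is_chr, drop = FALSE]
}

# Example with made-up values; emits a warning that names "comment"
df <- data.frame(
  id      = 1:3,
  comment = c("ok", "check again", "ok"),
  stringsAsFactors = FALSE
)
names(drop_string_columns(df))
```

Naming the dropped columns in the warning tells users which variables are missing from the synthetic data, so they can convert those columns to factors themselves if they want them included.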