ben519 / mltools

Exploratory and diagnostic machine learning tools for R
Other
72 stars 26 forks source link

Functionality added for treating strings as character and dropping re… #12

Closed pford221 closed 6 years ago

pford221 commented 6 years ago

…ally sparse columns to one_hot() function. Makes one other small efficiency suggestion and passes dropUnusedLevels directly to data.table::dcast().

ben519 commented 6 years ago

Thanks for the PR. Can you please give more details on what you're trying to do and why? Thanks

ben519 commented 6 years ago

Actually, can you open an issue for this and just explain what you're trying to achieve in the issue? Thanks.

pford221 commented 6 years ago

Ben,

I'm trying to make one_hot() a bit more flexible by allowing it to one-hot encode character variables if stringsAsFactors = TRUE. I also thought it would be nice to have the option to drop very sparse one-hot encoded columns, so I added the parameter dropPct and a bit of code to check if the one-hot encoded column is denser than dropPct. For instance if the user sets dropPct = 0.01 than all one-hot encoded columns with less than 1% of rows that have 1's will be dropped from the final output. This is helpful when dealing with high cardinality categorical variables and you only want to focus on those one-hot encoded columns with a significant amount of variety (i.e. a good deal of 1s and 0s). I forgot to update the documentation the one_hot.RD file in my last PR, so I'm going to submit another PR. Hopefully this passes the checks, but this is my first ever PR so I'm not sure if I'm doing it right.

I also think I might have a PR for one of your the requests about being able to make one_hot() accept a vector (or matrix) in addition to a data.table. But I'll wait to submit that PR until I hear back from you on what you'd prefer.

ben519 commented 6 years ago

Okay, thanks for the PR, but I don't think I'm going to merge it. Here are some scattered thoughts and comments for you

for(col in colnames(dt))
  if(is.character(dt[[col]])) set(dt, j = col, value = factor(dt[[col]]))

I do appreciate the PR. (I don't get them often.) And thanks for using mltools!

pford221 commented 6 years ago

Hey, no problem.

Thanks for your thoughts/tips and I agree with many of them. Interestingly, I've never seen a column of strings in a data.frame or data.table have more than one class associated with it. For my own curiosity, do you have an example of this? I use the sapply(data.frame, class) paradigm a lot so it help me to know what sort of idiosyncrasies could happen.

ben519 commented 6 years ago

I don't have a great example, but here's one I made up

foo <- c("a", "b", "c")
class(foo) <- c("lowercase", "character")
dt <- data.table(x = c("a", "b", "c"), foo = foo)
class(dt$foo) == "character"