Functionality added for treating strings as character and dropping re…

pford221 commented 6 years ago

…ally sparse columns to one_hot() function. Makes one other small efficiency suggestion and passes dropUnusedLevels directly to data.table::dcast().

ben519 commented 6 years ago

Thanks for the PR. Can you please give more details on what you're trying to do and why? Thanks

ben519 commented 6 years ago

Actually, can you open an issue for this and just explain what you're trying to achieve in the issue? Thanks.

pford221 commented 6 years ago

Ben,

I'm trying to make one_hot() a bit more flexible by allowing it to one-hot encode character variables if stringsAsFactors = TRUE. I also thought it would be nice to have the option to drop very sparse one-hot encoded columns, so I added the parameter dropPct and a bit of code to check if the one-hot encoded column is denser than dropPct. For instance if the user sets dropPct = 0.01 than all one-hot encoded columns with less than 1% of rows that have 1's will be dropped from the final output. This is helpful when dealing with high cardinality categorical variables and you only want to focus on those one-hot encoded columns with a significant amount of variety (i.e. a good deal of 1s and 0s). I forgot to update the documentation the one_hot.RD file in my last PR, so I'm going to submit another PR. Hopefully this passes the checks, but this is my first ever PR so I'm not sure if I'm doing it right.

I also think I might have a PR for one of your the requests about being able to make one_hot() accept a vector (or matrix) in addition to a data.table. But I'll wait to submit that PR until I hear back from you on what you'd prefer.

ben519 commented 6 years ago

Okay, thanks for the PR, but I don't think I'm going to merge it. Here are some scattered thoughts and comments for you

It's probably best to open an issue first and discuss the change before you actually do it. Or, if you want to make the change first I'd still prefer to have it be tied to an issue.
I think the main reason I don't want to merge is because I prefer the user to set up their data types properly. If column X is categorical, make it a factor in your data.table. You can convert all your character columns to unordered factors with

for(col in colnames(dt))
  if(is.character(dt[[col]])) set(dt, j = col, value = factor(dt[[col]]))

Your code creates a copy of the entire data.table (d <- copy(dt)) which can cause a memory issue if it is large. My code was already creating a copy by doing tempDT <- dt[, cols, with=FALSE] but this only copies the minimal set of necessary columns.
This line in your code charCols <- cols[sapply(d[ ,cols, with = F], class) == 'character'] could run into issues if a column has more than one class. Better to use is.character().

I do appreciate the PR. (I don't get them often.) And thanks for using mltools!

pford221 commented 6 years ago

Hey, no problem.

Thanks for your thoughts/tips and I agree with many of them. Interestingly, I've never seen a column of strings in a data.frame or data.table have more than one class associated with it. For my own curiosity, do you have an example of this? I use the sapply(data.frame, class) paradigm a lot so it help me to know what sort of idiosyncrasies could happen.

ben519 commented 6 years ago

I don't have a great example, but here's one I made up

foo <- c("a", "b", "c")
class(foo) <- c("lowercase", "character")
dt <- data.table(x = c("a", "b", "c"), foo = foo)
class(dt$foo) == "character"

ben519 / mltools

Functionality added for treating strings as character and dropping re… #12