JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Missing data #10

Open rpoplin opened 10 years ago

rpoplin commented 10 years ago

I've been trying out this library as I jump into learning Julia, and I'm wondering what support there is for missing values in the dataset. Any recommendations, based on your experience, for how to deal with these missing values would be very helpful.

julia> model = build_forest(labels, features, 3, 10)

exception on 1: ERROR: no method convert(Type{Bool}, NAtype)
 in setindex! at array.jl:298
 in bitcache_lt at broadcast.jl:366
 in .< at broadcast.jl:382
 in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:153
 in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:171
 in anonymous at no file:237
 in anonymous at multi.jl:1263
 in run_work_thunk at multi.jl:613
 in run_work_thunk at multi.jl:622
 in anonymous at task.jl:6
ERROR: no method convert(Type{Node}, MethodError)
 in copy! at abstractarray.jl:149
 in convert at array.jl:209
 in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:239
 in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:232

bensadeghi commented 10 years ago

Unfortunately, there is currently no support for missing values. It is on the roadmap but has yet to be implemented. One option is to remove the samples/rows that contain missing values, or to remove the features/columns where they appear. You could also fill in the missing values with some sort of flag of the same type as the column. This would be a work-around for training with missing data, but it could return odd results. The flags should take values at the extremes of the column data, since splitting is done via the "<" operator. For example, if your column takes the values 1, 2, 3, 4, your flag could be 999. But if your feature takes boolean values, you might be forced to change the data type to integer (0, 1, 999) or string ("T", "F", "X"). I hope this helps a bit.
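For illustration, here is a minimal sketch of that flag work-around, written with modern missing values rather than the DataArrays NA of the original thread; the data, the sentinel 999.0, and the forest parameters are made up:

using DecisionTree

labels   = ["a", "b", "a", "b", "a", "b"]
features = [1.0      2.0      3.0;
            missing  3.0      1.0;
            4.0      missing  2.0;
            2.0      1.0      missing;
            3.0      2.0      1.0;
            1.0      3.0      2.0]

# Replace each missing entry with a sentinel outside the column's range,
# so the "<" splits send all flagged samples to the same side.
features_flagged = coalesce.(features, 999.0)

model = build_forest(labels, features_flagged, 2, 10)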

kmsquire commented 10 years ago

It might be worthwhile integrating with DataArrays and/or DataFrames.


bensadeghi commented 10 years ago

@rpoplin Here are some guidelines for dealing with missing values: http://people.eecs.ku.edu/~jerzy/b24-miss.pdf

@kmsquire Yeah, perhaps it's time to have DataFrames as a dependency and handle NAs properly. I'll try to have a go at it over the weekend.
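For illustration, a sketch of the row-removal option using the present-day DataFrames API (which postdates this comment); the column names, values, and the target column label are made up:

using DataFrames, DecisionTree

df = DataFrame(x1 = [1.0, missing, 4.0, 2.0, 3.0, 1.5],
               x2 = [2.0, 3.0, missing, 1.0, 2.5, 0.5],
               label = ["a", "b", "a", "b", "a", "b"])

# Drop every row that still contains a missing value.
clean = dropmissing(df)

labels   = clean.label
features = Matrix{Float64}(clean[:, [:x1, :x2]])

model = build_forest(labels, features, 2, 10)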

ValdarT commented 7 years ago

DataFrames and NA support would be nice indeed. Perhaps using PooledDataArray could make it faster as well.

ValdarT commented 7 years ago

It seems that the story of missing values (in DataFrames, DataStreams, etc.) has finally come to a conclusion with the Missings approach. Perhaps it's a good time to revisit this issue.
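For illustration, a sketch of what the unified missing representation looks like from the user's side, here with per-column mean imputation before training; the data and forest parameters are made up, and this imputation is not something DecisionTree.jl does itself:

using Statistics, DecisionTree

features = [1.0      2.0;
            missing  3.0;
            4.0      missing;
            2.0      1.0]        # Matrix{Union{Missing, Float64}}

imputed = similar(features, Float64)
for j in axes(features, 2)
    col = features[:, j]
    m = mean(skipmissing(col))        # column mean over the non-missing entries
    imputed[:, j] = coalesce.(col, m) # fill each missing entry with that mean
end

labels = ["a", "b", "a", "b"]
model  = build_forest(labels, imputed, 2, 10)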