Open rpoplin opened 10 years ago
Unfortunately, there is currently no support for missing values. This is on the roadmap, but has yet to be implemented. One option is to remove the samples (rows) that contain missing values, or to remove the features (columns) in which missing values occur. You could also fill in the missing values with a flag value of the same type as the column. This would be a workaround for training with missing data, but it could produce odd results. The flag should lie beyond the extremes of the column's data, since splitting is done via the "<" operator. For example, if your column takes the values 1, 2, 3, 4, your flag could be 999. If your feature takes boolean values, though, you might be forced to change the data type to integer (0, 1, 999) or string ("T", "F", "X"). I hope this helps a bit.
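For anyone who wants to try the two workarounds described above, here is a minimal sketch. It assumes a recent Julia (1.x) where absent values are represented by the built-in `missing`, rather than the `NA` from DataArrays that was current when this thread started. The toy data, the helper name `fill_with_flag`, and its `margin` keyword are made up for illustration and are not part of DecisionTree.jl. The sketch only prepares the arrays and does not call the library.

```julia
# Toy data: 4 samples, 2 features; `missing` marks an absent value.
features = Union{Missing, Float64}[1.0 10.0; 2.0 missing; missing 30.0; 4.0 40.0]
labels = ["a", "b", "a", "b"]

# Option 1: drop every row that has at least one missing value.
keep = [!any(ismissing, features[i, :]) for i in axes(features, 1)]
features_complete = Matrix{Float64}(features[keep, :])
labels_complete = labels[keep]

# Option 2: replace missing entries with a per-column flag that lies above
# the largest observed value, so a "<" split can separate flagged rows.
# Hypothetical helper; assumes every column has at least one observed value.
function fill_with_flag(X; margin = 1000.0)
    filled = Matrix{Float64}(undef, size(X))
    for j in axes(X, 2)
        col = view(X, :, j)
        flag = maximum(skipmissing(col)) + margin
        filled[:, j] = coalesce.(col, flag)
    end
    return filled
end

features_flagged = fill_with_flag(features)
```

Either result is a plain `Matrix{Float64}`, so it no longer carries missing values into the "<" comparison. Whether the flagged version gives sensible trees depends on the data, as noted above.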
It might be worthwhile integrating with DataArrays and/or DataFrames.
@rpoplin Here are some guidelines for dealing with missing values: http://people.eecs.ku.edu/~jerzy/b24-miss.pdf @kmsquire Yeah, perhaps it's time to have DataFrames as a dependency and handle NAs properly. I'll try to have a go at it over the weekend.
DataFrames and NA support would be nice indeed. Perhaps using PooledDataArray could make it faster as well.
I've been trying out this library as I jump into learning Julia and I'm wondering what support there is for missing values in the dataset. Any recommendations that you have based on your experience for how to deal with these missing values would be very helpful.
julia> model = build_forest(labels, features, 3, 10)
exception on 1: ERROR: no method convert(Type{Bool}, NAtype)
 in setindex! at array.jl:298
 in bitcache_lt at broadcast.jl:366
 in .< at broadcast.jl:382
 in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:153
 in build_tree at .julia/v0.3/DecisionTree/src/DecisionTree.jl:171
 in anonymous at no file:237
 in anonymous at multi.jl:1263
 in run_work_thunk at multi.jl:613
 in run_work_thunk at multi.jl:622
 in anonymous at task.jl:6
ERROR: no method convert(Type{Node}, MethodError)
 in copy! at abstractarray.jl:149
 in convert at array.jl:209
 in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:239
 in build_forest at .julia/v0.3/DecisionTree/src/DecisionTree.jl:232
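The trace points at the element-wise `.<` comparison inside `build_tree`: comparing against an `NA` does not produce a `Bool`, so the result cannot be stored in a boolean array. A rough modern-Julia analogue, using the built-in `missing` in place of `NA`, is sketched below together with a small pre-flight check that could be run before training. The helper `columns_with_missing` is a made-up name for illustration, not a DecisionTree.jl function.

```julia
# Modern-Julia analogue of the failure above; `missing` stands in for the old `NA`.
features = Union{Missing, Float64}[1.0 2.0; missing 4.0; 3.0 6.0]

# Comparing against a missing value yields `missing`, not a Bool.
@show ismissing(missing < 2.0)

# Hypothetical helper: list the columns that contain at least one missing value.
columns_with_missing(X) = [j for j in axes(X, 2) if any(ismissing, view(X, :, j))]

bad = columns_with_missing(features)
isempty(bad) || println("columns with missing values: ", bad)
```

Running a check like this first turns the long worker-side stack trace into a direct message about which features need cleaning.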