dmlc / XGBoost.jl

XGBoost Julia Package

support for multi-dimensional "label" for regressions? #38

Open ExpandingMan opened 7 years ago

ExpandingMan commented 7 years ago

Hello all. I haven't dug too far into the source code yet, but I'm wondering if it's possible to do regressions where the "label" (target value) consists of multi-dimensional data points. (i.e. the label argument of the xgboost function would be an Array{T<:Number,2}.) This seems like a pretty important feature, but I can't find any literature about it in the xgboost documentation for any language.

It seems to me that even if this isn't explicitly supported, it should be possible by setting a custom loss function; however, I get the following error any time I try to pass a matrix-valued "label":

ERROR: LoadError: MethodError: no method matching (::XGBoost.#_setinfo#8)(::Ptr{Void}, ::String, ::Array{Float64,2})
Closest candidates are:
  _setinfo{T<:Number}(::Ptr{Void}, ::String, ::Array{T<:Number,1}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:10
 in (::XGBoost.##call#7#11)(::Array{Any,1}, ::Type{T}, ::Array{Float64,2}, ::Bool, ::Float32) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:59
 in (::Core.#kw#Type)(::Array{Any,1}, ::Type{XGBoost.DMatrix}, ::Array{Float64,2}, ::Bool, ::Float32) at ./<missing>:0
 in makeDMatrix(::Array{Float64,2}, ::Array{Float64,2}) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:137
 in #xgboost#20(::Array{Float64,2}, ::Array{Any,1}, ::Array{Any,1}, ::Array{Any,1}, ::Type{T}, ::Type{T}, ::Array{Any,1}, ::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at /home/user/.julia/v0.5/XGBoost/src/xgboost_lib.jl:147
 in (::XGBoost.#kw##xgboost)(::Array{Any,1}, ::XGBoost.#xgboost, ::Array{Float64,2}, ::Int64) at ./<missing>:0
 in include_from_node1(::String) at ./loading.jl:488
while loading /home/user/RatingsPrediction/xgboost0.jl, in expression starting on line 43

Taking a look at the source code, I get the impression that it is not designed to pass labels that aren't Vectors into the C code. Certainly the above error seems to indicate that it is impossible to set a "label" that cannot be converted to a Vector.

Is there any way around this? Does the Python API support this? Thanks.

slundberg commented 7 years ago

See https://github.com/dmlc/xgboost/blob/master/doc/parameter.md and the multi:softprob objective for how a vector output would be handled (as a flattened matrix).

However, a deeper question is what you expect to happen in a gradient boosting regression model with vector output that would be different from running a separate model for each dimension. If you can clarify what you want to be different (other than just easier coding), then it will be easier to see whether XGBoost supports that.
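For concreteness, a minimal sketch with made-up toy data (using the Python API for brevity) of how multi:softprob produces one value per class for each row, i.e. a flattened (n_samples, num_class) output:

import numpy as np
import xgboost as xgb

# Toy data: 100 samples, 5 features, 3 classes.
X = np.random.rand(100, 5)
y = np.random.randint(0, 3, size=100)  # class indices, still a 1-D label

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softprob", "num_class": 3}
bst = xgb.train(params, dtrain, num_boost_round=10)

# The booster emits num_class values per row; the Python wrapper reshapes
# them into an (n_samples, num_class) probability matrix.
probs = bst.predict(xgb.DMatrix(X))
print(probs.shape)  # (100, 3)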


ExpandingMan commented 7 years ago

Thanks for your prompt response.

I don't see any significant problem with using multiple models (as far as I can tell, in the case of gradient-boosted trees this should be exactly equivalent to "one" multi-dimensional model). Of course, one usually doesn't have to resort to this (from an API standpoint), hence the issue. Apart from convenience, I'd be a bit concerned about performance if I were fitting in a high-dimensional output space, but perhaps that's unwarranted.
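For illustration, a minimal sketch with made-up toy data (Python API) of the one-model-per-output-dimension approach being discussed:

import numpy as np
import xgboost as xgb

# Toy data: 100 samples, 5 features, a 3-dimensional regression target.
X = np.random.rand(100, 5)
Y = np.random.rand(100, 3)

params = {"objective": "reg:squarederror", "eta": 0.3, "max_depth": 6}

# Train one independent booster per output dimension.
boosters = [
    xgb.train(params, xgb.DMatrix(X, label=Y[:, j]), num_boost_round=10)
    for j in range(Y.shape[1])
]

# Stitch the per-dimension predictions back into an (n_samples, n_outputs) matrix.
preds = np.column_stack([b.predict(xgb.DMatrix(X)) for b in boosters])
print(preds.shape)  # (100, 3)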

slundberg commented 7 years ago

Deep learning APIs often allow vector output because they share parameters during such multi-task learning. My guess is that, since GBMs don't typically do this, running separate models is the most explicit way of doing it without implying that any parameter sharing is happening. I think Tianqi wrote a paper with Carlos a while back on accounting for certain types of dependence among the output features, so you might also check that out if you want.


mangolzy commented 1 year ago

I have a related question. According to some out-of-date documentation, e.g. https://xgboost.readthedocs.io/en/release_0.72/python/python_api.html ("label ([list] or numpy 1-D array, optional) – Label of the training data"), it seems only a 1-D array is accepted as the label when constructing a DMatrix. But according to the current version, https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.training ("label (array_like) – Label of the training data"), the form of the label is no longer restricted, and we can indeed pass a 2-D array as the label. However, a strange thing comes up: when we use dmatrix.get_label() to look at this 2-D array, it seems the underlying process has flattened it and kept only the first "sample length" elements, like this:

import numpy as np
import pandas as pd
import xgboost as xgb

X = pd.DataFrame(data=[[1, 0], [2, 2], [0, 3], [4, 4]])
y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
dsoft_fake = xgb.DMatrix(X.values, label=y)
dsoft_fake.get_label()

output:

array([1., 2., 3., 4.], dtype=float32)

So my questions are:

  1. If a 2-D array is accepted for the label, how should it be used correctly, under which circumstances, or for solving what kind of problem?
  2. If we do want to set the label of a single sample point as a vector, which can be considered a soft label consisting of probabilities for different classes (>2) that sum up to 1, does xgboost support this feature now? In this case, I don't think a separate model for each dimension is suitable.

Thanks in advance for the explanation.

trivialfis commented 1 year ago

The matrix input for labels is a recent addition (1.6) for multi-output and multi-label; the getter isn't able to return the matrix yet.
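For example, something like the following works with the scikit-learn wrapper (a minimal sketch with made-up toy data, assuming xgboost >= 1.6 and the hist tree method):

import numpy as np
import xgboost as xgb  # assumes xgboost >= 1.6

# Toy data: 100 samples, 5 features, and a 2-D target with three outputs per sample.
X = np.random.rand(100, 5)
Y = np.random.rand(100, 3)

# With 1.6+, the scikit-learn interface accepts a 2-D target directly
# and trains a single multi-output regression model.
reg = xgb.XGBRegressor(n_estimators=10, tree_method="hist")
reg.fit(X, Y)
print(reg.predict(X).shape)  # (100, 3)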