dmlc / XGBoost.jl

XGBoost Julia Package

problem with XGBoost.predict on booster created with multi:softprob #143

Closed bobaronoff closed 1 year ago

bobaronoff commented 1 year ago

Am modeling the Framingham Heart Study dataset from Kaggle. The dataset has been prepared for XGBoost, and the model was trained with the objective set to multi:softprob. When I execute XGBoost.predict against the training data, the probability matrix returned has extremely poor accuracy. I trained the identical processed dataset using xgboost in R. The training-evaluation mlogloss for the two approaches is nearly identical, yet when I predict the probability matrix against the training data in R, I get a completely different set of matrix values, and the accuracy of the R predictions is very high. Is it possible there is an error in XGBoost.predict for the multi:softprob objective? It does not appear to be an issue with the reshaping of the data. Does XGBoost.predict differentiate between output from multi:softmax and multi:softprob? I am just trying to imagine where the issue could be.
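For context, a minimal sketch of this kind of setup, using toy data in place of the actual dataset (the hyperparameter values are placeholders, and the tuple form of passing features and labels is assumed from the XGBoost.jl API):

using XGBoost

X = rand(100, 4)                 # stand-in feature matrix
y = rand(0:4, 100)               # stand-in labels for a 5-class problem
bst = xgboost((X, y); num_round=20, objective="multi:softprob", num_class=5)
probs = XGBoost.predict(bst, X)  # expected: one row per sample, one column per class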

I am happy to send my processed dataset as well as code I've used in Julia and R, just let me know where/how to send.

Thank you.

bobaronoff commented 1 year ago

Slight correction to the comment above: the dataset is the Cleveland Heart Disease Dataset on Kaggle (and not the Framingham Dataset). see here

bobaronoff commented 1 year ago

I may have discovered a clue to this issue. Reviewing the xgboost package in R, I see that the data returned by multi:softprob is reshaped with the parameter byrow=TRUE. The XGBoost.predict function reshapes the return by column (the Julia default).
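To see why the orientation matters, here is a small self-contained illustration (not from the thread) of how a row-major buffer gets scrambled by Julia's column-first reshape:

buf = [11, 12, 13, 21, 22, 23]    # row-major buffer: two rows of three class scores

reshape(buf, 2, 3)                # column-major fill -> scrambled
# 2×3 Matrix{Int64}:
#  11  13  22
#  12  21  23

permutedims(reshape(buf, 3, 2))   # fill as 3×2, then transpose -> correct 2×3
# 2×3 Matrix{Int64}:
#  11  12  13
#  21  22  23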

These are samples of the returned data.
First, R's xgboost:

         [,1]        [,2]        [,3]        [,4]        [,5]
  [1,] 0.74844402 0.076934353 0.091703311 0.045024108 0.037894208
  [2,] 0.04887400 0.119278572 0.402187079 0.258290142 0.171370193
  [3,] 0.03746652 0.363052756 0.201905355 0.269393027 0.128182322
  [4,] 0.74403465 0.082675613 0.036393773 0.054169908 0.082726091
  [5,] 0.97219545 0.012219272 0.004445153 0.007631002 0.003509163
  [6,] 0.95536584 0.025309743 0.007068976 0.007578034 0.004677375
  [7,] 0.18239844 0.085472584 0.108911745 0.526450872 0.096766345
  [8,] 0.79472923 0.120739311 0.035574537 0.031019097 0.017937779
  [9,] 0.04981138 0.262105495 0.613966107 0.052373748 0.021743242
 [10,] 0.14387728 0.431739688 0.183848411 0.166522682 0.074011944
 [11,] 0.69223869 0.154587910 0.121800318 0.019663321 0.011709797
 [12,] 0.87882787 0.057353288 0.018981079 0.030814752 0.014022978
 [13,] 0.24337377 0.244023025 0.416107476 0.068944089 0.027551638
 [14,] 0.87830919 0.084951192 0.011760626 0.017202126 0.007776847
 [15,] 0.87809825 0.077949531 0.014065672 0.019527804 0.010358773

And here is Julia's XGBoost.jl:

 0.73418    0.455742    0.0992902  0.0310137   0.86518
 0.0806226  0.471391    0.125122   0.141359    0.075895
 0.0939127  0.0309978   0.508063   0.327048    0.0209923
 0.0537244  0.022134    0.195827   0.45404     0.0272531
 0.0375606  0.0197343   0.0716976  0.046539    0.0106794
 0.0474781  0.876965    0.147057   0.04851     0.0322598
 0.114875   0.0697099   0.0898418  0.497638    0.101635
 0.367417   0.0190271   0.0882298  0.278397    0.415749
 0.265516   0.025858    0.564595   0.142559    0.350332
 0.204714   0.00843995  0.110276   0.0328961   0.100024
 0.0397737  0.829053    0.0482143  0.750835    0.386893
 0.350956   0.0975527   0.565365   0.125423    0.250349
 0.192763   0.0195801   0.13423    0.0732187   0.162975
 0.289934   0.0415913   0.223481   0.0358532   0.156164
 0.126573   0.012223    0.0287095  0.0146702   0.0436192

If you read down the first column of the Julia data, it correlates with reading across the rows of the R data. As I stated initially, the R data seems to be the correct orientation. If this is agreed, then the reshaping done in XGBoost.predict may need to be revised.

bobaronoff commented 1 year ago

I am even more convinced that the output from XGBoost.predict is mis-shaped. After reshaping the output, I explicitly calculated the mlogloss, and it matches the value reported during training.
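For reference, a check along those lines might look like this sketch (not the poster's code), where probs stands for the corrected n×k probability matrix and y for 1-based integer labels:

# multiclass log loss: mean negative log of the probability assigned to the true class
# probs: n×k matrix with rows summing to 1; y: labels in 1:k
mlogloss(probs, y) = -sum(log(probs[i, y[i]]) for i in eachindex(y)) / length(y)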

To reshape the output, I used the following:

# p2 is the matrix returned by XGBoost.predict
p2correct = transpose(reshape(reshape(p2, size(p2)[1]*size(p2)[2]), (size(p2)[2], size(p2)[1])))

If anyone knows a simpler way, I would love to hear it.
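For the record, one shorter equivalent: vec(p2) already recovers the flat row-major buffer, so the whole correction collapses to a single permutedims:

# equivalent one-step fix: rebuild as k×n from the flat buffer, then flip to n×k
p2correct = permutedims(reshape(vec(p2), size(p2, 2), size(p2, 1)))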

ExpandingMan commented 1 year ago

Yikes 😬

The issue is that I failed to translate the row-major output of libxgboost into the column-major arrays of Julia, resulting in a scrambled matrix. I missed this because there were no tests for it originally, and I tend to take regression a lot more seriously than classification, so I guess I never actually checked this.
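Conceptually, the fix is the transposed reshape sketched below (the helper name softprob_matrix is hypothetical; this shows the idea, not the exact code merged in #144):

# libxgboost returns multiclass predictions as one flat, row-major vector of
# length n*num_class; filling a num_class×n matrix column-first and then
# transposing yields the correct n×num_class orientation
function softprob_matrix(flat::AbstractVector{<:Real}, n::Integer, num_class::Integer)
    permutedims(reshape(flat, num_class, n))
end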

This is fixed in #144, which also adds unit tests.

I'm going to merge and tag this soon, since it's a pretty horrific issue.

ExpandingMan commented 1 year ago

This is fixed on master. The fix will land in the registry as soon as https://github.com/JuliaRegistries/General/pull/74265 is merged.