dmlc / XGBoost.jl

XGBoost Julia Package
Other
288 stars 110 forks source link

DMatrix is now an AbstractMatrix #136

Closed ExpandingMan closed 1 year ago

ExpandingMan commented 1 year ago

This is added by virtue of the new XGDMatrixGetDataCSR method in libxgboost 1.7 which is now required. Unit tests for DMatrix constructors have been vastly improved as they now check that the matrices were actually constructed correctly rather than merely having the same shape.

The behavior I'm seeing in the presence of null values doesn't make sense to me, so this issue is going to have to be resolved before this gets merged.

ExpandingMan commented 1 year ago

I'm admittedly still somewhat confused here, so confirmation of what I'm about to say would be appreciated.

Following the discussion here, it seems that our existing sparse matrix DMatrix constructor was invalid because, while the Julia SparseMatrixCSC necessarily uses 0 as a default value, libxgboost effectively considers these values to be missing.

Furthermore, I've had to "hack" the getindex(::SparseMatrixCSR, ::Any) method to return a missing instead of a 0 as a default value. This allows for the possibility that if a user sets their own missing_value argument (as opposed to the default I've provided) they may construct a DMatrix with distinct values from the matrix they used to construct it. This does not seem ideal. The only alternatives to this I can think of would be:

  1. Do nothing other than add a documentation warning that if users set missing_value this can happen.
  2. Disallow users to set missing_value.
  3. Have the Julia DMatrix struct store a settable default value.

Of these I favor 1 as it seems like by far the simplest approach and does not compromise the utility of the DMatrix object. At this point I'm not even sure 3 is valid, I think probably not as it seems increasingly to me like libxgboost considers those values missing.

cc @ablaom @trivialfis

ExpandingMan commented 1 year ago

I can't really say how necessary it is, in most cases whatever data was used to construct the DMatrix is probably still around somewhere since the DMatrix is acting very much like an internal libxgboost library object. Indeed, in the Julia wrapper it is not necessary to pass a DMatrix anywhere, it will in all cases be automatically constructed from the provided AbstractMatrix.

My goal for this PR was to ensure that the DMatrix at least has correct values. Therefore, my latest changes in which I "hacked" the SparseMatrixCSR return to give missing instead of 0 seems to be correct.

I'll clean up docs and stuff and then this should be ready to merge, thanks for the clarification @trivialfis .

ExpandingMan commented 1 year ago

This PR should now be complete. I'll wait approximately 24 hours to merge it to give people a chance to object or comment.