Open nickwood opened 3 months ago
By default, DMatrix uses np.nan to indicate missing values. You can pass missing=0 to override this behavior.
from scipy.sparse import csr_matrix
from xgboost import DMatrix
raw_data = [
[1, 0, 2],
[0, 3, 0],
[4, 0, 0]
]
sp_csr = csr_matrix(raw_data)
dmatrix = DMatrix(raw_data, missing=0)
# will print the identical output
print(dmatrix.get_data())
print(DMatrix(sp_csr).get_data())
Please note that 0 is perfectly valid data for decision trees. SciPy and many other CSR implementations consider 0 as missing.
I'm considering whether it's reasonable to restore these 0s in QuantileDMatrix (but not in DMatrix, where it might be too expensive) to support existing sparse matrix implementations.
These values aren't missing however - they are implicitly zero: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
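To make the "implicitly zero" point concrete, here is a small sketch using only SciPy and NumPy (the variable names are illustrative). It shows that a CSR matrix physically stores only the nonzero entries, yet reading an unstored position or densifying the matrix yields 0 rather than NaN.

```python
import numpy as np
from scipy.sparse import csr_matrix

raw_data = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0],
]
sp_csr = csr_matrix(raw_data)

# Only the nonzero entries are physically stored.
stored_count = sp_csr.nnz

# Reading a position that is not stored returns 0, not NaN.
implicit_value = sp_csr[0, 1]

# Densifying restores every implicit entry as an explicit 0.
dense = sp_csr.toarray()
has_nan = bool(np.isnan(dense.astype(float)).any())

print(stored_count, implicit_value, has_nan)
print(dense)
```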
Consider instead a situation where our input contains both NaNs and zeroes. These mean very different things in our input data. However, by converting to a DMatrix (in order to train a classifier, for example) we lose the distinction between the two:
import numpy as np
from scipy.sparse import csr_matrix
from xgboost import DMatrix
raw_data = [
[1, 0, 2],
[0, 3, np.nan],
[4, 0, np.nan]
]
sp_csr = csr_matrix(raw_data)
dmatrix = DMatrix(raw_data)
print(dmatrix.get_data())
# Coords Values
# (0, 0) 1.0
# (0, 1) 0.0
# (0, 2) 2.0
# (1, 0) 0.0
# (1, 1) 3.0
# (2, 0) 4.0
# (2, 1) 0.0
# This output is as I'd expect
print(DMatrix(sp_csr).get_data())
# Coords Values
# (0, 0) 1.0
# (0, 2) 2.0
# (1, 1) 3.0
# (2, 0) 4.0
# By converting from CSR we have no way of knowing which gaps are 'zero' and which are 'NaN'
Note also that specifying a missing parameter doesn't help us - the outcome is the same:
print(DMatrix(sp_csr, missing=0).get_data())
# Coords Values
# (0, 0) 1.0
# (0, 2) 2.0
# (1, 1) 3.0
# (2, 0) 4.0
@nickwood This is a limitation of XGBoost. For now, you will need to express the data as a dense matrix to retain the difference between NaN and (non-missing) 0.
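A minimal sketch of the dense workaround described above, assuming the matrix fits in memory once densified. The helper names densify and build_dmatrix are hypothetical and not part of XGBoost or SciPy. The idea is that toarray() fills implicit entries with 0 while explicitly stored NaNs survive, so the default missing=np.nan can then tell the two apart.

```python
import numpy as np
from scipy.sparse import csr_matrix


def densify(sp_matrix):
    """Return a dense float array: implicit entries become 0.0, stored NaNs stay NaN."""
    return np.asarray(sp_matrix.toarray(), dtype=float)


def build_dmatrix(sp_matrix):
    """Hypothetical helper: build a DMatrix from the densified data."""
    from xgboost import DMatrix  # imported lazily so densify() works without xgboost

    # With the default missing=np.nan, only the NaN cells are treated as missing.
    return DMatrix(densify(sp_matrix))


raw_data = [
    [1, 0, 2],
    [0, 3, np.nan],
    [4, 0, np.nan],
]
sp_csr = csr_matrix(raw_data)
dense = densify(sp_csr)
print(dense)
```

Calling build_dmatrix(sp_csr) should then behave like the DMatrix(raw_data) case earlier in the thread, at the cost of materializing every cell, which may be impractical for large, very sparse inputs.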
@trivialfis It might indeed be useful to create an option where 0 is restored as a non-missing value inside the DMatrix.
SciPy CSR matrices have an implied value of zero for all unspecified entries. However, this isn't respected when converting to a DMatrix, i.e.:
I would expect the two 'print' statements to return equivalent results, but the zeroes are instead being treated as missing values.