dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Data loss converting from scipy.sparse.csr_matrix to xgboost.DMatrix #10532

Open nickwood opened 3 months ago

nickwood commented 3 months ago

scipy CSR matrices have an implied value of zero for all unspecified entries. However, this isn't respected when converting to a DMatrix, e.g.:

from scipy.sparse import csr_matrix
from xgboost import DMatrix

raw_data = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0]
]
sp_csr = csr_matrix(raw_data)
dmatrix = DMatrix(raw_data)

print(dmatrix.get_data())
#   Coords        Values
#   (0, 0)        1.0
#   (0, 1)        0.0
#   (0, 2)        2.0
#   (1, 0)        0.0
#   (1, 1)        3.0
#   (1, 2)        0.0
#   (2, 0)        4.0
#   (2, 1)        0.0
#   (2, 2)        0.0

print(DMatrix(sp_csr).get_data())
#   Coords        Values
#   (0, 0)        1.0
#   (0, 2)        2.0
#   (1, 1)        3.0
#   (2, 0)        4.0

I would expect the two 'print' statements to return equivalent results, but the zeroes are instead being treated as missing values.

hcho3 commented 3 months ago

By default, DMatrix uses np.nan to indicate missing values. You can pass missing=0 to override this behavior.

from scipy.sparse import csr_matrix
from xgboost import DMatrix

raw_data = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0]
]
sp_csr = csr_matrix(raw_data)
dmatrix = DMatrix(raw_data, missing=0)

# Both will print identical output
print(dmatrix.get_data())
print(DMatrix(sp_csr).get_data())

trivialfis commented 3 months ago

Please note that 0 is perfectly valid data for decision trees. Scipy and many other CSR implementations consider 0 as missing.

I'm considering whether it's reasonable to restore these 0s in QuantileDMatrix (but not in DMatrix, where it might be too expensive) to support existing sparse matrix implementations.
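To see the CSR side of this concretely, here is a quick scipy-only check (variable names are illustrative) showing that zeros are simply not present in the CSR storage buffers, so any consumer of the raw buffers never sees them:

```python
import numpy as np
from scipy.sparse import csr_matrix

raw = [
    [1, 0, 2],
    [0, 3, 0],
    [4, 0, 0],
]
m = csr_matrix(raw)

# Only the four non-zero values are stored; the five zeros exist
# only implicitly and never appear in the data buffer.
print(m.nnz)   # 4
print(m.data)  # [1 2 3 4]
```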

nickwood commented 3 months ago

These values aren't missing, however; they are implicitly zero: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix

Consider instead a situation where our input contains both NaNs and zeroes. These mean very different things in our input data. However, by converting to a DMatrix (in order to train a classifier, for example), we lose the distinction between the two:

import numpy as np

from scipy.sparse import csr_matrix
from xgboost import DMatrix

raw_data = [
    [1, 0, 2],
    [0, 3, np.nan],
    [4, 0, np.nan]
]

sp_csr = csr_matrix(raw_data)
dmatrix = DMatrix(raw_data)

print(dmatrix.get_data())
#   Coords        Values
#   (0, 0)        1.0
#   (0, 1)        0.0
#   (0, 2)        2.0
#   (1, 0)        0.0
#   (1, 1)        3.0
#   (2, 0)        4.0
#   (2, 1)        0.0
# This output is as I'd expect

print(DMatrix(sp_csr).get_data())

#   Coords        Values
#   (0, 0)        1.0
#   (0, 2)        2.0
#   (1, 1)        3.0
#   (2, 0)        4.0
# By converting from CSR we have no way of knowing which gaps are 'zero' and which are 'NaN'

Note also that specifying a missing parameter doesn't help us - the outcome is the same:

print(DMatrix(sp_csr, missing=0).get_data())

#   Coords        Values
#   (0, 0)        1.0
#   (0, 2)        2.0
#   (1, 1)        3.0
#   (2, 0)        4.0

hcho3 commented 3 months ago

@nickwood This is a limitation of XGBoost. For now, you will need to express the data as a dense matrix to retain the difference between NaN and (non-missing) 0.
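As a sketch of that workaround (reusing the example data from above): scipy's toarray() materialises the implicit zeros as explicit 0.0 entries while carrying the stored NaNs over unchanged, so the distinction survives the round trip to a dense array:

```python
import numpy as np
from scipy.sparse import csr_matrix

raw_data = [
    [1, 0, 2],
    [0, 3, np.nan],
    [4, 0, np.nan],
]
sp_csr = csr_matrix(raw_data)

# toarray() writes an explicit 0.0 for every implicit zero, while the
# stored NaNs are copied over unchanged, so the two remain distinct.
dense = sp_csr.toarray()

# Passing `dense` to DMatrix (with the default missing=np.nan) would
# then keep the zeros as real data and treat only the NaNs as missing.
```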

@trivialfis It might indeed be useful to add an option to restore 0 as a non-missing value inside the DMatrix.