dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Fairness-aware classification with XGBoost #7282

Status: Open. s-ravichandran opened this issue 2 years ago.

s-ravichandran commented 2 years ago

Hi XGBoost community!

I'd like to contribute the implementation from our paper FairXGBoost: Fairness-aware Classification in XGBoost by way of a custom objective function. I plan to add the functionality through an additional fair_classification_obj.cc file in order to support distributed execution as well (which, AFAIK, is not possible with the Python custom-objective implementation).

The catch here is that the custom objective function requires access to the sensitive feature (such as race or gender). I'd like to start with the straightforward implementation: add an extra field to the MetaInfo class and read the values into it either through the CSV parser (for CSV files) or through the MetaTryLoadFloatInfo method, with a separate file containing the sensitive features. I've tested the latter method and it seems to work fine (unless you add arguments to the file name, such as agaricus.txt.train?indexing_mode=1). The former method, involving the CSV parser, seems to require changes to the parser (which would end up changing dmlc-core); I'm not sure whether that is acceptable.
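[Editor's note: for context, MetaTryLoadFloatInfo follows XGBoost's existing sidecar-file convention, where auxiliary float fields such as instance weights are picked up from a file named after the data file. A minimal sketch of that convention, with a hypothetical .sensitive_feature sidecar handled the same way:

import xgboost as xgb

# Existing convention: instance weights are loaded automatically from a
# sidecar file named <data>.weight placed next to the data file, e.g.
#   agaricus.txt.train          <- training data
#   agaricus.txt.train.weight   <- one float per row
dtrain = xgb.DMatrix("agaricus.txt.train")

# The proposal would add a hypothetical <data>.sensitive_feature sidecar,
# read through the same MetaTryLoadFloatInfo machinery.
]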

At this point, I am not concerned about the privacy of the sensitive features. However, I wanted to check with the community whether there is a way to design this so that in the future, we could extend this to a setting where XGBoost can privately access the sensitive features (either through cryptographic methods or through differential privacy).

TLDR:

  1. Want to implement this objective into the repo - what's the best way to store and handle the sensitive features?
  2. If reading sensitive features from CSV files is to be supported, can I make changes to dmlc-core?

Thanks!

hcho3 commented 2 years ago

@s-ravichandran Thanks for starting this discussion. FairXGBoost appears to be a feature that would be useful for many users.

If the goal is to enable distributed training (via Dask), I would highly recommend passing the extra MetaInfo field as a Dask array or series. When using distributed training, you will not have access to the MetaTryLoadFloatInfo method, which reads from a separate file.

Example (adapted from cpu_survival.py):

dtrain = DaskDMatrix(client, X, label=y, sensitive_feature=s)

where s is a Dask array or series.

hcho3 commented 2 years ago

> we could extend this to a setting where XGBoost can privately access the sensitive features (either through cryptographic methods or through differential privacy).

Currently, XGBoost does not offer the ability to access MetaInfo vectors in a secure way. That would be a major effort.

hcho3 commented 2 years ago

An alternative is to enable distributed training via Dask with a custom objective function. The advantages of this alternative are as follows:

  1. Ability to implement secure access methods for sensitive features inside the custom objective function.
  2. No need to modify C++ code at all.

The disadvantage is that, if the GPU algorithm (gpu_hist) is selected, custom objectives will cause suboptimal performance.
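[Editor's note: a minimal sketch of this alternative, assuming a fair_objective function already defined in Python (xgboost.dask.train accepts a custom objective through its obj parameter); client, X, and y are assumed to exist:

import xgboost as xgb
from xgboost.dask import DaskDMatrix

# Sketch only: `client` is a dask.distributed.Client, `X` and `y` are Dask
# collections, and `fair_objective` is the Python custom objective.
dtrain = DaskDMatrix(client, X, label=y)
output = xgb.dask.train(
    client,
    {"tree_method": "hist"},  # gpu_hist would incur the penalty noted above
    dtrain,
    num_boost_round=100,
    obj=fair_objective,
)
booster = output["booster"]
]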

s-ravichandran commented 2 years ago

Hi @hcho3,

Thanks for your suggestions!

I feel that at this point, the advantages you mentioned definitely outweigh the disadvantage of not being able to use the GPU algorithm optimally.

To make sure I've got it right, I'll list the changes that are needed (a rough sketch of the resulting API follows the list). Please let me know if I'm missing something.

  1. Add sensitive_features to MetaInfo
  2. Add sensitive_features as a keyword argument for DaskDMatrix and DMatrix
  3. Add logic to copy sensitive_features through the core.set_info() method
  4. Implement the FairXGBoost objective and pass it as the obj argument to the train method
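[Editor's note: a rough sketch of what steps 1-3 could look like from the user's side; the sensitive_features field and keyword are hypothetical and correspond to the proposed changes above:

import xgboost as xgb
from xgboost.dask import DaskDMatrix

# Hypothetical API from steps 1-3: `sensitive_features` does not exist yet.
# `s` is the sensitive-feature vector (a Dask array or series here).
dtrain = DaskDMatrix(client, X, label=y, sensitive_features=s)

# Internally this would route through core.set_info(), just as `label`
# and `weight` do today, into the new MetaInfo field from step 1.
]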
hcho3 commented 2 years ago

The steps look reasonable. Let me check how feasible it is to get a custom objective working in Dask. There isn't any demo or example for this use case, so we should add one.

hcho3 commented 2 years ago

@s-ravichandran Keep in mind that putting the sensitive feature into a MetaInfo field will require us to keep the sensitive feature in memory. If this is not desired, your custom objective will need to fetch the sensitive feature from another source on the fly, without storing it in a DMatrix.
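[Editor's note: a sketch of that on-the-fly pattern, assuming a hypothetical user-supplied fetch_sensitive callable; the gradient/Hessian expressions anticipate the cross-entropy objective discussed further down the thread:

import numpy as np
import xgboost as xgb

def make_fair_objective(fetch_sensitive, epsilon: float):
    # `fetch_sensitive` is a hypothetical callable mapping a DMatrix to the
    # sensitive-feature vector for its rows; nothing is stored in the DMatrix.
    def fair_objective(predt: np.ndarray, dtrain: xgb.DMatrix):
        s = fetch_sensitive(dtrain)  # fetched on the fly
        y = dtrain.get_label()
        p = 1.0 / (1.0 + np.exp(-predt))  # sigmoid of the raw margin
        grad = (p - y) + epsilon * (p - s)
        hess = (1.0 + epsilon) * p * (1.0 - p)
        return grad, hess
    return fair_objective
]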

s-ravichandran commented 2 years ago

> The steps look reasonable. Let me check how feasible it is to get a custom objective working in Dask. There isn't any demo or example for this use case, so we should add one.

Yes, I just did a quick check of the code, and it looks like DaskDMatrix and DMatrix use the same internal training method, which takes obj as a parameter. But I haven't checked through an implementation yet. I will try to make custom_objective.py work for Dask arrays.

s-ravichandran commented 2 years ago

> @s-ravichandran Keep in mind that putting the sensitive feature into a MetaInfo field will require us to keep the sensitive feature in memory. If this is not desired, your custom objective will need to fetch the sensitive feature from another source on the fly, without storing it in a DMatrix.

Yes, given that our objective function is of the form

CE(yhat, y) + epsilon * CE(yhat, s)

I figured it'd be desirable to hold s in memory as well. Or were you thinking of another reason why it shouldn't be in memory?
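[Editor's note: for reference, a quick derivation — a sketch, assuming both terms are binary cross-entropy applied to the sigmoid of the raw margin m, so yhat = sigma(m). Differentiating the objective above with respect to m gives the per-row gradient and Hessian

$$
g = \bigl(\sigma(m) - y\bigr) + \epsilon \bigl(\sigma(m) - s\bigr), \qquad
h = (1 + \epsilon)\,\sigma(m)\bigl(1 - \sigma(m)\bigr),
$$

which is what a Python custom objective would return for each row.]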

hcho3 commented 2 years ago

@s-ravichandran Got it. So it's perfectly fine to hold s in memory.

trivialfis commented 2 years ago

Thanks for opening the issue. Dask does support custom objectives, and we have tests for it; https://developer.nvidia.com/blog/accelerating-xgboost-on-gpu-clusters-with-dask/ might also be helpful. The issue mostly comes from how to distribute the Dask collection to each worker consistently. For existing fields, XGBoost handles this internally by extracting the partitions for each input; the code is in DaskDMatrix instead of DMatrix.

trivialfis commented 2 years ago

Following the thread, my understanding is that there are 2 parts of the input predictors (X_0, X_1), one of which is considered sensitive. Both predictors are matrices instead of vectors, since a plural form is used in the term sensitive_features (but a single s_i is used in the paper; I assume there can be multiple regularizers?). Since the data has to be in memory, users can concatenate them into one predictor X before training, and the rest would be the same as with normal data. The custom objective can distinguish the features based on column index:

# Pseudo-code, untested.
import dask.dataframe as dd
import numpy as np
import xgboost as xgb
from xgboost.dask import DaskDMatrix

# Load the two predictor matrices and the labels (paths illustrative).
X_normal = dd.read_parquet("normal_features.parquet")
X_sensitive = dd.read_parquet("sensitive_features.parquet")
y = dd.read_parquet("labels.parquet")
# Concatenate the predictors column-wise into a single matrix.
X = dd.concat([X_normal, X_sensitive], axis=1)

# Columns at or past this index belong to the sensitive predictor matrix.
feature_marker = X_normal.shape[1]

def is_sensitive(column_index: int) -> bool:
    return column_index >= feature_marker

Xy = DaskDMatrix(client, X, y)

def custom_obj_for_dask(predt: np.ndarray, dtrain: xgb.DMatrix):
    # Calculate the gradient; dtrain contains all the needed features, and
    # `feature_marker` distinguishes the 2 different predictor matrices.
    grad, hess = ...  # e.g. the expressions derived earlier in the thread
    return grad, hess

This is a bit more general, but it requires XGBoost to expose the content of the DMatrix. There was a PR for converting a DMatrix into a CSR matrix before, but it stalled. Also, the conversion might not be efficient without some caching. Lastly, a quantized DMatrix cannot be easily used.

If we indeed want to implement this in C++, we can think about how to design the feature_marker for maximum generality, for instance as a mask with the length of n_features.
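[Editor's note: a rough sketch of that mask idea in Python (values illustrative; in C++ this would live alongside the other MetaInfo fields):

import numpy as np

# Illustrative only: 6 features total, of which the last 2 are sensitive.
n_features = 6
feature_marker = 4
sensitive_mask = np.zeros(n_features, dtype=bool)
sensitive_mask[feature_marker:] = True

# is_sensitive(j) from the snippet above then reduces to sensitive_mask[j].
]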

trivialfis commented 2 years ago

From a personal perspective, I would love to have this feature in XGBoost. Some general thoughts: