s-ravichandran opened this issue 2 years ago
@s-ravichandran Thanks for starting this discussion. FairXGBoost appears to be a feature that would be useful for many users.
If the goal is to enable distributed training (via Dask), I would highly recommend passing the extra MetaInfo as a Dask array or series. When using distributed training, you will not have access to the `MetaTryLoadFloatInfo` method that reads from a separate file.
Example (adapted from `cpu_survival.py`):

```python
dtrain = DaskDMatrix(client, X, label=y, sensitive_feature=s)
```

where `s` is a Dask array or series.
> we could extend this to a setting where XGBoost can privately access the sensitive features (either through cryptographic methods or through differential privacy).

Currently, XGBoost does not offer the ability to access MetaInfo vectors in a secure way. This will be a major effort.
An alternative is to enable distributed training via Dask using a custom objective function. The advantage of this alternative is as follows:
The disadvantage is that, if the GPU algorithm (`gpu_hist`) is selected, custom objectives will cause suboptimal performance.
Hi @hcho3,
Thanks for your suggestions!
I feel that at this point, the advantages you mentioned definitely outweigh the disadvantage of not being able to use the GPU algorithm optimally.
To make sure I got it right, I'll list the changes that are needed. Please let me know if I'm missing something.
- Add `sensitive_features` to MetaInfo
- Accept `sensitive_features` as a keyword argument for `DaskDMatrix` and `DMatrix`
- Allow setting `sensitive_features` through the `core.set_info()` method
- Pass an `obj` argument to the train method

The steps look reasonable. Let me check how feasible it is to get a custom objective working in Dask. There isn't any demo or example for this use case, so we should add one.
@s-ravichandran Keep in mind that putting the sensitive feature into a MetaInfo field will require us to keep the sensitive feature in memory. If this is not desired, your custom objective will need to fetch the sensitive feature from another source on the fly, without storing it in a DMatrix.
> The steps look reasonable. Let me check how feasible it is to get a custom objective working in Dask. There isn't any demo or example for this use case, so we should add one.
Yes, I just did a quick check of the code, and it looks like `DaskDMatrix` and `DMatrix` use the same internal method to train, which takes `obj` as a parameter. But I haven't checked through an implementation yet. I will try to make `custom_objective.py` work for Dask arrays.
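To make that concrete, here is a minimal sketch of the shared objective signature. The function names and the commented-out Dask wiring below are my own illustration (untested against a real cluster), not code from the XGBoost demos:

```python
import numpy as np

def logistic_obj(predt: np.ndarray, labels: np.ndarray):
    """Gradient and hessian of binary cross-entropy on raw margin scores."""
    p = 1.0 / (1.0 + np.exp(-predt))  # sigmoid of the raw prediction
    grad = p - labels
    hess = p * (1.0 - p)
    return grad, hess

# Hypothetical Dask wiring (assumes a running dask.distributed Client and a
# DaskDMatrix `dtrain`); xgb.dask.train accepts the same `obj` callback
# shape as xgb.train:
# def obj(predt, dtrain):
#     return logistic_obj(predt, dtrain.get_label())
# output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain, obj=obj)
```

The key point is that a working single-node objective should carry over, since Dask invokes the callback on each worker's local partition.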
> @s-ravichandran Keep in mind that putting the sensitive feature into a MetaInfo field will require us to keep the sensitive feature in memory. If this is not desired, your custom objective will need to fetch the sensitive feature from another source on the fly, without storing it in a DMatrix.
Yes, given that our objective function is of the form `CE(yhat, y) + epsilon * CE(yhat, s)`, I figured it'd be desirable to hold `s` in memory as well. Or were you thinking of any other reason why it shouldn't be in memory?
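If both terms are taken as binary cross-entropy over the sigmoid of the raw margin `f`, the per-term gradients simply add, since `d/df CE(sigmoid(f), t) = sigmoid(f) - t`. A sketch of the resulting gradient/hessian, with the caveat that the exact regularizer sign and form should be checked against the FairXGBoost paper:

```python
import numpy as np

def fair_obj(predt, y, s, epsilon=0.1):
    """Illustrative grad/hess for CE(yhat, y) + epsilon * CE(yhat, s).

    Assumes `predt` holds raw margins and both CE terms use the sigmoid link;
    `epsilon=0.1` is a made-up default, not a value from the paper.
    """
    p = 1.0 / (1.0 + np.exp(-predt))        # yhat = sigmoid(margin)
    grad = (p - y) + epsilon * (p - s)      # gradients of the two CE terms add
    hess = (1.0 + epsilon) * p * (1.0 - p)  # hessians add the same way
    return grad, hess
```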
@s-ravichandran Got it. So it's perfectly fine to hold `s` in memory.
Thanks for opening the issue. Dask does support custom objectives, and we have tests for it; https://developer.nvidia.com/blog/accelerating-xgboost-on-gpu-clusters-with-dask/ might also be helpful. The issue mostly comes from how to distribute the dask collection to each worker consistently. For existing fields, XGBoost handles this internally by extracting the partitions out for each input. The code is in `DaskDMatrix` instead of `DMatrix`.
Following the thread, my understanding is that there are 2 parts of input predictors (X_0, X_1), where one of them is considered sensitive. Both predictors are matrices instead of vectors, since a plural form is used in the term `sensitive_features` (but a single `s_i` is used in the paper; I assume there can be multiple regularizers?). Since the data has to be in memory, users can concatenate them into one predictor `X` before training; the rest would be the same as normal data. The custom objective can distinguish the features based on column index:
```python
# pseudo code, untested.
import dask.dataframe as dd
import numpy as np
import xgboost as xgb
from xgboost.dask import DaskDMatrix

X_normal = dd.read_parquet()
X_sensitive = dd.read_parquet()
y = dd.read_parquet()

X = dd.merge(X_normal, X_sensitive)
feature_marker = X_normal.shape[1]

def is_sensitive(column_index: int) -> bool:
    return column_index >= feature_marker

# `client` is a dask.distributed Client.
Xy = DaskDMatrix(client, X, label=y)

def custom_obj_for_dask(predt: np.ndarray, dtrain: xgb.DMatrix):
    # Calculate the gradient; dtrain contains all the needed features, and we
    # use `feature_marker` to distinguish the 2 different predictor matrices.
    grad, hess = ...
    return grad, hess
```
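A numpy-only illustration of the column-index split (my own sketch; it assumes the objective can see the raw feature matrix through a closure, since `DMatrix` does not hand its contents back to Python):

```python
import numpy as np

rng = np.random.default_rng(0)
X_normal = rng.random((8, 3))                           # ordinary predictors
X_sensitive = rng.integers(0, 2, (8, 1)).astype(float)  # sensitive column(s)
X = np.hstack([X_normal, X_sensitive])                  # concatenated before training
feature_marker = X_normal.shape[1]

def split(X: np.ndarray):
    """Recover the two predictor blocks from the concatenated matrix."""
    return X[:, :feature_marker], X[:, feature_marker:]

normal, sensitive = split(X)
```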
This is a bit more general but requires XGBoost to expose the content of `DMatrix`. There was a PR for converting `DMatrix` into a CSR matrix before, but it stalled. Also, the conversion might not be efficient without some caching. Lastly, a quantized DMatrix cannot be easily used.
If we indeed want to implement this in C++, we can think about how to design the `feature_marker` for maximum generality, for instance as a mask with the length of `n_features`.
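Such a mask could mirror how `feature_weights` is shaped, marking arbitrary (not necessarily contiguous) columns as sensitive. A hypothetical sketch:

```python
import numpy as np

n_features = 5
sensitive_mask = np.zeros(n_features, dtype=bool)  # one flag per feature
sensitive_mask[[1, 4]] = True                      # e.g. columns 1 and 4 are sensitive

def is_sensitive(column_index: int) -> bool:
    return bool(sensitive_mask[column_index])
```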
From a personal perspective, I would love to have this feature in XGBoost. Some general thoughts:

- The sensitive features can be kept inside `X`. The extra field in MetaInfo need only be a feature-wise marker/mask like the feature weight, instead of a full-blown matrix. ~~(actually, can we reuse the feature weight?)~~
Hi XGBoost community!
I'd like to add the implementation for our paper FairXGBoost: Fairness-aware Classification in XGBoost by way of a custom objective function. I'd like to add the functionality through an additional `fair_classification_obj.cc` file, in order to provide support for distributed execution as well (which, AFAIK, is not possible with the Python custom objective implementation).

The catch here is that the custom objective function requires access to the sensitive feature (such as race/gender). I'd like to start out with the straightforward implementation where I add an extra field to the MetaInfo class and subsequently read the values into it through either the CSV parser (for CSV files) or through the `MetaTryLoadFloatInfo` method, with a separate file containing the sensitive features. I've tested the latter method and it seems to work fine (until and unless you add arguments to the file name, such as `agaricus.txt.train?indexing_mode=1`). The former method, involving the CSV parser, seems to require changes to the parser (which would end up changing DMLC-Core); not sure if that is something that is acceptable.

At this point, I am not concerned about the privacy of the sensitive features. However, I wanted to check with the community if there is a way to design this so that, in the future, we could extend this to a setting where XGBoost can privately access the sensitive features (either through cryptographic methods or through differential privacy).
TLDR:
Thanks!