adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai
Apache License 2.0
5.08k stars 875 forks source link

Implementing XGBoost is impossible with NumpyClient #969

Open hugolytics opened 2 years ago

hugolytics commented 2 years ago

Hi Y'all,

This project is really cool, but it should also support decision trees due to their popularity and applications. I would like to contribute a working XGboost implementation. However, this is not possible with the numpy client and FedAvg strategy, since the model learned by the decision tree boosters in XGboost, lightGBM, and other frameworks cannot be represented by a fixed array of coefficients, and therefore cannot be communicated in terms of numpy arrays.

There are two options w.r.t communicating the learned model

It is possible to store the decision rules simply as raw bytes and send these bytes through the gRPC protocol. One can extract the learned booster and convert this to a dataframe representation, however this is extremely impractical due to the size.

However, federated computation of the coefficients is not a trivial task, as the model structure is not fixed one cannot simply average the learned model coefficients. Instead, aggegration needs to be performed at several steps in the learning process, this is partly reflected in the distributed version of XGboost implemented on Dask.

This paper describes a federated approach to XGboost, would that implementation be feasible on flower?

I would like to contribute a working XGboost implementation on Flower, but it should also fit within the scope of the library and the abstractions provided in it.

sisco0 commented 2 years ago

It would be good to know why we could not go from the DMatrix XGBoost custom data format to Numpy ndarray. Would the https://github.com/aporia-ai/dmatrix2np library solve this issue and would ease the XGBoost integration into our stack?

hugolytics commented 2 years ago

Converting the DMatrix to numpy ndarray would not solve the problem because the data is stored in a DMatrix object it does not contain the trained model, in a federated learning setting the data is not passed around the nodes but the trained model coefficients are.

To illustrate what needs to be communicated between nodes consider the following example code:

from sklearn.datasets import load_iris
from xgboost import XGBClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target 
# print(X.describe())
clf = XGBClassifier(booster='gbtree')
clf.fit(X,y) 

A trained decision tree does not yield a fixed coefficients matrix, instead, it yields a set of decision rules. After calling the fit method, the clf object contains the trained xgboost model. The xgboost API exposes two methods to access this trained model for saving and loading:

  1. The booster object itself:
booster = clf.get_booster()

This yields a xgboost.core.Booster object Which can be saved two ways:

The XGboost API does not expose a method to convert this dataframe representation back to a booster model that can be loaded again. It however does yield a matrix of fixed with 11 columns: Tree Node ID Feature ... Missing Gain Cover Category, and many rows (in this small example 1250 rows (even though X is 150 rows x 4 columns). This number of rows is problematic since its encoding is extremely inefficient (it is not meant to be efficient, its purpose is for illustration) and it explodes with larger X and more trees. But it would make developing a strategy easier.

The raw booster however can be directly loaded and used to predict.

clf.load_model(raw_booster)
preds = clf.predict(X)

But the raw booster is a bytearray object, difficult to parse. It is not straightforward to develop a strategy for learning the rules in a federated setting using this object, but it is the most efficient format for communicating the current model between the nodes.

  1. The save_model function: This method can dump a learned xgboost model to disk in a JSON format, we can [capture this model in memory](https://github.com/dmlc/xgboost/blob/master/tests/python/test_basic_models.py#:~:text=cls.get_booster().best_ntree_limit-,with%20tempfile.TemporaryDirectory()%20as%20tmpdir%3A,-path%20%3D%20os.path) as follows:
import os, tempfile,json

def get_model(model:XGBClassifier)->dict:

  with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.json")
    model.save_model(path)
    with open(path, "r") as fd:
        return json.load(fd)

model = get_model(clf)

The trained xgboost model is serialized according to a JSON schema and yields a dictionary. This model dictionary contains the version of xgboost and the learner. The information about the learner (model['learner']) is specified in terms of it's 'attributes', 'feature_names', 'feature_types', 'gradient_booster', 'learner_model_param', 'objective'.

model['learner']['gradient_booster'] contains the information that the xgboost classifier actually learns over the coarse of the training rounds. This is model contained in model['learner']['gradient_booster']['model'] in terms of 'gbtree_model_param', 'tree_info', 'trees'. The 'trees' describe the learned decision rules in the booster.

pd.DataFrame(model['learner']['gradient_booster']['model']['trees'])

Yields 300 rows x 16 columns The columns are 'base_weights', 'categories', 'categories_nodes', 'categories_segments', 'categories_sizes', 'default_left', 'id', 'left_children', 'loss_changes', 'parents', 'right_children', 'split_conditions', 'split_indices', 'split_type', 'sum_hessian', 'tree_param'

This model JSON file can also be loaded as a model. So it seems to me this is the most logical format to implement federated learning with. However, its content cannot be communicated in terms of just NumPy arrays since it is a dictionary. In addition, training a booster model is not as simple as passing around a matrix of model coefficients and aggregating them. It requires decision rules and split criteria to be communicated at several points in a single training round. As mentioned in this paper about a federated approach to XGboost.

So in order to implement XGboost on flower, the communication protocol needs to accommodate JSON (or some other format in which the decision rules can be expressed), as such the NumpyClient does not suffice. Secondly, more information than just a set of coefficients needs to be communicated (as illustrated in the aforementioned paper), since the proposed split criteria need to be evaluated on the nodes' datasets for building the tree. They propose the following algorithm: Screenshot 2021-12-30 at 15 57 58

This could be accommodated, with minimal changes in the API of XGBoost and flower, by implementing a new Strategy for flower and using the Callbacks functionality of Xgboost. The Callback functionality allows one to intercept model parameters during the training round, this could be communicated through JSON or bytearray.

Any suggestions on how to move forward?

sisco0 commented 2 years ago

It is certain that XGBoost bases itself on using JSON instead of raw bytes for import-export capabilities of the models when using the C bindings as shown in their repository. This could lead to a better integration with C clients for Flower.

Storing the decision rules Booster class instance as raw bytes and transferring a compressed version of it by means of a compressed message using gRPC API is could be even more efficient in communication times. The number of rounds run at client side could be updated for reducing these delays in the overall training timings. For getting that compression to work we have these functions available: grpc::ClientContext::set_compression_algorithm and associated compression levels could be used at the client side while grpc::ServerBuilder::SetDefaultCompressionAlgorithm could be used at the server side. But this would not be compatible with the C bindings operations from the official XGBoost repository, so if we would go on with C-based clients we could not move on this topic (only Python clients would be allowed).

I think it is a good thing that we would use JSON instead of the raw bytes Booster object taking into account a wider range of client implementation capabilities, for sure we should care about the compression used in the communication for this purpose. Maybe the JSON object could be transformed into a Numpy array at first instance for using the current communication capabilities by just using np.asarray(jsonObject). What do you think? Would this be possible?

danieljanes commented 2 years ago

Thanks for the question @hugolytics and thanks for the idea @sisco0!

First off, I need to start with a disclaimer: I'm not familiar with XGBoost and I don't have the bandwidth right now to go through the paper. That being said, there are many details in the previous message that I can comment on.

On the lowest level, Flower communicates byte arrays between server and clients. This means that as long as you can serialize/save your model to (and deserialize/restore your model from) a set of byte arrays, Flower can send/receive your model.

Users new to Flower usually encounter NumPyClient first - which is intentional because it's the easiest way to get started if your framework can provides good support for NumPy (which many frameworks do). It's important to note however that NumPyClient is just a convenience abstraction built on top of the lower-level Client class. Client works in a very similar way to NumPyClient, but it requires you to write a bit more boilerplate, namely for serialization and deserialization of your model.

So the short answer is, yes, you can send/receive byte arrays, and you could also send/receive the JSON representation if you serialize it to a byte array (another way would be to misappropriate the config dict to directly send/receive JSON strings, but it'd be a bit hacky). To handle these values on the server-side, one would have to either customize an existing strategy or implement a strategy from scratch.

The second question tackled communication rounds and aggregation. The currently exposed Flower API's take a round-based perspective, so the server (via the Strategy) performs the following steps repeatedly:

  1. Ask the strategy to configure a set of clients via configure_fit (i.e., send model parameters and configuration values to the clients, can be different for each client)
  2. Wait for clients to return their results
  3. Ask the strategy to aggregate the results (via aggregate_fit)
  4. Repeat for evaluation (configure_evaluate/aggregate_evaluate)

If the intended federation approach can be mapped to this process, then it shouldn't be a huge effort to implement it in Flower. Happy to provide more details on any of the previously mentioned topics, feel free to open a PR even if it's not fully working yet, perhaps we can help to get it across the finishing line. Would be great to have an XGBoost example for Flower!

sisco0 commented 2 years ago

Maybe the road to go is to create an examples/quickstart_xgboost/client.py file similar to the examples/embedded_devices/client.py file which already contains a client based on fl.client.Client. Added to that, we could add a new strategy at src/py/flwr/server/strategy/fedxgb.py (with its corresponding _test.py file) basing itself on the src/py/flwr/server/strategy/fedavg.py structure. After that, we could go on and create the examples/quickstart_xgboost/server.py file with README.md and get that into a Poetry project as in examples/pytorch_federated_variational_autoencoder.

hugocool commented 2 years ago

Okay,

So if I understand correctly using the JSON schema would be the preferred communication medium for trained model params. A new strategy should support different boosting libraries, not just XGboost, but also LightGBM, Catboost and others.

So my proposal would be that I open a PR, and in accordance with @sisco0 suggestion

Then I'll develop a strategy, preferably fedgbm (gbm stands for gradient boosting machine) based on the src/py/flwr/server/strategy/fedavg.py structure, in order to provide general support to boosting libraries, but ill focus on xgboost first.

I think at that moment we should regroup and consider the API, whether we're providing the right level of abstraction while also considering how will we handle supporting other boosting libraries, two-step models (such as double ML, treatment effect boosting, survival analysis, etc), crossvalidated hyperparameter tuning, etc.

sisco0 commented 2 years ago

Your proposal looks good to me. Please @danieljanes, could you take a look and double check the proposed file structure and workflow in its correctness? 😀