dmlc / xgboost-bench


Add distributed training #2

Open hcho3 opened 4 years ago

thvasilo commented 4 years ago

I could try adding this through the scripts I have available currently.

My setup requires that the user already has AWS credentials set up (through aws-cli or as environment variables, I think).
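Roughly, that requirement amounts to a check like the one below. This is only a sketch; boto3 and the standard AWS environment variable names are the usual conventions, not necessarily what the scripts themselves use:

```python
# Sketch only: verify that AWS credentials are discoverable before launching
# anything. boto3 resolves them from ~/.aws/credentials (written by
# `aws configure`) or from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
import boto3

creds = boto3.Session().get_credentials()
if creds is None:
    raise RuntimeError(
        "No AWS credentials found; run `aws configure` or export "
        "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first."
    )
print("Credentials resolved via:", creds.method)
```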

Also, I currently much prefer using aws-parallelcluster, but that means running XGBoost communication over SLURM rather than YARN.

If we need YARN, I'd have to go back and make sure it still works as expected; alternatively, we could have a Spark-based benchmark, which I assume still works fine.

hcho3 commented 4 years ago

@thvasilo I was thinking of using dask and running the benchmark locally on a big AWS machine, to make it easy to manage. But yes, it would be nice if you could put up your scripts in a separate directory (cluster/). The more the merrier.
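Something along these lines is what I have in mind. It's only a sketch using the `xgboost.dask` interface on a `LocalCluster`; the synthetic data and parameters are placeholders, not the actual benchmark configuration:

```python
# Sketch only: a single-machine dask benchmark via the xgboost.dask API.
# Synthetic data stands in for the real benchmark datasets.
import time

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # One process-local cluster on the big instance; no YARN/SLURM involved.
    with LocalCluster() as cluster, Client(cluster) as client:
        X = da.random.random((1_000_000, 100), chunks=(100_000, 100))
        y = da.random.random(1_000_000, chunks=100_000)

        dtrain = xgb.dask.DaskDMatrix(client, X, y)

        start = time.perf_counter()
        output = xgb.dask.train(
            client,
            {"tree_method": "hist", "objective": "reg:squarederror"},
            dtrain,
            num_boost_round=100,
        )
        elapsed = time.perf_counter() - start
        print(f"Trained 100 rounds in {elapsed:.1f}s")
        output["booster"].save_model("model.json")
```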

trivialfis commented 4 years ago

@hcho3 I have an initial set of scripts for running dask benchmarks, but I use cuDF as the primary backend for data handling: https://github.com/trivialfis/dxgb_bench. I will add more datasets to it as I go.

It can be extended with other backends such as CPU dask or plain pandas. Would you like to take a look and see whether it's suitable for merging here?
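A rough sketch of how a pluggable backend could look; the function and the `--backend` flag here are illustrative only, not the actual dxgb_bench interface:

```python
# Hypothetical sketch of a pluggable data-loading backend; names are
# illustrative and do not reflect the real dxgb_bench API.
import argparse


def load_dataframe(path: str, backend: str):
    """Load a CSV into the requested dataframe backend."""
    if backend == "cudf":        # GPU, single device
        import cudf
        return cudf.read_csv(path)
    if backend == "dask_cudf":   # GPU, partitioned across devices
        import dask_cudf
        return dask_cudf.read_csv(path)
    if backend == "dask":        # CPU, partitioned pandas frames
        import dask.dataframe as dd
        return dd.read_csv(path)
    if backend == "pandas":      # CPU, single in-memory frame
        import pandas as pd
        return pd.read_csv(path)
    raise ValueError(f"unknown backend: {backend}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", required=True)
    parser.add_argument("--backend", default="cudf",
                        choices=["cudf", "dask_cudf", "dask", "pandas"])
    args = parser.parse_args()
    df = load_dataframe(args.data, args.backend)
    print(type(df))
```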

hcho3 commented 4 years ago

@trivialfis I will take a look, thanks! Is it fair to assume that dask will have the same performance characteristics as the underlying native distributed algorithm? My impression is that dask is a lightweight cluster framework.

terrytangyuan commented 4 years ago

It would also be good to have a distributed benchmark suite on a Kubernetes cluster using the XGBoost Operator, if anyone is interested in contributing: https://github.com/kubeflow/xgboost-operator

trivialfis commented 4 years ago

Yes. But it will have higher memory consumption due to pandas and partition management.