Open hcho3 opened 4 years ago
@thvasilo I was thinking of using dask and running the benchmark locally on a big AWS machine, to make it easy to manage. But yes, it would be nice if you could put up your script in a separate directory (cluster). The more the merrier.
@hcho3 I have an initial set of scripts for running dask benchmarks, but I use cuDF as the primary backend for data handling: https://github.com/trivialfis/dxgb_bench I will add more datasets to it as I make progress.
It can be extended with other backends like CPU dask or plain pandas. Would you like to take a look and see if it's suitable for merging here?
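Roughly, the way I picture swapping backends is to keep the timing harness separate from the data-handling code. This is only an illustrative sketch, not the actual dxgb_bench API — all names here are hypothetical:

```python
import time

def run_benchmark(name, load_data, train):
    """Time the data-loading and training phases of one backend separately.

    `load_data` and `train` are callables supplied by the backend
    (e.g. cuDF, CPU dask, or plain pandas); the harness itself stays
    backend-agnostic.
    """
    start = time.perf_counter()
    data = load_data()
    load_sec = time.perf_counter() - start

    start = time.perf_counter()
    train(data)
    train_sec = time.perf_counter() - start

    return {"backend": name, "load_sec": load_sec, "train_sec": train_sec}

# Toy stand-ins so the sketch runs without any heavy dependencies;
# a real backend would load a dataset and call its training routine.
result = run_benchmark(
    "pandas",
    load_data=lambda: list(range(1000)),
    train=lambda data: sum(data),
)
```

Adding a new backend would then just mean supplying a new pair of callables, without touching the timing logic.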
@trivialfis I will take a look, thanks! Is it fair to assume that dask will have the same performance characteristics as the underlying native distributed algorithm? My impression is that dask is a lightweight cluster application.
It would also be good to have a distributed benchmark suite on a Kubernetes cluster using the XGBoost Operator, if anyone is interested in contributing: https://github.com/kubeflow/xgboost-operator
Yes, but it will have higher memory consumption due to pandas and partition management.
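The partition overhead is visible even without dask. Splitting one pandas frame into many small frames (a crude stand-in for dask partitions, assuming pandas is available) already increases the total reported memory, because every partition carries its own index and metadata; real dask partitions add scheduler bookkeeping on top of this:

```python
import numpy as np
import pandas as pd

# One large frame vs. the same rows split into many small "partitions".
df = pd.DataFrame({"x": np.arange(100_000, dtype=np.float64)})
whole = df.memory_usage(deep=True).sum()

# Crude stand-in for dask's partitioning: 100 row slices of 1000 rows each.
partitions = [df.iloc[i:i + 1000] for i in range(0, len(df), 1000)]
split_total = sum(p.memory_usage(deep=True).sum() for p in partitions)

# Each slice repeats the per-object index/metadata cost, so the split
# total comes out larger than the single frame's footprint.
assert split_total > whole
```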
I could try adding this through the scripts I have available currently.
My setup requires that the user already has AWS credentials set up (through aws-cli or as env vars, I think).
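For context, by "credentials set up" I mean the standard AWS setup, either interactively through the CLI or via environment variables (placeholder values below, obviously):

```shell
# Option 1: interactive setup, writes ~/.aws/credentials
aws configure

# Option 2: environment variables picked up by aws-cli and SDKs
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
```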
Also, I currently much prefer using aws-parallelcluster, but that involves running XGBoost communication over SLURM rather than YARN.
If we need YARN I'd have to go back and make sure it works as expected; alternatively, we could have a Spark-based benchmark, which I assume still works fine.