dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Allow users to set the logger level in XGBoost-PySpark #10065

Closed danmar3 closed 8 months ago

danmar3 commented 8 months ago

Hi, using XGBoost-PySpark in notebooks currently generates many log messages, and I have not been able to turn them off. For example, calling .transform spams the notebook with messages like:

2024-02-22 08:44:49,270 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,275 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,310 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,403 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,411 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,535 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,603 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs
2024-02-22 08:44:49,616 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs

Currently, every time get_logger is called, the logger level is set to INFO. This prevents the user from choosing the logging level, because it is always reset to INFO.

I think this can be solved by removing the setLevel line.
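To illustrate the behavior described above, here is a minimal sketch using only the standard logging module. Both helper functions and the logger names are hypothetical and are not the actual XGBoost-PySpark code; the second helper shows one possible alternative to removing the setLevel call outright, namely applying the INFO default only when no level has been set.

```python
import logging


def get_logger_always_info(name: str) -> logging.Logger:
    """Hypothetical helper that mirrors the reported behavior."""
    logger = logging.getLogger(name)
    # Unconditional setLevel: overrides whatever level the user chose earlier.
    logger.setLevel(logging.INFO)
    return logger


def get_logger_respecting_user(name: str) -> logging.Logger:
    """Hypothetical variant that only applies a default level."""
    logger = logging.getLogger(name)
    # NOTSET means nobody has configured this logger yet.
    if logger.level == logging.NOTSET:
        logger.setLevel(logging.INFO)
    return logger


# The user tries to silence INFO messages before the library asks for its logger.
logging.getLogger("demo.always").setLevel(logging.WARNING)
logging.getLogger("demo.respect").setLevel(logging.WARNING)

level_after_always = get_logger_always_info("demo.always").level
level_after_respect = get_logger_respecting_user("demo.respect").level
```

With the first helper the user's WARNING setting is lost on every call; with the second it is preserved.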

Thank you

trivialfis commented 8 months ago

Hmm, we need to find a way to unify all the logging levels.

trivialfis commented 8 months ago

There's XGB logging, Python logging, spark logging, among some others.

wbo4958 commented 8 months ago

Let me open a PR to fix this issue.

wbo4958 commented 8 months ago

Hi @danmar3, previously the message "Do the inference on the CPUs" was printed once for every partition, which is really annoying. I opened https://github.com/dmlc/xgboost/pull/10077 to rework the logging so that the message is emitted only on partition 0, which means only one line is printed for the inference. I think keeping this one line is fine for debugging, especially in the GPU scenario, where inference sometimes falls back to the CPU because of the environment even though we have explicitly configured it to use the GPU.
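The idea of logging only on partition 0 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the PR: predict_partition, the logger name, and the placeholder "inference" are all hypothetical, and the partitions are simulated with a dictionary. In a real Spark task the partition id could be obtained with pyspark's TaskContext.get().partitionId().

```python
import logging

# Collect emitted messages in a list so the effect is easy to inspect.
messages = []


class ListHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        messages.append(record.getMessage())


logger = logging.getLogger("demo.partition_log")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo output out of the root logger
logger.addHandler(ListHandler())


def predict_partition(partition_id: int, rows: list) -> list:
    """Hypothetical per-partition function; logs only on partition 0."""
    if partition_id == 0:
        logger.info("Do the inference on the CPUs")
    # Placeholder for real inference: double each value.
    return [row * 2 for row in rows]


# Simulated partitions; Spark would run one task per partition.
partitions = {0: [1, 2], 1: [3], 2: [4, 5]}
results = {pid: predict_partition(pid, rows) for pid, rows in partitions.items()}
```

Every partition is still processed, but the message is recorded once rather than once per partition.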