dmlc / XGBoost.jl

XGBoost Julia Package

Number of threads used may be more than number of processes #88

Closed: karanbudhraja-tgen closed this issue 11 months ago

karanbudhraja-tgen commented 4 years ago

In its current implementation, the interface does not take into account the number of processes Julia was started with; by default, it uses all available CPU cores. This can be limited with the "nthread" parameter during training, but many Julia libraries respect the configured process count and limit themselves accordingly.

Perhaps the number of Julia processes could be passed as the default value of this argument during xgboost model training.
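For illustration, a minimal sketch of the explicit workaround (assuming the v1-era xgboost(data, nrounds; params...) training signature; keyword handling may differ in other versions of XGBoost.jl):

using XGBoost

# Toy data for the sketch.
X = rand(100, 4)
y = rand(100)

# Explicitly cap libxgboost's threads; without nthread it uses all CPU cores.
bst = xgboost(X, 10, label = y, nthread = 2)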

aviks commented 4 years ago

Can you elaborate on your use case? Are you running multiple Julia processes on one machine, and want the XGBoost training to not use all available cores? What is the downside to using nthreads? Not sure I understand exactly what you mean, sorry.

karanbudhraja-tgen commented 4 years ago

There isn't a downside to using nthread; it is just that setting it manually is a mandate. When I set Julia's number of processes using -p, other libraries such as Distributed limit themselves to that many processes. XGBoost, however, does not consider that value and uses all available cores on the machine (more than the number of processes I specified using -p). What I meant to suggest is that XGBoost should likewise limit itself when Julia is run with -p. This could probably be done by internally passing the value supplied to -p, if any, as nthread to XGBoost.

I hope that helps clarify things, but please let me know if I can explain further.
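In other words, something like this (a sketch: nworkers() from Distributed reports the process count started with -p, and the xgboost call follows the v1-era signature assumed above):

using Distributed, XGBoost

# julia -p N starts N worker processes; nworkers() reports that count.
# Sketch: use it as the default thread cap instead of all CPU cores.
X = rand(100, 4)
y = rand(100)
bst = xgboost(X, 10, label = y, nthread = nworkers())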

ablaom commented 4 years ago

You may also want to consider the pure-Julia tree-boosting algorithm in EvoTrees.jl. Moving forward, it may be easier to get missing features added there than to expose what you want from this wrapped version. Development is still fairly active there.

Just a suggestion.

karanbudhraja-tgen commented 4 years ago

Thank you for the suggestion! I had not considered EvoTrees.jl. It is great that development there is active. I will check out that implementation as well.

clouds56 commented 3 years ago

Hello, I have the opposite problem. XGBoost seems to use only 1 core on my macOS machine (10.15, Core i5): CPU usage is ~90%, while the Python xgboost package on the same machine can exploit ~360%. Passing nthread=4 or not doesn't change the situation. Parameters like this:

  (
    nrounds=1000,
    nthread=4,
    # verbosity=3,
    # silent=false,
    tree_method="hist",
    eta=0.1,
    subsample=0.8,
    max_bin=256,
    max_depth=7,
    reg_lambda=1,
    reg_alpha=1,
    gamma=1,
    min_child_weight=500,
  ),

version info

Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

[009559a3] XGBoost v1.1.1
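
One diagnostic worth trying (my assumption: the bundled libxgboost is an OpenMP build that honors the standard OMP_NUM_THREADS environment variable; a macOS binary compiled without OpenMP would ignore it):

# Set before libxgboost spins up its thread pool, i.e. before training.
ENV["OMP_NUM_THREADS"] = "4"
using XGBoost
# ... then train as before and watch CPU usage.
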
ChristianMichelsen commented 2 years ago

I have the same issue as @clouds56, also on a MacBook Pro (Intel).

bobaronoff commented 1 year ago

This is an old thread, but since it is active I'll add my two cents for others who may have questions about multi-threading on macOS with Apple Silicon. By default, libxgboost appears to claim threads based on the machine's cores and operating system. Multi-threading on Mac requires the OpenMP library when compiling libxgboost.dylib, and it was not until libxgboost version 1.6.0 that an OpenMP build for Apple Silicon became available in the supplied binaries.

If the booster parameter 'nthread' is not specified, the full number of 'available' threads is used; setting the parameter serves only to restrict the thread count below what libxgboost would select on its own. Multi-threading in libxgboost appears to be used mostly for DMatrix operations and predict, but training also sees some improvement. I can confirm that multi-threading occurs by default on my Apple M1 Pro: going from nthread=1 to 4 reduces training time by about 50-60%; above 4 there is no further improvement. I cannot figure out how many threads libxgboost actually uses by default, but performance is consistent with at least 4. Updating my version of XGBoost.jl also updated me to the then most current libxgboost, version 1.7.3.

Therefore, on Apple Silicon, if your XGBoost.jl is up to date then you are getting the benefits of multi-threading. GPU support on Apple Silicon is not available; only time will tell if/when it arrives, as this depends on issues outside of XGBoost.jl.
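The timing comparison can be reproduced roughly like this (a sketch using the v2-style API, where xgboost takes an (X, y) pair and num_round; keyword names may differ in other versions):

using XGBoost

X = randn(10_000, 20)
y = randn(10_000)

# Compare wall time at 1 vs 4 threads; per the numbers above, the
# second run should be roughly 50-60% faster on an M1 Pro.
@time xgboost((X, y); num_round = 100, nthread = 1)
@time xgboost((X, y); num_round = 100, nthread = 4)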

If you wish to confirm the libxgboost version that XGBoost.jl is using, my code for this task is below:

function libxgboost_version()
    # Refs to receive the major/minor/patch components from the C library.
    mj = Ref{Cint}(0)
    mn = Ref{Cint}(0)
    pt = Ref{Cint}(0)
    XGBoost.XGBoostVersion(mj, mn, pt)
    return "$(mj[]).$(mn[]).$(pt[])"
end
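
On the setup described above, this returns "1.7.3".
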
Moelf commented 1 year ago

Running into this problem again. This package ignores the julia -t 1 setting, which causes some kind of soft lock on the computing node I'm on.

ExpandingMan commented 1 year ago

Currently, if the number of threads is not explicitly set, we do not pass any argument to either DMatrix or Booster, so I guess whatever the library does on its own is not an appropriate default. Can you confirm that passing the keyword args nthreads=1 to DMatrix and nthread=1 (yes, they are different) to Booster solves your issue? If so, we can replace their defaults with Threads.nthreads().
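
For reference, a sketch of that workaround (the keyword names are as given above; the constructor and training calls are my assumptions about the v2 API):

using XGBoost

X = randn(1_000, 10)
y = randn(1_000)

dm  = DMatrix(X, y; nthreads = 1)   # note: DMatrix takes `nthreads`
bst = Booster(dm; nthread = 1)      # ...but Booster takes `nthread`
update!(bst, dm; num_round = 10)    # train a few rounds single-threaded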

Moelf commented 1 year ago

> nthread=1 (yes, they are different) to Booster solves your issue?

Works! Should I make a PR?

ExpandingMan commented 1 year ago

Sure, thanks.