dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Fail to install xgboost R package from source inside conda environment under MacOS (big sur) #7017

Closed AlfredSAM closed 2 years ago

AlfredSAM commented 3 years ago

Hello! XGBoost uses only ONE thread (core) on macOS when it is installed with a plain install.packages("xgboost"). The solution is to build the xgboost R package from source as described in https://xgboost.readthedocs.io/en/latest/build.html#installing-the-development-version-linux-mac-osx. Of course, I first installed libomp with

brew install libomp

and then followed the instructions above to build the xgboost R package from source. The installation succeeded and multi-threading works, but ONLY for the system-wide R.

I failed to install the xgboost R package from source inside a conda environment on macOS (Big Sur).

To test against different versions of R, I normally create conda environments so that R and its packages are kept separate from the system-wide R. For example, I use the following R_4_mkl.yml to create the environment:

name: R_4.0_mkl
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - conda-forge::r-base=4.1.0
  - conda-forge::libblas=3.9.0=9_mkl

In a terminal, run the following to create the new conda environment and activate it:

conda env create -f R_4_mkl.yml
conda activate R_4.0_mkl

Next, I wanted to install the xgboost R package inside this conda environment. I slightly revised FindLibR.cmake in xgboost/cmake/modules so that the user can specify the path to the R executable. The revised file is attached:

FindLibR.cmake.zip

The key part is:

# detection for OSX
if(APPLE)

  find_library(LIBR_LIBRARIES R)

  if(NOT LIBR_EXECUTABLE)

    if(LIBR_LIBRARIES MATCHES ".*\\.framework")
      set(LIBR_HOME "${LIBR_LIBRARIES}/Resources" CACHE PATH "R home directory")
      set(LIBR_INCLUDE_DIRS "${LIBR_HOME}/include" CACHE PATH "R include directory")
      set(LIBR_EXECUTABLE "${LIBR_HOME}/R" CACHE PATH "R executable")
      set(LIBR_LIB_DIR "${LIBR_HOME}/lib" CACHE PATH "R lib directory")
    else()
      get_filename_component(_LIBR_LIBRARIES "${LIBR_LIBRARIES}" REALPATH)
      get_filename_component(_LIBR_LIBRARIES_DIR "${_LIBR_LIBRARIES}" DIRECTORY)
      set(LIBR_EXECUTABLE "${_LIBR_LIBRARIES_DIR}/../bin/R")
      execute_process(
        COMMAND ${LIBR_EXECUTABLE} "--slave" "--vanilla" "-e" "cat(R.home())"
        OUTPUT_VARIABLE LIBR_HOME)
      set(LIBR_HOME ${LIBR_HOME} CACHE PATH "R home directory")
      set(LIBR_INCLUDE_DIRS "${LIBR_HOME}/include" CACHE PATH "R include directory")
      set(LIBR_LIB_DIR "${LIBR_HOME}/lib" CACHE PATH "R lib directory")
    endif()
  else()
    execute_process(
      COMMAND ${LIBR_EXECUTABLE} "--slave" "--vanilla" "-e" "cat(R.home())"
      OUTPUT_VARIABLE LIBR_HOME)
    set(LIBR_HOME ${LIBR_HOME} CACHE PATH "R home directory")
    set(LIBR_INCLUDE_DIRS "${LIBR_HOME}/include" CACHE PATH "R include directory")
    set(LIBR_LIB_DIR "${LIBR_HOME}/lib" CACHE PATH "R lib directory")
    # Note: no comma after the variable name; CMake would treat it as part of the name.
    set(LIBR_CORE_LIBRARY "${LIBR_HOME}/lib/libR.dylib")
  endif()

I then ran:

cd xgboost
git submodule init
git submodule update
mkdir build
cd build
cmake .. -DR_LIB=ON -D${executable R path}
make
make install

where ${executable R path} is the output of which R inside the conda environment. The procedure fails at the final step, make install:

Command: /Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/bin/R -q -e deps = setdiff(c('data.table', 'jsonlite', 'Matrix'), rownames(installed.packages()))\  if(length(deps)>0) install.packages(deps, repo = 'https://cloud.r-project.org/')
Command: /Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/bin/R CMD INSTALL --no-multiarch --build /Users/alfredfaisam/Desktop/LearningR/xgboost/build/R-package-install/R-package
CMake Error at RPackageInstall.cmake:16 (message):
  out: Error: package or namespace load failed for ‘xgboost’ in
  library.dynam(lib, package, package.lib):

   shared object ‘xgboost.dylib’ not found

  Error: loading failed

  Execution halted

  , err: * installing to library
  ‘/Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/lib/R/library’

  * installing *source* package ‘xgboost’ ...

  ** using staged installation

  ** libs

  ** R

  ** data

  ** demo

  ** inst

  ** byte-compile and prepare package for lazy loading

  ** help

  *** installing help indices

  ** building package indices

  ** installing vignettes

  ** testing if installed package can be loaded from temporary location

  ERROR: loading failed

  * removing
  ‘/Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/lib/R/library/xgboost’

  , res: 1
Call Stack (most recent call first):
  RPackageInstall.cmake:34 (check_call)
  cmake_install.cmake:102 (include)

make: *** [install] Error 1

Could you please look into this issue, or suggest how to install the xgboost R package from source inside a conda environment on macOS?

r$> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Big Sur 11.3

Matrix products: default
BLAS/LAPACK: /Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/lib/libmkl_rt.dylib

locale:
[1] en_US.UTF-8/UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.1.0

AlfredSAM commented 3 years ago

To complete the report, the full commands I used with the revised FindLibR.cmake are as follows:

cd xgboost
git submodule init
git submodule update
mkdir build
cd build
cmake .. -DR_LIB=ON -DLIBR_EXECUTABLE=/Users/alfredfaisam/opt/miniconda3/envs/R_4.0_mkl/bin/R
make
make install

Sorry for the incomplete information.
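A minimal Python sketch of how the configure command above could be assembled for whichever R is active. It assumes the revised FindLibR.cmake from this thread, which is what reads LIBR_EXECUTABLE; the helper name cmake_configure_args and the example path are hypothetical.

```python
import shutil
from typing import List, Optional


def cmake_configure_args(r_executable: Optional[str] = None) -> List[str]:
    """Return the cmake configure command from this thread as an argument list.

    LIBR_EXECUTABLE is the variable checked by the revised FindLibR.cmake
    attached above; an unmodified FindLibR.cmake may ignore it.
    """
    # Fall back to the R found on PATH, i.e. what `which R` would report.
    r_path = r_executable or shutil.which("R")
    if r_path is None:
        raise FileNotFoundError("no R executable found on PATH")
    return ["cmake", "..", "-DR_LIB=ON", f"-DLIBR_EXECUTABLE={r_path}"]


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(cmake_configure_args("/path/to/conda/env/bin/R")))
```

Called without an argument inside an activated conda environment, the function picks up that environment's R from PATH, which matches the "which R" step described earlier.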

hcho3 commented 3 years ago

Conda-forge provides the XGBoost R package: https://anaconda.org/conda-forge/r-xgboost-cpu. Can you try installing it with conda install -c conda-forge r-xgboost-cpu?

hcho3 commented 3 years ago

Also, according to https://xgboost.readthedocs.io/en/latest/install.html#r, it should be sufficient to run install.packages("xgboost"). It should make use of libomp automatically.

asam46 commented 3 years ago

Thanks @hcho3! I actually tried both methods before, but neither of them ends up using multiple threads. First, here is the benchmark with the system-wide R and the xgboost R package built from source. As expected, multi-threading works:

r$> require(xgboost)
    x <- matrix(rnorm(100 * 10000), 10000, 100)
    y <- x %*% rnorm(100) + rnorm(1000)

    system.time({
      bst <- xgboost(data = x, label = y, nthread = 1, nround = 100, verbose = F)
    })
Loading required package: xgboost
   user  system elapsed
 19.075   0.098  16.838

r$> system.time({
      bst <- xgboost(data = x, label = y, nthread = 4, nround = 100, verbose = F)
    })
   user  system elapsed
 17.640   0.081   4.486

On the other hand, in a conda environment created from

name: R_4.0_mkl
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - conda-forge::r-base=4.1.0
  - conda-forge::libblas=3.9.0=9_mkl

installing xgboost with conda install -c conda-forge r-xgboost-cpu does NOT enable multi-threading:

r$> require(xgboost)
    x <- matrix(rnorm(100 * 10000), 10000, 100)
    y <- x %*% rnorm(100) + rnorm(1000)

    system.time({
      bst <- xgboost(data = x, label = y, nthread = 1, nround = 100, verbose = F)
    })
Loading required package: xgboost
   user  system elapsed
 17.161   0.063  16.618

r$> system.time({
      bst <- xgboost(data = x, label = y, nthread = 4, nround = 100, verbose = F)
    })
   user  system elapsed
 16.791   0.046  16.877

Now consider another conda environment created from the same .yml file, but with xgboost installed via install.packages("xgboost"). The first interesting point is that during the installation I see:

checking whether OpenMP will work in a package... no
*****************************************************************************************
         OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.
         To use all CPU cores for training jobs, you should install OpenMP by running

             brew install libomp
*****************************************************************************************

even though I have already installed it via

brew install libomp

Therefore, the results are not surprising:

r$> require(xgboost)
    x <- matrix(rnorm(100 * 10000), 10000, 100)
    y <- x %*% rnorm(100) + rnorm(1000)

    system.time({
      bst <- xgboost(data = x, label = y, nthread = 1, nround = 100, verbose = F)
    })
Loading required package: xgboost
   user  system elapsed
 17.789   0.058  17.284

r$> system.time({
      bst <- xgboost(data = x, label = y, nthread = 4, nround = 100, verbose = F)
    })
   user  system elapsed
 17.165   0.044  17.251

It seems that, for a conda environment on macOS, installation from source is necessary to get multi-threading, but the procedure needs some revisions.
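To put the three benchmarks side by side, here is a small Python sketch that computes the wall-clock speedup of nthread = 4 over nthread = 1. The elapsed times are copied from the system.time() outputs above; the dictionary labels and the function name speedup are only illustrative.

```python
# Elapsed seconds from the transcripts above: (nthread = 1, nthread = 4).
ELAPSED = {
    "system R, built from source": (16.838, 4.486),
    "conda, r-xgboost-cpu": (16.618, 16.877),
    "conda, install.packages": (17.284, 17.251),
}


def speedup(t_single: float, t_multi: float) -> float:
    """Wall-clock speedup of the multi-threaded run over the single-threaded run."""
    return t_single / t_multi


if __name__ == "__main__":
    for setup, (t1, t4) in ELAPSED.items():
        print(f"{setup}: {speedup(t1, t4):.2f}x")
```

The source build on the system-wide R comes out at roughly 3.75x with four threads, while both conda installations stay at about 1x, which is what one would expect if only a single thread is actually used.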

hcho3 commented 3 years ago

Got it. I'm out of ideas. OpenMP support on macOS has been a sticking point for a while, and even with libomp there are some use cases that fall through the cracks, such as yours. Feel free to share your insights once you figure something out.

asam46 commented 3 years ago

After several attempts, I found a way to solve this problem, even though it is not very elegant. First, after installing libomp via

brew install libomp

OpenMP should be available on macOS, which is why installing xgboost from source for the system-wide R successfully enables multi-threading. The problem must therefore lie in the compilation step when installing xgboost inside the conda environment. Inspired by this post, I located R's Makeconf file with the following command in the R console inside the conda environment:

file.path(R.home("etc"), "Makeconf")

Opening the file at the returned path in vim, I noticed that it lives inside the corresponding conda environment and that the following variables are set:

SHLIB_OPENMP_CFLAGS = -fopenmp
SHLIB_OPENMP_CXXFLAGS = -fopenmp
SHLIB_OPENMP_FFLAGS = -fopenmp

However, the following are blank

SHLIB_CFLAGS = 
SHLIB_CXXFLAGS = 
SHLIB_FFLAGS = 

Unfortunately, when running install.packages("xgboost") in the R console inside the conda environment, -fopenmp does not appear among the flags actually used for compilation. I therefore edited the file above and saved it with

SHLIB_CFLAGS = -fopenmp
SHLIB_CXXFLAGS = -fopenmp
SHLIB_FFLAGS = -fopenmp 
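The same edit can be sketched in Python for anyone who prefers not to change the file by hand. This is only an illustration working on the file's text: the helper names read_flags and set_openmp_flags are hypothetical, and the sketch assumes the three variables appear one per line in the form shown above. The real file is the one returned by file.path(R.home("etc"), "Makeconf"), and it is worth backing it up first.

```python
import re
from typing import Dict

FLAG_VARS = ("SHLIB_CFLAGS", "SHLIB_CXXFLAGS", "SHLIB_FFLAGS")


def read_flags(makeconf_text: str) -> Dict[str, str]:
    """Return the value of each SHLIB_*FLAGS variable ('' if blank or absent)."""
    values = {}
    for var in FLAG_VARS:
        # Anchoring at line start keeps SHLIB_OPENMP_CFLAGS etc. from matching.
        match = re.search(rf"^{var}[ \t]*=[ \t]*(.*)$", makeconf_text, flags=re.MULTILINE)
        values[var] = match.group(1).strip() if match else ""
    return values


def set_openmp_flags(makeconf_text: str, flag: str = "-fopenmp") -> str:
    """Return a copy of the text in which blank SHLIB_*FLAGS are set to `flag`."""
    for var in FLAG_VARS:
        # Only fill in variables whose value is empty; existing values are kept.
        makeconf_text = re.sub(
            rf"^({var}[ \t]*=)[ \t]*$",
            rf"\1 {flag}",
            makeconf_text,
            flags=re.MULTILINE,
        )
    return makeconf_text


if __name__ == "__main__":
    sample = "SHLIB_OPENMP_CFLAGS = -fopenmp\nSHLIB_CFLAGS = \nSHLIB_CXXFLAGS = \nSHLIB_FFLAGS = \n"
    print(read_flags(set_openmp_flags(sample)))
```

Because only blank variables are filled in, running the function a second time leaves the text unchanged.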

Now run install.packages("xgboost") in the R console inside the conda environment. As before, the message

checking whether OpenMP will work in a package... no
*****************************************************************************************
         OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.
         To use all CPU cores for training jobs, you should install OpenMP by running

             brew install libomp
*****************************************************************************************

still appears. However, -fopenmp now shows up among the flags used during compilation. After the installation, multi-threading is available:

r$> require(xgboost)
    x <- matrix(rnorm(100 * 10000), 10000, 100)
    y <- x %*% rnorm(100) + rnorm(1000)

    system.time({
      bst <- xgboost(data = x, label = y, nthread = 1, nround = 100, verbose = F)
    })
Loading required package: xgboost
   user  system elapsed
 19.429   0.130  17.317

r$> system.time({
      bst <- xgboost(data = x, label = y, nthread = 4, nround = 100, verbose = F)
    })
   user  system elapsed
 17.949   0.063   4.538

r$> system.time({
      bst <- xgboost(data = x, label = y, nthread = 8, nround = 100, verbose = F)
    })
   user  system elapsed
 27.401   0.094   3.457

Even though this method is not very elegant, I am fine with this change. Furthermore, keeping these settings does no harm and may also benefit other packages that need compilation and use multi-threading.
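Besides the elapsed time, the ratio of CPU time to elapsed time in these outputs gives a rough hint of how many cores were busy. A small Python sketch, assuming the two-line text format that system.time() prints above; the names Timing, parse_system_time and cpu_to_wall_ratio are hypothetical:

```python
from typing import NamedTuple


class Timing(NamedTuple):
    user: float
    system: float
    elapsed: float


def parse_system_time(output: str) -> Timing:
    """Parse the two-line text printed by R's system.time()."""
    lines = [line.split() for line in output.strip().splitlines()]
    header, values = lines[0], lines[1]
    if header != ["user", "system", "elapsed"]:
        raise ValueError(f"unexpected header: {header}")
    return Timing(*(float(v) for v in values))


def cpu_to_wall_ratio(t: Timing) -> float:
    """CPU seconds consumed per wall-clock second; near 1 means one busy core."""
    return (t.user + t.system) / t.elapsed


if __name__ == "__main__":
    four_threads = parse_system_time("   user  system elapsed\n 17.949   0.063   4.538")
    print(round(cpu_to_wall_ratio(four_threads), 2))
```

For the three runs above the ratio is about 1.1, 4.0 and 8.0 for nthread = 1, 4 and 8, which is consistent with the requested number of threads being used. It is only an indicator, since CPU time also includes work outside the training loop.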

Thanks all the same, @hcho3. I hope this post offers some hints about installing xgboost in a conda environment on macOS.

asam46 commented 3 years ago

One more remark, about installing the xgboost Python package inside a conda environment. In short, with libomp installed using

brew install libomp

installing with conda install -c conda-forge xgboost does enable multi-threading. In my experiment I also use MKL to accelerate numpy, just as I did for the R conda environment:

conda install -c conda-forge numpy libblas=3.9.0=9_mkl

and then install xgboost:

conda install -c conda-forge xgboost

Then try the following:

In [1]: import numpy as np
   ...: import xgboost as xgb
   ...: import timeit
   ...:
   ...: data = np.random.rand(10000, 100)
   ...: label = np.random.randint(2, size=10000)
   ...: dtrain = xgb.DMatrix(data, label=label)
   ...:
   ...: param_1 = {'objective': 'binary:logistic', 'nthread': 1, 'eval_metric': 'auc'}
   ...:
   ...: param_4 = {'objective': 'binary:logistic', 'nthread': 4, 'eval_metric': 'auc'}
   ...:
   ...: param_8 = {'objective': 'binary:logistic', 'nthread': 8, 'eval_metric': 'auc'}
   ...:
   ...: num_round = 100

In [2]: start = timeit.default_timer()
   ...:
   ...: xgb.train(param_1, dtrain, num_round)
   ...:
   ...: stop = timeit.default_timer()
   ...:
   ...: print('Time: ', stop - start)
Time:  16.160123399

In [3]: start = timeit.default_timer()
   ...:
   ...: xgb.train(param_4, dtrain, num_round)
   ...:
   ...: stop = timeit.default_timer()
   ...:
   ...: print('Time: ', stop - start)
Time:  4.242956155000002

In [4]: start = timeit.default_timer()
   ...:
   ...: xgb.train(param_8, dtrain, num_round)
   ...:
   ...: stop = timeit.default_timer()
   ...:
   ...: print('Time: ', stop - start)
Time:  3.200284463999999
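The three timing cells above repeat the same pattern, so they can be folded into one loop. A sketch under the same setup (random data, binary:logistic, 100 rounds); the helper names time_by_nthread and xgboost_trainer are hypothetical, and numpy and xgboost are imported only when the trainer is built:

```python
import timeit
from typing import Callable, Dict, Iterable


def time_by_nthread(train: Callable[[int], object],
                    thread_counts: Iterable[int] = (1, 4, 8)) -> Dict[int, float]:
    """Call train(nthread) once per thread count and return elapsed seconds."""
    timings = {}
    for nthread in thread_counts:
        start = timeit.default_timer()
        train(nthread)
        timings[nthread] = timeit.default_timer() - start
    return timings


def xgboost_trainer(num_round: int = 100) -> Callable[[int], object]:
    """Build a train(nthread) callable on random data, as in the session above."""
    # Imported here so the timing helper itself has no third-party dependency.
    import numpy as np
    import xgboost as xgb

    data = np.random.rand(10000, 100)
    label = np.random.randint(2, size=10000)
    dtrain = xgb.DMatrix(data, label=label)

    def train(nthread: int):
        params = {"objective": "binary:logistic", "nthread": nthread, "eval_metric": "auc"}
        return xgb.train(params, dtrain, num_round)

    return train
```

Calling time_by_nthread(xgboost_trainer()) returns a dictionary from thread count to elapsed seconds. If the values stay flat as nthread grows, OpenMP is probably not in effect.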

Therefore, for the xgboost Python package inside a conda environment on macOS, OpenMP is set up correctly and in use. For the xgboost R package on macOS, however, installation from source is necessary to enable OpenMP. For the system-wide R, just follow https://xgboost.readthedocs.io/en/latest/build.html#installing-the-development-version-linux-mac-osx; for R inside a conda environment, my solution above may be the easy, if not very elegant, way to fix it. In either case, libomp must be installed on macOS at the very beginning:

brew install libomp