dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Errors under valgrind when openmp is enabled #8238

Closed kropacf closed 2 years ago

kropacf commented 2 years ago

xgboost: 1.6.1 model: 1.1.1 ubuntu 22.04

I found an error when running predictions under valgrind. According to the valgrind log, the error is somewhere in OpenMP, so I built xgboost without OpenMP (-DUSE_OPENMP=0) and the error went away. I know there are known false positives when running OpenMP under valgrind (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36298), but I want to be sure these errors are caused by OpenMP and not by xgboost. Attachment: xgboost_valgrind_example.zip

Without OpenMP:

==7== Memcheck, a memory error detector
==7== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==7== Command: ./xgboost_valgrind
==7== 
[08:25:42] WARNING: /xgboost/src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
==7== 
==7== HEAP SUMMARY:
==7==     in use at exit: 0 bytes in 0 blocks
==7==   total heap usage: 262,933 allocs, 262,933 frees, 45,755,755 bytes allocated
==7== 
==7== All heap blocks were freed -- no leaks are possible
==7== 
==7== For lists of detected and suppressed errors, rerun with: -s
==7== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

With OpenMP:

==7== Memcheck, a memory error detector
==7== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==7== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==7== Command: ./xgboost_valgrind
==7== 
[08:23:52] WARNING: /xgboost/src/learner.cc:749: Found JSON model saved before XGBoost 1.6, please save the model using current version again. The support for old JSON model will be discontinued in XGBoost 2.3.
==7== 
==7== HEAP SUMMARY:
==7==     in use at exit: 7,856 bytes in 15 blocks
==7==   total heap usage: 262,957 allocs, 262,942 frees, 45,825,843 bytes allocated
==7== 
==7== 3,520 bytes in 11 blocks are possibly lost in loss record 4 of 5
==7==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==7==    by 0x40147D9: calloc (rtld-malloc.h:44)
==7==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==7==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==7==    by 0x514A834: allocate_stack (allocatestack.c:430)
==7==    by 0x514A834: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==7==    by 0x52FD1EF: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==7==    by 0x52F3A10: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==7==    by 0x4AE2B70: xgboost::gbm::GBTreeModel::LoadModel(xgboost::Json const&) (in /usr/local/lib/libxgboost.so)
==7==    by 0x4AB9DB1: xgboost::gbm::GBTree::LoadModel(xgboost::Json const&) (in /usr/local/lib/libxgboost.so)
==7==    by 0x4B02C57: xgboost::LearnerIO::LoadModel(xgboost::Json const&) (in /usr/local/lib/libxgboost.so)
==7==    by 0x4B0BB2F: xgboost::LearnerIO::LoadModel(dmlc::Stream*) (in /usr/local/lib/libxgboost.so)
==7==    by 0x10B581: main (main.cpp:13)
==7== 
==7== LEAK SUMMARY:
==7==    definitely lost: 0 bytes in 0 blocks
==7==    indirectly lost: 0 bytes in 0 blocks
==7==      possibly lost: 3,520 bytes in 11 blocks
==7==    still reachable: 4,336 bytes in 4 blocks
==7==         suppressed: 0 bytes in 0 blocks
==7== Reachable blocks (those to which a pointer was found) are not shown.
==7== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==7== 
==7== For lists of detected and suppressed errors, rerun with: -s
==7== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

I made a small example in Docker:

docker build . -t valgrind_example
docker run -it --rm valgrind_example

trivialfis commented 2 years ago

We run address sanitizer with leak sanitizer on CI. I think it's a false positive from valgrind.

trivialfis commented 2 years ago

Feel free to reopen if there's any sign of a real memory leak.
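
One way to check whether the "possibly lost" record comes from libgomp rather than from xgboost is to run code that uses OpenMP but does not link xgboost at all. The following is a minimal sketch, not part of the original report: the function name SumSquares is hypothetical, and it assumes GCC with -fopenmp. Called from a small main() and run under valgrind --leak-check=full, it should enter GOMP_parallel the same way LoadModel does in the trace above.

```cpp
// Hypothetical stand-alone check: uses OpenMP only, no xgboost.
// Assumed build flags: g++ -g -fopenmp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sums the squares of 0..n-1 inside a parallel region, so that the
// OpenMP runtime creates its worker threads (pthread_create) when
// the code is built with OpenMP enabled.
long long SumSquares(int n) {
  long long total = 0;
#pragma omp parallel for reduction(+ : total)
  for (int i = 0; i < n; ++i) {
    total += static_cast<long long>(i) * i;
  }
  return total;
}
```

If valgrind reports a similar "possibly lost" block allocated through pthread_create and GOMP_parallel for this code, that would suggest the record originates in the OpenMP runtime's worker threads and not in xgboost. If it does not appear, the comparison is inconclusive rather than proof of a leak in xgboost.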