elastic / ml-cpp

Machine learning C++ code

Fatal error: 'si_signo 11, si_code: 1, ...' when _infer using ELSER model, after PyTorch 2.1 upgrade #2617

Closed wwang500 closed 7 months ago

wwang500 commented 7 months ago

Summary

After the recent upgrade to PyTorch 2.1, we observed a fatal error during ELSER model inference testing on the main branch.

Steps to reproduce:

  1. Using the cloud staging environment, deploy an 8.13.0-SNAPSHOT cluster with 32GB RAM ML nodes in 2 zones.

  2. Restore the beir-fiqa dataset from the ml-data or ml-qa snapshot bucket.

  3. Download and start the .elser_model_2_linux-x86_64 model with 2 allocations x 16 threads.

  4. Run the requests below from Dev Tools:

    
    PUT _index_template/reindex-beir
    {
      "index_patterns": ["reindex-beir-*"],
      "template": {
        "settings": {"index.mapping.total_fields.limit": 2000},
        "mappings": {
          "properties": {"ml.tokens": {"type": "rank_features"}, "txt": {"type": "text"}}
        }
      }
    }

    PUT _ingest/pipeline/elser-v2-test
    {
      "processors": [
        {
          "inference": {
            "ignore_failure": false,
            "model_id": ".elser_model_2_linux-x86_64",
            "target_field": "ml",
            "field_map": {"txt": "text_field"},
            "inference_config": {"text_expansion": {"results_field": "tokens"}}
          }
        }
      ],
      "on_failure": [
        {
          "set": {
            "description": "Record error information",
            "field": "error_information",
            "value": "Processor {{ _ingest.on_failure_processor_type }} with tag {{ _ingest.on_failure_processor_tag }} in pipeline {{ _ingest.on_failure_pipeline }} failed with message {{ _ingest.on_failure_message }}"
          }
        }
      ]
    }

    POST _reindex?wait_for_completion=false
    {
      "source": {"index": "beir-fiqa", "query": {"exists": {"field": "txt"}}, "size": 100},
      "dest": {"index": "reindex-beir-fiqa", "pipeline": "elser-v2-test"}
    }

    GET _tasks/....


**Observed:**
The task got stuck and there are errors in the [es log](https://platform-logging.kb.us-central1.gcp.foundit.no/app/r/s/J9GcV):

[instance-0000000002] [.elser_model_2_linux-x86_64] inference process crashed due to reason [[.elser_model_2_linux-x86_64] pytorch_inference/710 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: (nil), library: (null), base: (nil), normalized address: (nil)', version: 8.13.0-SNAPSHOT (build e7f6029489fc16) ]

droberts195 commented 7 months ago

@wwang500 for step 2, what are the exact commands? I tried this:

PUT _snapshot/ml-qa
{
  "type": "gcs",
  "settings": {
    "bucket": "ml-qa",
    "client": "default"
  }
}

GET _snapshot/ml-qa/beir-fiqa

The response was:

{
  "error": {
    "root_cause": [
      {
        "type": "snapshot_missing_exception",
        "reason": "[ml-qa:beir-fiqa] is missing"
      }
    ],
    "type": "snapshot_missing_exception",
    "reason": "[ml-qa:beir-fiqa] is missing"
  },
  "status": 404
}
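
In case the snapshot is stored under a different name, listing everything in the repository should show it. A minimal sketch, assuming the cluster is reachable on localhost:9200 (authentication omitted):

    # List all snapshots in the registered ml-qa repository
    curl -s "localhost:9200/_cat/snapshots/ml-qa?v"

    # JSON form, which also shows which indices each snapshot contains
    curl -s "localhost:9200/_snapshot/ml-qa/_all?pretty"
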
droberts195 commented 7 months ago

I managed to reproduce this on a self-managed single-node cluster running on a c6i.4xlarge AWS instance. That instance type has 16 vCPUs, so I deployed the model with 1 allocation and 16 threads so that the single running process setup would match the one in the original description. The process suffered a SEGV pretty soon after startup. The next step is to get a core dump...
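
A sketch of the relevant commands, assuming the standard trained model deployment API and default paths on the test instance (host, authentication and core file naming will differ per machine):

    # Start the optimised ELSER model with 1 allocation and 16 threads
    curl -s -X POST "localhost:9200/_ml/trained_models/.elser_model_2_linux-x86_64/deployment/_start?number_of_allocations=1&threads_per_allocation=16"

    # Allow core dumps in the shell that launches Elasticsearch, and check where cores go on this box
    ulimit -c unlimited
    cat /proc/sys/kernel/core_pattern

    # After the crash, open the core against the shipped binary (core file name is hypothetical)
    gdb elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/pytorch_inference core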

droberts195 commented 7 months ago

The core dump shows this stack trace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007fc48fadc017 in ?? () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#2  0x00007fc48fad0fa0 in mkl_vml_serv_threader_s_1i_1o ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#3  0x00007fc490570da2 in vmsLn () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_intel_lp64.so
#4  0x00007fc498b83f44 in ?? () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libtorch_cpu.so
#5  0x00007fc494624826 in at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libtorch_cpu.so
#6  0x00007fc49462497a in ?? () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libtorch_cpu.so
#7  0x00007fc4898e4616 in ?? () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libgomp.so.1
#8  0x00007fc4930ff6db in start_thread (arg=0x7fc473fff700) at pthread_create.c:463
#9  0x00007fc49278b61f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Another interesting finding: the reindex succeeds if run with 1 allocation and 1, 2 or 8 threads. It also succeeds if run with 2 allocations and 8 threads. It fails almost immediately with 1 allocation and 16 threads.

The non-optimised ELSER model also works with 1 allocation and 16 threads.

From the stack, the code of at::TensorIteratorBase::serial_for_each is here. It's likely that the next, unnamed function on the call stack is this. After that we're into the closed-source MKL library.

Looking at the history of TensorIterator.cpp, there are 8 commits since the version that's in 2.1.2. The next step is to look at these commits in more detail and see if any appear to be fixing a bug that might explain a core dump.
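
For reference, a sketch of how to list those commits from a PyTorch checkout (assuming the v2.1.2 tag and main are both fetched locally):

    # Commits touching TensorIterator.cpp on main that are not in the 2.1.2 release
    git log --oneline v2.1.2..main -- aten/src/ATen/TensorIterator.cpp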

droberts195 commented 7 months ago

> there are 8 commits since the version that's in 2.1.2

None of the 8 commits are likely to have any impact on this problem:

  1. https://github.com/pytorch/pytorch/commit/b7a95f4fdb8a79dc459cc757dafcdbd0953b1a62 = Include what you use
  2. https://github.com/pytorch/pytorch/commit/83deaa16edd6ace0d19aa0683d7ac21d7be6d223 = Revert previous commit
  3. https://github.com/pytorch/pytorch/commit/afed0314a89b822389bfa1b269582ad0ce1a50b1 = Typos in comments
  4. https://github.com/pytorch/pytorch/commit/424c093fc7adbd359a0f2df59d15da363ac8ff7e = Typos in comments
  5. https://github.com/pytorch/pytorch/commit/feb479757f95091903a16442d1afe1f7c9c405d3 = Make addc[mul|div] support different out dtypes
  6. https://github.com/pytorch/pytorch/commit/bae61ecb962edb5e9766cac195c09fe3bf85828f = Include what you use revert revert
  7. https://github.com/pytorch/pytorch/commit/516bd4a72c0e265e786ab6862bd8e7f71f2483a9 = Use std::in_place
  8. https://github.com/pytorch/pytorch/commit/cd084c4909017b71a520d15b45129dc7ccfd8dbf = Adding methods for const inputs

The problem also reproduces with PyTorch 2.2.0-rc8, which is pretty recent. So a further upgrade won't help us.

droberts195 commented 7 months ago

A stack trace with more debug symbols looks like this:

(gdb) bt   
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:96
#1  0x00007ff9082ce4a3 in _IO_vfprintf_internal (s=0x7ff8e57e1930, 
    format=0x7ff90910d290 "si_signo %d, si_code: %d, si_errno: %d, address: %p, library: %s, base: %p, normalized address: %p\n", ap=0x7ff8e57e3ff0)
    at vfprintf.c:1643
#2  0x00007ff9082cf640 in buffered_vfprintf (s=s@entry=0x7ff90865d680 <_IO_2_1_stderr_>, 
    format=format@entry=0x7ff90910d290 "si_signo %d, si_code: %d, si_errno: %d, address: %p, library: %s, base: %p, normalized address: %p\n", 
    args=args@entry=0x7ff8e57e3ff0) at vfprintf.c:2329
#3  0x00007ff9082cc6f6 in _IO_vfprintf_internal (s=s@entry=0x7ff90865d680 <_IO_2_1_stderr_>, 
    format=format@entry=0x7ff90910d290 "si_signo %d, si_code: %d, si_errno: %d, address: %p, library: %s, base: %p, normalized address: %p\n", 
    ap=ap@entry=0x7ff8e57e3ff0) at vfprintf.c:1301
#4  0x00007ff9083a3326 in ___fprintf_chk (fp=0x7ff90865d680 <_IO_2_1_stderr_>, flag=flag@entry=1, 
    format=format@entry=0x7ff90910d290 "si_signo %d, si_code: %d, si_errno: %d, address: %p, library: %s, base: %p, normalized address: %p\n") at fprintf_chk.c:35
#5  0x00007ff908fbb745 in fprintf (__fmt=0x7ff90910d290 "si_signo %d, si_code: %d, si_errno: %d, address: %p, library: %s, base: %p, normalized address: %p\n", 
    __stream=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/stdio2.h:97
#6  ml::core::crashHandler (sig=11, info=0x7ff8e57e4270, context=<optimized out>) at /home/ubuntu/ml-cpp/lib/core/CCrashHandler_Linux.cc:57
#7  <signal handler called>
#8  0x0000000000000000 in ?? ()
#9  0x00007ff9056e9039 in mkl_vml_serv_threader_s_1i_1o.omp_fn ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#10 0x00007ff9056ddfa0 in mkl_vml_serv_threader_s_1i_1o ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#11 0x00007ff90617dda2 in vmsLn () from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_intel_lp64.so
#12 0x00007ff90e7561d9 in at::vml::AVX512::vlog<float> (size=<optimized out>, in=<optimized out>, out=<optimized out>)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/cpu/vml.h:148
#13 at::vml::AVX512::vlog<float> (size=<optimized out>, in=0x7ff8e07b2b48, out=0x7ff8e081f348) at /home/ubuntu/tools/pytorch/aten/src/ATen/cpu/vml.h:148
#14 operator() (data_=<optimized out>, strides=<optimized out>, n=<optimized out>, __closure=<optimized out>)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:859
#15 0x00007ff90e78ced4 in operator() (size1=1310979, size0=2035, strides=0x0, base=0x7ff8e57e50d0, this=0x0) at /home/ubuntu/tools/pytorch/c10/util/SmallVector.h:321
#16 void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<auto at::TensorIteratorBase::loop_2d_from_1d<at::native::AVX512::log_kernel(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(char**, long const*, long)#1}>(at::native::AVX512::log_kernel(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(char**, long const*, long)#1} const&)::{lambda(char**, long const*, long, long)#1}>(long, char**, long const*, long, long) () at /home/ubuntu/tools/pytorch/c10/util/FunctionRef.h:43
#17 0x00007ff90a22d876 in c10::function_ref<void (char**, long const*, long, long)>::operator()(char**, long const*, long, long) const (params#3=1, params#2=2035, 
    params#1=<optimized out>, params#0=<optimized out>, this=<synthetic pointer>) at /home/ubuntu/tools/pytorch/c10/util/FunctionRef.h:63
#18 at::internal::serial_for_each(c10::ArrayRef<long>, c10::ArrayRef<long>, char**, unsigned long, c10::function_ref<void (char**, long const*, long, long)>, at::Range) (strides=..., strides=..., range=..., loop=..., ntensors=<optimized out>, base_ptrs=0x7ff8e57e5020, shape=...)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIteratorInternal.h:56
#19 at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const ()
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIterator.cpp:777
#20 0x00007ff90a22d9ca in operator() (end=<optimized out>, begin=12210, __closure=<optimized out>) at /home/ubuntu/tools/pytorch/aten/src/ATen/core/Range.h:9
#21 void at::internal::invoke_parallel<at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long)::{lambda(long, long)#1}>(long, long, long, at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long)::{lambda(long, long)#1} const&) [clone ._omp_fn.0] ()
    at /home/ubuntu/tools/pytorch/aten/src/ATen/ParallelOpenMP.h:42
#22 0x00007ff8ff4f27e6 in gomp_thread_start (xdata=<optimized out>) at ../../../gcc-10.3.0/libgomp/team.c:123
#23 0x00007ff908d0c6db in start_thread (arg=0x7ff8e57ea700) at pthread_create.c:463
#24 0x00007ff90839261f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The problem is the strides=0x0 and this=0x0 in the arguments of frame 15:

(gdb) frame 15
#15 0x00007ff90e78ced4 in operator() (size1=1310979, size0=2035, strides=0x0, base=0x7ff8e57e50d0, this=0x0) at /home/ubuntu/tools/pytorch/c10/util/SmallVector.h:321
321       pointer data() {
(gdb) info locals
i = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
data = {<c10::SmallVectorImpl<char*>> = {<c10::SmallVectorTemplateBase<char*, true>> = {<c10::SmallVectorTemplateCommon<char*, void>> = {<c10::SmallVectorBase<unsigned int>> = {BeginX = 0x7ff8e57e4f80, Size = 2, Capacity = 4}, <No data fields>}, 
      static TakesParamByValue = true}, <No data fields>}, <c10::SmallVectorStorage<char*, 4>> = {
    InlineElts = "H\363\201\340\370\177\000\000H+{\340\370\177\000\000\240\325=\341\370\177\000\000\200\037c|\370\177\000"}, <No data fields>}
outer_strides = <optimized out>
ntensor = <error reading variable ntensor (Cannot access memory at address 0x4)>
loop = <error reading variable loop (Cannot access memory at address 0x0)>
ntensor = <optimized out>
loop = <optimized out>
data = <optimized out>
outer_strides = <optimized out>
i = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
arg = <optimized out>
__for_range = <optimized out>
__for_begin = <optimized out>
__for_end = <optimized out>
(gdb) info args
size1 = 1310979
size0 = 2035
strides = 0x0
base = 0x7ff8e57e50d0
this = 0x0

It's very hard to see what's going on given all the optimisation:

(gdb) frame 18
#18 at::internal::serial_for_each(c10::ArrayRef<long>, c10::ArrayRef<long>, char**, unsigned long, c10::function_ref<void (char**, long const*, long, long)>, at::Range) (strides=..., strides=..., range=..., loop=..., ntensors=<optimized out>, base_ptrs=0x7ff8e57e5020, shape=...)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIteratorInternal.h:56
56            loop(ptrs.data(), strides.data(), range.size(), 1);
(gdb) info args
strides = <optimized out>
strides = <optimized out>
range = {begin = 12210, end = 14245}
loop = {
  callback = 0x7ff90e78cdf0 <void c10::function_ref<void (char**, long const*, long, long)>::callback_fn<auto at::TensorIteratorBase::loop_2d_from_1d<at::native::AVX512::log_kernel(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(char**, long const*, long)#1}>(at::native::AVX512::log_kernel(at::TensorIteratorBase&)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(char**, long const*, long)#1} const&)::{lambda(char**, long const*, long, long)#1}>(long, char**, long const*, long, long)>, callable = 140707155117976}
ntensors = <optimized out>
base_ptrs = 0x7ff8e57e5020
shape = <optimized out>
(gdb) info locals
ptrs = {storage_ = {_M_elems = {0x7ff8e081f348 "", 0x7ff8e07b2b48 "", 
      0x7ff8e13dd5a0 "6\216V\300\266U\366\277g*\202\300\214\277\200\300\354wj\300\246\257\207\300C\321\001\300\245\370g\300\371\315V\300[\364(\300\321\017\342\277>F\021\300\003\210\024\300Z\224\t\300\024\307\330\277:\020\254\277\032+\017\300(\024@\300\356m`\300\t\221j\300\374\310]\300kH?\300\341\321v\300[Zl\300'\272\f\300}\206\337\277\026Tn\300}\226C\300\330\345\343\277\356\352\005\300o\034p\300guf\300W+~\300\377\067!\300\253\255\016\300\271\357k\300\364\062\f\300\334,\024\300\323\267\272\277\022\fO\300&tZ\300f\240\216\300\210\270h\300\354\234V\300\362", 0xfe49d <error: Cannot access memory at address 0xfe49d>}}, size_ = 2, data_ = 0x7ff8e57e50d0}
ndim = <optimized out>
ndim = <optimized out>
__func__ = <optimized out>
ptrs = <optimized out>
ptrs = <optimized out>
counter = <optimized out>
step = <optimized out>
droberts195 commented 7 months ago

A bisection of the commits between the point where 1.13 branched off of main and the point where 2.1 branched off of main shows that the commit that causes the crash is https://github.com/pytorch/pytorch/commit/d4c8e37b8563a6a8448b6c0d6c0868fe0eb11366. Running with a build of that commit results in a SEGV with a stack trace that's basically the same as the one above. Running with the commit immediately before it, https://github.com/pytorch/pytorch/commit/0de81906ccb3473abd4f9bc50b218724c93193c6, does not suffer a crash.
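
The git side of the bisection is mechanical; a sketch, with the two endpoints written as placeholders for the branch-point commits mentioned above:

    cd pytorch
    git bisect start
    git bisect bad  <commit where release/2.1 branched off main>    # placeholder
    git bisect good <commit where release/1.13 branched off main>   # placeholder
    # At each step: rebuild libtorch, rerun the reindex with 1 allocation x 16 threads, then mark it
    git bisect good    # reindex completed without a crash
    git bisect bad     # pytorch_inference died with SEGV
    git bisect reset   # once the first bad commit has been identified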

The problematic commit came from https://github.com/pytorch/pytorch/pull/91963. That PR changed two files. The next step is to examine those changes more closely and see if there's an obvious bug, or if not, if reverting the changes in just one or other of the two files works and avoids the problem.

droberts195 commented 7 months ago

> if reverting the changes in just one or other of the two files works and avoids the problem.

Unfortunately, two builds from the v2.1.2 commit, each with the changes to just one of the two files from https://github.com/pytorch/pytorch/pull/91963 reverted, both crashed.

droberts195 commented 7 months ago

I tried building a completely unoptimised PyTorch 2.1.2 (using export CMAKE_BUILD_TYPE=Debug).
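
A sketch of that build, assuming the usual from-source PyTorch flow; the exact entry point for building just libtorch may differ:

    git clone --recursive https://github.com/pytorch/pytorch.git
    cd pytorch
    git checkout v2.1.2
    git submodule update --init --recursive
    export CMAKE_BUILD_TYPE=Debug    # unoptimised build with full debug symbols
    python setup.py develop          # or the libtorch-only build, depending on what is being tested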

This still crashes, with the following stack trace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f0a88252039 in mkl_vml_serv_threader_s_1i_1o.omp_fn ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#2  0x00007f0a88246fa0 in mkl_vml_serv_threader_s_1i_1o ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
#3  0x00007f0a88ce6da2 in vmsLn ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_intel_lp64.so
#4  0x00007f0a93d51afa in at::vml::AVX512::vlog<float> (out=0x7f0a647ec06c, in=0x7f0a65940cac, size=2035)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/cpu/vml.h:148
#5  0x00007f0a93c1a592 in operator() (__closure=0x7f0a73ff8368, data_=0x7f0a4748be20, strides=0x7f0a4748c100, n=2035)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:859
#6  0x00007f0a93c5424d in operator() (this=0x7f0a73ff8368, base=0x7f0a4748bfc0, strides=0x7f0a4748c100, size0=2035, size1=1)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIterator.h:406
#7  0x00007f0a93ccdf95 in c10::function_ref<void(char**, long int const*, long int, long int)>::callback_fn<at::TensorIteratorBase::loop_2d_from_1d<at::native::AVX512::log_kernel(at::TensorIteratorBase&)::<lambda()>::<lambda(:148
)>::<lambda(char**, const int64_t*, int64_t)> >::<lambda(char**, const int64_t*, int64_t, int64_t)> >(intptr_t, char **, const long *, long, long) (callable=139682872525672, params#0=0x7f0a4748bfc0, params#1=0x7f0a4748c100, 
    params#2=2035, params#3=1) at /home/ubuntu/tools/pytorch/c10/util/FunctionRef.h:43
#8  0x00007f0a8cada519 in c10::function_ref<void (char**, long const*, long, long)>::operator()(char**, long const*, long, long) const (
    this=0x7f0a4748c050, params#0=0x7f0a4748bfc0, params#1=0x7f0a4748c100, params#2=2035, params#3=1)
    at /home/ubuntu/tools/pytorch/c10/util/FunctionRef.h:64
#9  0x00007f0a8cad8fc3 in at::internal::serial_for_each(c10::ArrayRef<long>, c10::ArrayRef<long>, char**, unsigned long, c10::function_ref<void (char**, long const*, long, long)>, at::Range) (shape=..., strides=..., base_ptrs=0x7f0a4748c0d0, ntensors=2, loop=..., range=...)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIteratorInternal.h:56
#10 0x00007f0a8cad0e61 in at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const (
    this=0x7f0a73ff8520, loop=..., range=...) at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIterator.cpp:777
#11 0x00007f0a8cad0a9f in operator() (__closure=0x7f0a73ff8320, begin=18315, end=20350)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/TensorIterator.cpp:752
#12 0x00007f0a8cad7679 in at::internal::_ZN2at8internal15invoke_parallelIZNS_18TensorIteratorBase8for_eachEN3c1012function_refIFvPPcPKlllEEElEUlllE_EEvlllRKT_._omp_fn.0(void) () at /home/ubuntu/tools/pytorch/aten/src/ATen/ParallelOpenMP.h:42
#13 0x00007f0a8205b7e6 in gomp_thread_start (xdata=<optimized out>) at ../../../gcc-10.3.0/libgomp/team.c:123
#14 0x00007f0a8b84f6db in start_thread (arg=0x7f0a47491700) at pthread_create.c:463
#15 0x00007f0a8aed561f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This time this and strides are not shown as NULL pointers, but the stack trace still looks very similar even into MKL. It makes me think that they're only shown as NULL in the optimised trace because they were optimised away in some of the intermediate functions.

The last frame for which we have full debug info now does not show any problems:

(gdb) frame 4
#4  0x00007f0a93d51afa in at::vml::AVX512::vlog<float> (out=0x7f0a647ec06c, in=0x7f0a65940cac, size=2035)
    at /home/ubuntu/tools/pytorch/aten/src/ATen/cpu/vml.h:148
148     IMPLEMENT_VML_MKL(log, Ln)
(gdb) info locals
max_mkl_ind = 2147483647
(gdb) info args
out = 0x7f0a647ec06c
in = 0x7f0a65940cac
size = 2035
(gdb) print *((float*)0x7f0a65940cac)@2035
$6 = {1 <repeats 353 times>, 1.34128916, 1 <repeats 22 times>, 1.08084893, 1 <repeats 548 times>, 1.53102648, 1 <repeats 897 times>, 1.5380137, 
  1 <repeats 211 times>}
(gdb) print *((float*)0x7f0a647ec06c)@2035
$5 = {0 <repeats 353 times>, 0.341289192, 0 <repeats 22 times>, 0.0808489323, 0 <repeats 548 times>, 0.531026483, 0 <repeats 897 times>, 0.538013756, 
  0 <repeats 211 times>}

So the in and out arrays are wholly within permitted memory bounds and don't explain the SEGV. It looks like an intermediate stage of the natural logarithm calculation has been reached where each element of out is set to 1 less than the corresponding value of in.

Inside MKL we have no symbols, but the instruction in frame 1 was this:

(gdb) frame 1
#1  0x00007f0a88252039 in mkl_vml_serv_threader_s_1i_1o.omp_fn ()
   from /home/ubuntu/elasticsearch-8.13.0-SNAPSHOT/modules/x-pack-ml/platform/linux-x86_64/bin/../lib/libmkl_gnu_thread.so
(gdb) x/i 0x00007f0a88252039
=> 0x7f0a88252039 <mkl_vml_serv_threader_s_1i_1o.omp_fn.43+85>: callq  0x7f0a86ac1b80 <omp_get_thread_num@plt>

omp_get_thread_num is a function in libgomp.so.1, which libmkl_gnu_thread.so does not explicitly link to:

$ ldd libmkl_gnu_thread.so
        linux-vdso.so.1 (0x00007ffd283b0000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0c7164f000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0c71430000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0c7103f000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0c735b3000)

Instead, it relies on libgomp.so.1 already having been loaded into the current process.

The segmentation fault looks like it's occurring because the procedure linkage table does not contain the address of omp_get_thread_num, but a NULL pointer instead. For security reasons we link our programs with -Wl,-z,relro -Wl,-z,now. It may be that this is the cause of the problem.
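
A sketch of how to confirm that picture with standard binutils, run from the shipped lib directory (the expected output is an assumption based on the analysis above):

    # omp_get_thread_num should show up as an undefined ("U") dynamic symbol in the MKL threading layer
    nm -D libmkl_gnu_thread.so | grep omp_get_thread_num

    # ...while libgomp is absent from its DT_NEEDED entries, matching the ldd output above
    readelf -d libmkl_gnu_thread.so | grep NEEDED

    # pytorch_inference itself is linked with -z now, so its dynamic section should carry BIND_NOW / NOW flags
    readelf -d ../bin/pytorch_inference | grep -i flags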

It's not completely clear why this problem would be exposed by a commit in PyTorch that didn't change the build options in any way. But it might be that, by increasing the use of parallelisation, the MKL library decides to use OpenMP when it never did before.

droberts195 commented 7 months ago

It seems that this problem can be fixed by upgrading to the latest version of MKL. At least, it appears to work when I build everything by hand. It may be that PyTorch 2.x is incompatible with the MKL from 2020, or uses it in a new way that exposes a bug.
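
As a quick check of which MKL is actually bundled, the shared libraries usually embed a version banner; a sketch (the library name is an assumption, any of the shipped libmkl_*.so should do):

    # Look for the embedded MKL version string in one of the shipped libraries
    strings libmkl_core.so | grep -m1 -i "Kernel Library"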

Next week I'll try to do a build using a proper Docker image with the latest MKL installed and see if that works too.

droberts195 commented 7 months ago

The changes in #2619, as built by that PR's CI, fix the problem for my reproduction steps in https://github.com/elastic/ml-cpp/issues/2617#issuecomment-1910716107 when copied over manually to my test machine. Therefore it's pretty certain that that PR fixes this issue.