elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] ELSER crashes in local serverless setup #106206

Open jonathan-buttner opened 6 months ago

jonathan-buttner commented 6 months ago

Description

When interacting with ELSER in a local serverless setup, the model crashes when attempting to perform inference.

Steps to reproduce

  1. Check out Kibana and bootstrap it
  2. Start Elasticsearch serverless locally: yarn es serverless --projectType=security --ssl
  3. Start Kibana locally: yarn start --serverless=security --ssl
  4. Download ELSER
  5. Deploy ELSER via the inference API:
PUT _inference/sparse_embedding/elser
{
  "service": "elser",
  "service_settings": {
    "model_id": ".elser_model_2",
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
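If needed, the endpoint created above can be verified with the GET inference API before moving on (a quick sanity check, using the same endpoint id):
GET _inference/sparse_embedding/elser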
  6. Add an ingest pipeline with an inference processor:
PUT _ingest/pipeline/elser
{
  "processors": [
    {
      "inference": {
        "model_id": "elser",
        "input_output": [
            {
                "input_field": "content",
                "output_field": "text_embedding"
            }
        ]
      }
    },
    {
      "set": {
        "field": "timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
  7. Attempt to perform inference by simulating the pipeline:
POST _ingest/pipeline/elser/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "hello"
      }
    }
  ]
}
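Alternatively, the same pipeline can be exercised against a real index instead of the simulate API; a minimal sketch using a hypothetical index name my-index:
PUT my-index/_doc/1?pipeline=elser
{
  "content": "hello"
}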
  8. Retrieve the stats from the trained models API to observe that the process has crashed:
            "routing_state": {
              "routing_state": "failed",
              "reason": """inference process crashed due to reason [[my-elser-model] pytorch_inference/659 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff83b20140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83a13000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)
]"""
            },
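For reference, the routing state shown above comes from the trained models stats API, retrieved with something like:
GET _ml/trained_models/_stats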
elasticsearchmachine commented 6 months ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 6 months ago

I just confirmed that these steps don't cause a crash in the ESS CFT region running 8.14.0-SNAPSHOT. This is interesting, because the code should be very similar.

Serverless is running on c6i instances in AWS. CFT is running on n2 instances in GCP. So the problem might be down to serverless or might be down to the exact type of hardware.

droberts195 commented 6 months ago

Logs show the crash happened on ARM:

"inference process crashed due to reason [[.elser_model_2] pytorch_inference/644 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff7a188140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff7a07b000, normalized address: 0x10d140', version: 8.14.0-SNAPSHOT (build 38a5b0ec077958)\n]"

ML nodes on serverless are supposed to be on Intel hardware. I just tried reproducing this in a serverless project and the steps worked fine. However, as expected, my ML node was on Intel.

So it may be that the bug here is really "ELSER crashes on ARM".

And then the next question would be how did we end up with an ML node on ARM in serverless?
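One way to check which architecture each node is actually on is the nodes info API; os.arch in the response shows aarch64 on ARM nodes (a quick check, not specific to serverless):

GET _nodes/os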

droberts195 commented 6 months ago

Just reading through the report more closely, this wasn't even using real serverless. It was using simulated serverless running locally on a Mac. That explains why it was on ARM.

But also, running locally on a Mac, it's running Docker images in a Linux VM. We don't know how much memory that Linux VM had. It may be that it was trying to do too much in too little memory and because of the vagaries of Docker on a Mac that ended up as a SEGV rather than an out-of-memory error.
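For anyone checking locally, the amount of memory given to Docker's Linux VM can be read from the daemon itself, for example:

# total memory available to Docker's Linux VM, in bytes
docker info --format '{{.MemTotal}}'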

Given the circumstances I don't think this bug is anywhere near as serious as the title makes it sound.

droberts195 commented 6 months ago

I tried these steps on an m6g.2xlarge AWS instance, and they ran successfully without the process crashing.

(Originally, I tried on an m6g.large instance with 8GB RAM, and there pytorch_inference was killed by the OOM killer. But that was running Elasticsearch as a single-node cluster, so 50% of the memory went to the JVM heap, and Kibana was also running on the same machine. So that problem really was due to lack of memory. On the 32GB m6g.2xlarge, inference worked fine.)

Therefore, this problem really does seem to be confined to running in a Docker container in a Linux VM on ARM macOS. It's not great that this crash happens, and it's still a bug that running in Docker on a Mac doesn't work, but at least it's not going to affect customers in production.

maxjakob commented 4 months ago

I encountered this bug yesterday trying to set up some integration tests locally on my Mac through Docker. The problem is not ELSER-specific but happens for other trained models too. For local dev it would be quite nice to have this working.

sophiec20 commented 4 months ago

@maxjakob Which other trained models did you try?

maxjakob commented 4 months ago

I deployed sentence-transformers/msmarco-minilm-l-12-v3 with Eland, which worked fine, but upon ~search~ inference I got

"type": "status_exception",
"reason": "Error in inference process: [inference canceled as process is stopping]"

and the logs showed

... "message":"Inference process [sentence-transformers__msmarco-minilm-l-12-v3] failed due to [[sentence-transformers__msmarco-minilm-l-12-v3] pytorch_inference/229 process stopped unexpectedly:
Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff8407c140,
library: /lib/aarch64-linux-gnu/libc.so.6,
base: 0xffff83f6f000, normalized address: 0x10d140', version: 8.13.2 (build fdd7177d8c1325)\n]. This is the [1] failure in 24 hours, and the process will be restarted.", ...

(line breaks added by me to show that it's the same issue as reported above)
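For reference, the Eland import mentioned above is typically run along these lines (the URL and credentials here are placeholders for a local cluster, and the exact Hub model id casing may differ):

eland_import_hub_model \
  --url https://elastic:changeme@localhost:9200 \
  --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
  --task-type text_embedding \
  --start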

maxjakob commented 4 months ago

And I should add, this was with a regular Elasticsearch docker.elastic.co/elasticsearch/elasticsearch container, not with serverless!

tveasey commented 4 months ago

Looking back over the comments on this issue, I'm trying to understand whether the problem is running the Linux version of our inference code on ARM Macs.

There is no reason to expect that the instructions libtorch uses will be supported if they don't exist on the target platform: it uses a lot of hand-rolled SIMD code via MKL. These instructions are sometimes emulated, but that isn't guaranteed.

I would have bet that this is the cause, except the latest error report was for a SIGSEGV (11) rather than a SIGILL (4). In any case, I think we need to understand exactly which build of our inference code is being run in this scenario.
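For what it's worth, the ML info API should report which native build a node is running; something like the following returns the native_code version and build hash:

GET _ml/info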

davidkyle commented 4 months ago

I've tested a bunch of different versions in Docker, and the good news is that before 8.13 you can run the ELSER model in Docker on macOS without it crashing.

In 8.13 libtorch was upgraded (https://github.com/elastic/ml-cpp/pull/2612) to 2.1.2 from 1.13. This was a major version upgrade and could have introduced some incompatibility. MKL was also upgraded in 8.13 but that shouldn't be a problem as MKL is only used in the Linux x86 build and these crashes are on Aarch64 (library: /lib/aarch64-linux-gnu/libc.so.6).

Perhaps something changed in the way the Docker image is created in 8.13; it would be a good first step to eliminate that possibility.
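A quick way to compare image versions on an ARM Mac is to run a throwaway single-node container per version; a rough sketch (the settings here are just for a local test, and the trial license is needed for the ML features):

# disposable single-node container for a given version, e.g. 8.12.2 vs 8.13.0
docker run --rm -p 9200:9200 \
  -e discovery.type=single-node \
  -e xpack.security.enabled=false \
  -e xpack.license.self_generated.type=trial \
  -e ES_JAVA_OPTS="-Xms2g -Xmx2g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.2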

droberts195 commented 4 months ago

https://github.com/oneapi-src/oneDNN/pull/1832 looks interesting.

jonathan-buttner commented 4 months ago

Including some ideas from @davidkyle

tushar8590 commented 4 months ago

I am using macOS 13.6.6 on Intel hardware. I have self-hosted Elasticsearch version 8.13.2 on a local machine and I am getting the same error while running inference on a Hugging Face model (sentence-transformers__stsb-distilroberta-base-v2). Can someone help troubleshoot the issue?

fred-maussion commented 1 month ago

Facing the same issue across several different environments where I can't use ELSER (.elser_model_2_linux-x86_64). The model deploys correctly but crashes as soon as I try to call it.

Environment 1

Environment 2

Environment 3

Error

On every environment, I see the following behavior.

The models are deployed correctly (screenshot omitted).

But I get an error on the ELSER Linux version as soon as I try to ingest the Observability Knowledge Base (error screenshot omitted).

Let me know if I can help.

sejbot commented 3 weeks ago

I also experience this issue. Running plain Elasticsearch, not serverless.

My setup is a MacBook M1 Pro running macOS Sonoma 14.5, and I am running Elasticsearch in a Docker container for local development and integration tests. I am using the bundled E5 model in Elasticsearch. Deployment works fine, but inference crashes the model.

I have tested it on 8.12.2, 8.14.1 and 8.15.0. 8.12.2 works fine, but on 8.14.1 and 8.15.0 the model crashes when using inference. This is the error output I get: [.multilingual-e5-small] inference process crashed due to reason [[.multilingual-e5-small] pytorch_inference/982 process stopped unexpectedly: Fatal error: 'si_signo 11, si_code: 1, si_errno: 0, address: 0xffff839a0140, library: /lib/aarch64-linux-gnu/libc.so.6, base: 0xffff83893000, normalized address: 0x10d140', version: 8.15.0 (build 64f00009177815)
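For anyone trying to reproduce with the bundled E5 model on 8.14+, a minimal sketch using the inference API (the endpoint name e5-test is arbitrary):

PUT _inference/text_embedding/e5-test
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".multilingual-e5-small",
    "num_allocations": 1,
    "num_threads": 1
  }
}

POST _inference/text_embedding/e5-test
{
  "input": "hello"
}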

edsavage commented 3 weeks ago

The crashes seen when running Elasticsearch in a Docker container on Apple Silicon machines are almost certainly due to the xbyak_aarch64 bug addressed by the fix mentioned above - https://github.com/elastic/elasticsearch/issues/106206#issuecomment-2110259312. I reproduced the crash and obtained a stack trace:

#0  raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000ffffa3ac34d4 in ml::core::crashHandler (sig=11, info=0xffff9d7f8940, context=<optimized out>) at /ml-cpp/lib/core/CCrashHandler_Linux.cc:65
#2  <signal handler called>
#3  0x0000ffffa359a140 in __aarch64_cas4_acq () from /lib/aarch64-linux-gnu/libc.so.6
#4  0x0000ffffa352c560 in __GI___readdir64 (dirp=dirp@entry=0x0) at ../sysdeps/posix/readdir.c:44
#5  0x0000ffffa8abeb34 in Xbyak_aarch64::util::Cpu::getFilePathMaxTailNumPlus1 (this=this@entry=0xffffaaf12730 <dnnl::impl::cpu::aarch64::cpu()::cpu_>, path=path@entry=0xffffa9d1cd48 "/sys/devices/system/node/node") at /usr/src/pytorch/third_party/ideep/mkl-dnn/src/cpu/aarch64/xbyak_aarch64/src/util_impl.cpp:175

This bug seems to have been fixed back in March, and hence PyTorch 2.3.1 is unaffected.

toughcoding commented 2 weeks ago

Is there a procedure available for upgrading PyTorch in the Elasticsearch Docker image, or do we have to do it ourselves?