elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

pytorch_inference crash on Intel x86 mac #109778

Open pointearth opened 3 weeks ago

pointearth commented 3 weeks ago

Elasticsearch Version

8.14

Installed Plugins

No response

Java Version

bundled

OS Version

macOS 13.6.7 on MacBook Pro 15-inch, 2017

Problem Description

Following your official lab, I am trying to use LangChain RAG with Elasticsearch.

Steps to Reproduce

  1. Install the Python libraries (langchain, langchainhub, 'eland[pytorch]', torch, torchvision, torchaudio).
  2. Install Elasticsearch and Kibana.
  3. Start a trial license in Kibana.
  4. Download and deploy the ".elser_model_2" model in Kibana.
  5. Create an index named "rag-example" in Kibana.
  6. Write data into Elasticsearch (save_docs.py). It succeeds; I can see there are 2 records in the index.
  7. Send questions to Elasticsearch and get this response: raise HTTP_EXCEPTIONS.get(meta.status, ApiError)( elasticsearch.ApiError: ApiError(500, 'status_exception', 'Error in inference process: [inference canceled as process is stopping]') (see the attached code.zip)
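For reference, step 7 boils down to a `_search` request with a `text_expansion` query against the ELSER deployment, which is what triggers `pytorch_inference`. A minimal stdlib sketch of that request follows; the `localhost:9200` URL and the `vector.tokens` field name are assumptions (LangChain's ElasticsearchStore with its sparse-vector strategy writes the expansion under a field like this by default), while the index name and model ID come from the steps above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: default local cluster, security disabled
INDEX = "rag-example"             # index name from step 5
MODEL_ID = ".elser_model_2"       # ELSER model deployed in step 4


def build_elser_query(question: str) -> dict:
    """Build the text_expansion search body sent on each question.

    The sparse-vector field name "vector.tokens" is an assumption about
    how the documents were indexed; adjust it to match your mapping.
    """
    return {
        "_source": ["metadata", "text"],
        "query": {
            "text_expansion": {
                "vector.tokens": {
                    "model_id": MODEL_ID,
                    "model_text": question,
                }
            }
        },
    }


def search(question: str) -> dict:
    """POST the query to /rag-example/_search.

    This is the call that invokes the ELSER deployment; on the crashing
    setup it returns the 500 status_exception from the issue.
    """
    req = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(build_elser_query(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```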

Logs (if relevant)

Elasticsearch logs:

[2024-06-15T10:37:27,661][ERROR][o.e.x.m.p.AbstractNativeProcess] [simonMarvel.local] [.elser_model_2_1] pytorch_inference/17196 process stopped unexpectedly: 
[2024-06-15T10:37:27,666][ERROR][o.e.x.m.i.d.DeploymentManager] [simonMarvel.local] [.elser_model_2_1] inference process crashed due to reason [[.elser_model_2_1] pytorch_inference/17196 process stopped unexpectedly: ]
[2024-06-15T10:37:27,662][WARN ][r.suppressed             ] [simonMarvel.local] path: /rag-example/_search, params: {index=rag-example, _source_includes=metadata,text}, status: 500
org.elasticsearch.ElasticsearchStatusException: Error in inference process: [inference canceled as process is stopping]
    at org.elasticsearch.xpack.ml.inference.deployment.AbstractPyTorchAction.onFailure(AbstractPyTorchAction.java:114) ~[?:?]
    at org.elasticsearch.xpack.ml.inference.deployment.InferencePyTorchAction.processResult(InferencePyTorchAction.java:182) ~[?:?]
    at org.elasticsearch.xpack.ml.inference.deployment.InferencePyTorchAction.lambda$doRun$3(InferencePyTorchAction.java:151) ~[?:?]
    at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:171) ~[elasticsearch-8.14.1.jar:?]
    at org.elasticsearch.xpack.ml.inference.pytorch.process.PyTorchResultProcessor.lambda$notifyAndClearPendingResults$3(PyTorchResultProcessor.java:145) ~[?:?]
    at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1605) ~[?:?]
    at org.elasticsearch.xpack.ml.inference.pytorch.process.PyTorchResultProcessor.notifyAndClearPendingResults(PyTorchResultProcessor.java:144) ~[?:?]
    at org.elasticsearch.xpack.ml.inference.pytorch.process.PyTorchResultProcessor.process(PyTorchResultProcessor.java:137) ~[?:?]
    at org.elasticsearch.xpack.ml.inference.deployment.DeploymentManager.lambda$startDeployment$2(DeploymentManager.java:180) ~[?:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917) ~[elasticsearch-8.14.1.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1570) ~[?:?]
[2024-06-15T10:37:27,667][INFO ][o.e.x.m.i.d.DeploymentManager] [simonMarvel.local] Inference process [.elser_model_2_1] failed due to [[.elser_model_2_1] pytorch_inference/17196 process stopped unexpectedly: ]. This is the [3] failure in 24 hours, and the process will be restarted.
[2024-06-15T10:37:28,273][INFO ][o.e.x.m.i.d.DeploymentManager] [simonMarvel.local] [.elser_model_2_1] Starting model deployment of model [.elser_model_2]

=====================================================================

Crash reports: see the attachment torch_inference_log.pl.zip

elasticsearchmachine commented 2 weeks ago

Pinging @elastic/ml-core (Team:ML)

pointearth commented 2 weeks ago

Is there any situation like the following:

davidkyle commented 2 weeks ago

Thanks for the crash report. The pytorch_inference process crashed with a segmentation violation. It looks like you are using an Intel-based Mac; our testing runs on Apple silicon. I will try to reproduce the issue on similar hardware.

pointearth commented 2 weeks ago

@davidkyle Appreciate it! Yes, my laptop is an Intel-based Mac. If it doesn't work because of the Intel chip, that is acceptable. I would suggest putting a warning on the ML overview page when hardware that isn't supported yet is detected. You can close this issue. Thanks!

davidkyle commented 2 weeks ago

I reproduced the same crash in pytorch_inference on macOS 14.5 with Elasticsearch version 8.14.2. On the same machine, Elasticsearch version 8.12.2 does not crash; I suspect the cause of the crash is the upgrade to PyTorch 2.1.2 (https://github.com/elastic/ml-cpp/pull/2588) in version 8.13.

Sadly I think this cannot be fixed until the next upgrade of PyTorch and I can't give you a timeline on that.

Machine Learning on x86 Macs is deprecated and won't be supported after December 2024. You might have better luck running it inside Docker as suggested here, or you can temporarily downgrade to 8.12.

https://github.com/elastic/elasticsearch/pull/104087/files#diff-d71d08cc6d198f07a28ee2e1179053197941de85851c6797f30a516af523d158R927-R930
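The Docker workaround mentioned above can be sketched as follows. This is illustrative, not an official recipe: the container name, port mapping, and disabled security are assumptions for a local single-node test, and the 8.14.1 tag is chosen only because the Linux build inside the container sidesteps the macOS x86 pytorch_inference binary.

```shell
# Run Elasticsearch in Docker so the Linux build of pytorch_inference
# is used instead of the crashing macOS x86 binary.
docker run -d --name es-ml-test \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.14.1
```

Alternatively, pinning the image tag to 8.12.2 (the version reported above as not crashing) is a second fallback until the next PyTorch upgrade lands.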