elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Double deployment of trained model causes assertion error #105518

Open maxhniebergall opened 7 months ago

maxhniebergall commented 7 months ago

Elasticsearch Version

8.14.0-SNAPSHOT

Installed Plugins

No response

Java Version

JBR-17.0.9+8-1166.2-nomod

OS Version

23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:30:44 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6000 arm64

Problem Description

When locally building elasticsearch (in debug mode), an assertion error occurs when attempting to perform inference.

[2024-02-14T14:14:06,790][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [runTask-0] fatal error in thread [elasticsearch[runTask-0][ml_native_inference_comms][T#3]], exiting java.lang.AssertionError
        at org.elasticsearch.ml@8.14.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.NlpInferenceInput.extractInput(NlpInferenceInput.java:55)
        at org.elasticsearch.ml@8.14.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.deployment.InferencePyTorchAction.doRun(InferencePyTorchAction.java:104)
        at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
        at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
        at org.elasticsearch.ml@8.14.0-SNAPSHOT/org.elasticsearch.xpack.ml.inference.pytorch.PriorityProcessWorkerExecutorService$OrderedRunnable.run(PriorityProcessWorkerExecutorService.java:58)
        at org.elasticsearch.ml@8.14.0-SNAPSHOT/org.elasticsearch.xpack.ml.job.process.AbstractProcessWorkerExecutorService.start(AbstractProcessWorkerExecutorService.java:122)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
        at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:917)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

Steps to Reproduce

  1. Create a deployment (I don't think this necessarily needs to be with the inference service, but that's what I tried):
(base) mh@Maxs-MacBook-Pro elasticsearch % curl -X PUT "localhost:9200/_inference/text_embedding/a-deployment-id2?pretty" \
-H 'Content-Type: application/json' -u elastic-admin:elastic-password \
-d'
  {
    "service": "text_embedding",
    "service_settings": {
      "num_allocations": 1,
      "num_threads": 1,
      "model_id": ".multilingual-e5-small"
    }
  }
'
{
  "model_id" : "a-deployment-id2",
  "task_type" : "text_embedding",
  "service" : "text_embedding",
  "service_settings" : {
    "num_allocations" : 1,
    "num_threads" : 1,
    "model_id" : ".multilingual-e5-small"
  },
  "task_settings" : { }
}
  2. Put the same model:
PUT /_ml/trained_models/.multilingual-e5-small?pretty
{
  "input": {
    "field_names": ["text_field"]
  }
}
  3. Run inference; the process crashes with the assertion error shown in the problem description above (an example request is sketched below this list).
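
For reference, the exact inference request used isn't shown above; a minimal request that exercises this code path might look like the following (assuming the _inference API of this snapshot):

POST /_inference/text_embedding/a-deployment-id2?pretty
{
  "input": "some text to embed"
}

With assertions enabled, as in this local debug build, this kind of request brings the node down with the AssertionError above.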

Logs (if relevant)

No response

elasticsearchmachine commented 7 months ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 7 months ago

So the tripping assertion is this: https://github.com/elastic/elasticsearch/blob/e443a7b6baed6ad10d3ba1c978c2228d9280b94e/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/deployment/NlpInferenceInput.java#L55
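
For readers without the source handy, here is a hypothetical reconstruction of the shape of that check; the field and method names are guesses for illustration, not the actual NlpInferenceInput code:

import java.util.List;
import java.util.Map;

// Illustrative sketch only. Assumption: the input wraps either a raw request
// string or a source document, and extractInput reads the model's configured
// input field out of the document.
final class NlpInputSketch {
    private final String inputText;        // set for plain inference requests
    private final Map<String, Object> doc; // set for document-based requests

    NlpInputSketch(String inputText, Map<String, Object> doc) {
        this.inputText = inputText;
        this.doc = doc;
    }

    String extractInput(List<String> fieldNames) {
        if (inputText != null) {
            return inputText;
        }
        Object value = doc.get(fieldNames.get(0));
        // If the model's input definition no longer matches the document
        // (plausible after the second PUT overwrites it), value is null and
        // a debug build dies here with a bare java.lang.AssertionError.
        assert value instanceof String;
        return (String) value;
    }

    public static void main(String[] args) {
        NlpInputSketch sketch = new NlpInputSketch(null, Map.of("other_field", "hello"));
        System.out.println(sketch.extractInput(List.of("text_field"))); // trips with -ea
    }
}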

maxhniebergall commented 7 months ago

I also tried disabling the assertion, and the process still crashes, so I'm actively investigating this one.
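
For context, assertions are a JVM-level switch: enabled with -ea (evidently the case in this local debug build, since the AssertionError fired) and disabled with -da. A generic illustration, not the exact command used here:

java -da -jar some-app.jar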

maxhniebergall commented 7 months ago

On main, I tried just putting the inference service and then doing the reindex, and I am still getting this error. I am going to do a clean build to make sure this isn't picking up a bad artifact from one of my development branches. I would be surprised if this error actually occurs on main, since it appears to prevent any inference with the internal inference services.
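
The reindex setup isn't shown above; a minimal version of the kind of reindex meant here would be an ingest pipeline with an inference processor pointing at the endpoint, then a reindex through it (the pipeline and index names below are hypothetical):

PUT /_ingest/pipeline/e5-embeddings
{
  "processors": [
    { "inference": { "model_id": "a-deployment-id2" } }
  ]
}

POST /_reindex
{
  "source": { "index": "source-index" },
  "dest": { "index": "dest-index", "pipeline": "e5-embeddings" }
}

Each reindexed document would need a text_field for the model's input definition, which puts it on the document-extraction path from the stack trace above.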

maxhniebergall commented 7 months ago

Still on main, I did a gradlew clean and a gradlew build. The build failed for a seemingly unrelated reason about a search outage ("all shards failed"). A lot of builds seem to be failing this morning, so probably someone broke something.

I tried adding a new integration test for the double-deployment issue, but the integration tests passed. I then ran the server and did a manual test, and it failed again.
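
For context, a sketch of such a test might look like the following; this is a simplified illustration (the real ML REST tests need model download and license setup that is omitted here), not the actual test code:

import org.elasticsearch.client.Request;
import org.elasticsearch.test.rest.ESRestTestCase;

public class DoubleDeploymentIT extends ESRestTestCase {

    public void testInferenceAfterDoubleDeployment() throws Exception {
        // 1. Create the inference endpoint, as in the reproduction steps.
        Request createEndpoint = new Request("PUT", "/_inference/text_embedding/a-deployment-id2");
        createEndpoint.setJsonEntity("""
            {
              "service": "text_embedding",
              "service_settings": {
                "num_allocations": 1,
                "num_threads": 1,
                "model_id": ".multilingual-e5-small"
              }
            }""");
        client().performRequest(createEndpoint);

        // 2. Put the same model again, overwriting its input definition.
        Request putModel = new Request("PUT", "/_ml/trained_models/.multilingual-e5-small");
        putModel.setJsonEntity("""
            { "input": { "field_names": ["text_field"] } }""");
        client().performRequest(putModel);

        // 3. Inference should succeed instead of killing the node with an AssertionError.
        Request infer = new Request("POST", "/_inference/text_embedding/a-deployment-id2");
        infer.setJsonEntity("""
            { "input": "some text to embed" }""");
        assertOK(client().performRequest(infer));
    }
}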