elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Internal Server Error: version_conflict_engine_exception #100031

Closed: mtojek closed this issue 11 months ago

mtojek commented 3 years ago

Spotted in: https://github.com/elastic/integrations/pull/987

I found this error after a failed CI test run:

kibana_1                     | {"type":"log","@timestamp":"2021-05-13T13:28:17+00:00","tags":["error","plugins","fleet"],"pid":961,"message":"ResponseError: version_conflict_engine_exception\n    at onBody (/usr/share/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:337:23)\n    at IncomingMessage.onEnd (/usr/share/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:264:11)\n    at IncomingMessage.emit (events.js:327:22)\n    at endReadableNT (internal/streams/readable.js:1327:12)\n    at processTicksAndRejections (internal/process/task_queues.js:80:21) {\n  meta: {\n    body: { error: [Object], status: 409 },\n    statusCode: 409,\n    headers: {\n      'content-type': 'application/json; charset=UTF-8',\n      'content-length': '601'\n    },\n    meta: {\n      context: null,\n      request: [Object],\n      name: 'elasticsearch-js',\n      connection: [Object],\n      attempts: 0,\n      aborted: false\n    }\n  }\n}"}
kibana_1                     | {"type":"error","@timestamp":"2021-05-13T13:28:17+00:00","tags":[],"pid":961,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n    at HapiResponseAdapter.toError (/usr/share/kibana/src/core/server/http/router/response_adapter.js:124:19)\n    at HapiResponseAdapter.toHapiResponse (/usr/share/kibana/src/core/server/http/router/response_adapter.js:78:19)\n    at HapiResponseAdapter.handle (/usr/share/kibana/src/core/server/http/router/response_adapter.js:73:17)\n    at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:164:34)\n    at runMicrotasks (<anonymous>)\n    at processTicksAndRejections (internal/process/task_queues.js:93:5)\n    at handler (/usr/share/kibana/src/core/server/http/router/router.js:124:50)\n    at exports.Manager.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)\n    at Object.internals.handler (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)\n    at exports.execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)\n    at Request._lifecycle (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:370:32)\n    at Request._execute (/usr/share/kibana/node_modules/@hapi/hapi/lib/request.js:279:9)"},"url":"http://127.0.0.1:5601/api/fleet/agents/4ae5c899-6169-4b5d-a014-082ec360dcee/reassign","message":"Internal Server Error"}

Source: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/PR-987/4/artifact/build/elastic-stack-dump/latest/zeek/logs/kibana.log

Elasticsearch: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/PR-987/4/artifact/build/elastic-stack-dump/latest/zeek/logs/elasticsearch.log

elasticmachine commented 3 years ago

Pinging @elastic/fleet (Team:Fleet)

mtojek commented 3 years ago

Another occurrence:

https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/339/pipeline

[2021-05-25T04:07:21.978Z] Error: error running package system tests: could not complete test run: failed to teardown runner: error reassigning original policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception"}

mtojek commented 3 years ago

Another one from today:

https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/355/pipeline

mtojek commented 3 years ago

@jen-huang @ruflin I think this issue requires a higher priority. It would be great if somebody could look into this.

jen-huang commented 3 years ago

There was a previous report in #97114; not sure if the root cause is the same or different. FYI for the investigation.

mtojek commented 3 years ago

Today we hit two failures in a single build.

[2021-06-23T10:49:42.420Z] Error: error running package system tests: could not complete test run: could not assign policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [911f5746-2072-486c-bcee-f46ad10d130f]: version conflict, required seqNo [8], primary term [1]. current document has seqNo [10] and primary term [1]"}

Pipeline: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/448/pipeline/321
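
For context, the seqNo/primary term pair in that message are Elasticsearch's optimistic concurrency tokens: an update that passes if_seq_no/if_primary_term is rejected with a 409 if any other write touched the document in between. A minimal sketch of that behaviour (the index name, field name, and client setup are assumptions for illustration, not the actual Fleet code):

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

async function reassignSketch(agentId: string) {
  // Read the agent document together with its concurrency tokens.
  const { body: doc } = await es.get({ index: '.fleet-agents', id: agentId });

  try {
    // Conditional update: only applied if nobody else wrote the document
    // since the read above. '.fleet-agents' and 'policy_id' are assumed names.
    await es.update({
      index: '.fleet-agents',
      id: agentId,
      if_seq_no: doc._seq_no,
      if_primary_term: doc._primary_term,
      body: { doc: { policy_id: 'new-policy-id' } },
    });
  } catch (err: any) {
    // If another writer bumped the document in between, Elasticsearch answers 409:
    // "version conflict, required seqNo [8] ... current document has seqNo [10]"
    console.error(err.statusCode, err.message);
  }
}
```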

mtojek commented 3 years ago

Another one: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/500/pipeline

afgomez commented 3 years ago

A small update.

Looking at the Kibana logs, the error comes from a call to /api/fleet/agents/{agentId}/reassign:

{ ... "url":"http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign","message":"Internal Server Error"}

In the test log we can see the endpoint being called twice:

...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG reassigning original policy back to agent...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG reassigning original policy back to agent...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign
...

My guess is that these two calls are causing a race condition that makes the test fail.

I'm going to dig into the test side to see why the two calls are being made. I'd like to know, though, whether it also makes sense to address this on the Kibana side (by reporting a better error, or by retrying the request when the error happens). @elastic/fleet WDYT?

mtojek commented 3 years ago

@afgomez and I discussed this offline and came to the following conclusions:

  1. The two calls to /reassign are caused by a bug in elastic-package. It is potentially fixed by https://github.com/elastic/elastic-package/commit/7ca3684a19090de1407b4f90c8fbc7bd8d61ba58.
  2. The second call is a consequence of the first one: the first request ended with HTTP 500, the second one with HTTP 200 (confirmed in the logs).
  3. It is still unclear what caused the first request to fail on the Kibana side.

mtojek commented 3 years ago

Hi @afgomez @jen-huang ,

It seems that the issue is not fully solved: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/571/pipeline/

[2021-07-22T04:09:39.580Z] 2021/07/22 04:09:39 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG filter agents using criteria: NamePrefix=docker-fleet-agent
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG found 1 enrolled agent(s)
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG creating test policy...
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG POST http://127.0.0.1:5601/api/fleet/agent_policies
[2021-07-22T04:09:43.130Z] 2021/07/22 04:09:42 DEBUG adding package data stream to test policy...
[2021-07-22T04:09:43.130Z] 2021/07/22 04:09:42 DEBUG POST http://127.0.0.1:5601/api/fleet/package_policies
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG deleting old data in data stream...
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG found 0 hits in logs-cisco.ios-ep data stream
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG GET http://127.0.0.1:5601/api/fleet/agent_policies/a2746270-eaa2-11eb-9a9b-775da073752c
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG assigning package data stream to agent...
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84/reassign
[2021-07-22T04:09:46.421Z] 2021/07/22 04:09:46 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:46.421Z] 2021/07/22 04:09:46 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:48.327Z] 2021/07/22 04:09:48 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:48.328Z] 2021/07/22 04:09:48 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:50.232Z] 2021/07/22 04:09:50 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:50.232Z] 2021/07/22 04:09:50 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:52.771Z] 2021/07/22 04:09:52 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:52.771Z] 2021/07/22 04:09:52 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG Policy revision assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG checking for expected data in data stream...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG found 13 hits in logs-cisco.ios-ep data stream
[2021-07-22T04:09:55.246Z] 2021/07/22 04:09:55 DEBUG reassigning original policy back to agent...
[2021-07-22T04:09:55.246Z] 2021/07/22 04:09:55 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84/reassign
[2021-07-22T04:09:55.246Z] Error: error running package system tests: could not complete test run: failed to tear down runner: error reassigning original policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [f4265efc-f8e0-48f0-8cbf-25ac983eac84]: version conflict, required seqNo [67], primary term [1]. current document has seqNo [68] and primary term [1]"}

script returned exit code 1

As you can see, there are no duplicate calls to /reassign this time. Did you spot anything on the Kibana side? It happened again on the elastic/integrations master branch (CI run).

hop-dev commented 3 years ago

Hi @mtojek, I've been looking into picking this up, but I am struggling to find which tests are causing the issue; I can't seem to find them in the integrations repo. The builds have since been deleted, so it's hard to find more clues.

mtojek commented 3 years ago

Fresh sample from today, Integrations repo, zeek tests:

https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/597/pipeline

hop-dev commented 3 years ago

(Thinking out loud 😄) Looking at the latest failure above, I am struggling to understand exactly what is causing this; I believe the reassign call is the only modification to the agent document during the test. Here is a breakdown of the calls to /reassign before the error, taken from the server logs (kibana.log). It doesn't look to be under much stress:

| Log timestamp | Response code | Response time |
| --- | --- | --- |
| 2021-07-30T08:10:50 | 200 | 1876 ms |
| 2021-07-30T08:11:05 | 200 | 780 ms |
| 2021-07-30T08:11:28 | 200 | 1867 ms |
| 2021-07-30T08:11:40 | 200 | 1845 ms |
| 2021-07-30T08:12:04 | 200 | 1883 ms |
| 2021-07-30T08:12:19 | 200 | 1590 ms |
| 2021-07-30T08:12:43 | 200 | 1860 ms |
| 2021-07-30T08:12:53 | 200 | 1801 ms |
| 2021-07-30T08:13:17 | 200 | 857 ms |
| 2021-07-30T08:13:30 | 200 | 840 ms |
| 2021-07-30T08:13:53 | 200 | 845 ms |
| 2021-07-30T08:14:05 | 200 | 1572 ms |
| 2021-07-30T08:14:27 | 200 | 837 ms |
| 2021-07-30T08:14:40 | 200 | 852 ms |
| 2021-07-30T08:15:03 | 200 | 837 ms |
| 2021-07-30T08:15:15 | 500 | 84 ms |

I can see from the update code that we use the wait_for refresh option.
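
(Side note: as far as I understand, refresh: 'wait_for' only delays the response until the change becomes visible to search; it does not serialize concurrent writers, so it would not prevent a version conflict on its own. Roughly this shape of call, with the index and field names assumed for illustration rather than taken from the Fleet code:)

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

async function updateWithWaitFor(agentId: string) {
  // Sketch only: the response is held back until a refresh makes the change
  // searchable, but a concurrent writer can still cause a 409 on this write.
  await es.update({
    index: '.fleet-agents',                // assumed index name
    id: agentId,
    refresh: 'wait_for',
    body: { doc: { policy_revision: 2 } }, // hypothetical field
  });
}
```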

The test logs do not show any other requests in flight at the time, so I am obviously missing something here, potentially a non-API update to the agent document?

mtojek commented 3 years ago

Another sample today: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/637/pipeline

We don't do any extra calls "under the table". Every call is logged.

hop-dev commented 3 years ago

Just to update where I am with this: I've been running local tests in the background for the past couple of days which constantly reassign the agent policy (the same as the Jenkins tests, but without the data ingestion aspect), and I haven't been able to reproduce it. Something about the CI environment obviously exacerbates this issue.

I am running out of ideas, so I may need pointers from someone else on @elastic/fleet with a bit more experience about what to look into.

mtojek commented 3 years ago

@hop-dev Do you think you could add additional debug log lines to the Kibana source to bring us closer to the root cause?

nchaulet commented 3 years ago

@hop-dev

The test logs do not show any other requests in flight at the time, so I am obviously missing something here, potentially a non-API update to the agent document?

The Fleet agent document is also updated by Fleet Server; I guess the conflict is happening because we are trying to write it at the same time as Fleet Server does.
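
If that is what is happening, it is the classic lost-update race between two read-modify-write sequences. A toy model of the timing (a plain in-memory simulation, not Fleet or Elasticsearch code):

```ts
// Toy in-memory stand-in for Elasticsearch's conditional update, used only
// to illustrate the race between Kibana and Fleet Server.
interface StoredDoc {
  seqNo: number;
  primaryTerm: number;
  policyId: string;
}

const agentDoc: StoredDoc = { seqNo: 67, primaryTerm: 1, policyId: 'policy-a' };

function conditionalWrite(ifSeqNo: number, ifPrimaryTerm: number, policyId: string) {
  if (agentDoc.seqNo !== ifSeqNo || agentDoc.primaryTerm !== ifPrimaryTerm) {
    throw new Error(
      `version_conflict_engine_exception: required seqNo [${ifSeqNo}], ` +
        `current document has seqNo [${agentDoc.seqNo}]`
    );
  }
  agentDoc.policyId = policyId;
  agentDoc.seqNo += 1; // every successful write advances _seq_no
}

// Both writers read the document while it is at seqNo 67...
const kibanaView = { ...agentDoc };
const fleetServerView = { ...agentDoc };

// ...Fleet Server writes first (e.g. persisting a check-in)...
conditionalWrite(fleetServerView.seqNo, fleetServerView.primaryTerm, 'policy-a');

// ...so Kibana's reassign, still holding seqNo 67, is rejected with a 409.
try {
  conditionalWrite(kibanaView.seqNo, kibanaView.primaryTerm, 'policy-b');
} catch (err) {
  console.error((err as Error).message);
}
```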

hop-dev commented 3 years ago

@nchaulet thanks for that! I guess we have the option for Kibana to suppress the error and retry on the caller's behalf, or to continue passing the error back to the client and have them retry (and in this case have the integration tests retry on 409). Do you have an opinion either way? I can't see many places in Kibana where we retry on 409.
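
If we go the retry route, a thin wrapper around the conflicting operation would probably be enough. A rough sketch (not how Fleet handles this today; reassignAgent in the usage comment is a hypothetical helper):

```ts
// Re-run an operation when it fails with an Elasticsearch version conflict
// (HTTP 409), with a small linear backoff. Sketch only.
async function retryOnConflict<T>(op: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err: any) {
      lastError = err;
      if (err?.statusCode !== 409) {
        throw err; // only version conflicts are worth retrying
      }
      // Another writer (e.g. Fleet Server) won the race; wait a bit and retry.
      await new Promise((resolve) => setTimeout(resolve, 250 * attempt));
    }
  }
  throw lastError;
}

// Usage sketch, assuming some function that performs the reassign call:
// await retryOnConflict(() => reassignAgent(agentId, policyId));
```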

endorama commented 2 years ago

At @mtojek's suggestion, I'm adding links to recent build failures where this error showed up:

kpollich commented 11 months ago

Stale. Closing.