Pinging @elastic/fleet (Team:Fleet)
Another occurrence:
[2021-05-25T04:07:21.978Z] Error: error running package system tests: could not complete test run: failed to teardown runner: error reassigning original policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception"}
@jen-huang @ruflin I think this issue requires a higher priority. It would be great if somebody could look into this.
There was a previous report in #97114; I'm not sure if the root cause is the same or different. FYI for investigation.
Today I hit this issue twice in a single build:
[2021-06-23T10:49:42.420Z] Error: error running package system tests: could not complete test run: could not assign policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [911f5746-2072-486c-bcee-f46ad10d130f]: version conflict, required seqNo [8], primary term [1]. current document has seqNo [10] and primary term [1]"}
A small update.
Looking at the Kibana logs, the error comes from a call to /agents/{id}/reassign:
{ ... "url":"http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign","message":"Internal Server Error"}
In the test log we can see the endpoint being called twice:
...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG reassigning original policy back to agent...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG reassigning original policy back to agent...
[2021-06-30T10:54:51.544Z] 2021/06/30 10:54:51 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/121e93a6-e996-4a74-a9b9-6a988db6106c/reassign
...
My guess is that these two calls are causing a race condition that makes the test fail.
I'm going to dig into the test side to see why the two calls are being made. I'd also like to know whether it makes sense to address this on the Kibana side as well (by reporting a better error, or by retrying the request when this error happens). @elastic/fleet WDYT?
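To illustrate the suspected race, here is a minimal sketch (not part of the test suite) that fires two overlapping reassign requests at the same agent, the way the duplicated teardown calls would. The endpoint path comes from the log above; the host, credentials, agent ID, and policy ID are placeholders.

```go
// race_repro.go: fire two concurrent reassign requests at the same agent to
// try to reproduce the suspected version conflict. Host, credentials, agent
// and policy IDs are placeholders; only the endpoint path comes from the logs.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

func reassign(client *http.Client, agentID, policyID string) (int, error) {
	body := bytes.NewBufferString(fmt.Sprintf(`{"policy_id":%q}`, policyID))
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:5601/api/fleet/agents/"+agentID+"/reassign", body)
	if err != nil {
		return 0, err
	}
	req.SetBasicAuth("elastic", "changeme") // assumed local stack credentials
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "elastic-package")
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	client := &http.Client{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two overlapping calls, as seen in the test log
		wg.Add(1)
		go func() {
			defer wg.Done()
			code, err := reassign(client,
				"121e93a6-e996-4a74-a9b9-6a988db6106c", "<original-policy-id>")
			fmt.Println("reassign status:", code, "err:", err)
		}()
	}
	wg.Wait()
}
```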
We discussed it offline with @afgomez and reached the following conclusion: the duplicate calls to /reassign are caused by a bug in elastic-package. Potentially it's fixed with https://github.com/elastic/elastic-package/commit/7ca3684a19090de1407b4f90c8fbc7bd8d61ba58.
Hi @afgomez @jen-huang,
It seems that the issue is not fully solved: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/571/pipeline/
[2021-07-22T04:09:39.580Z] 2021/07/22 04:09:39 DEBUG GET http://127.0.0.1:5601/api/fleet/agents
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG filter agents using criteria: NamePrefix=docker-fleet-agent
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG found 1 enrolled agent(s)
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG creating test policy...
[2021-07-22T04:09:39.840Z] 2021/07/22 04:09:39 DEBUG POST http://127.0.0.1:5601/api/fleet/agent_policies
[2021-07-22T04:09:43.130Z] 2021/07/22 04:09:42 DEBUG adding package data stream to test policy...
[2021-07-22T04:09:43.130Z] 2021/07/22 04:09:42 DEBUG POST http://127.0.0.1:5601/api/fleet/package_policies
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG deleting old data in data stream...
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG found 0 hits in logs-cisco.ios-ep data stream
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG GET http://127.0.0.1:5601/api/fleet/agent_policies/a2746270-eaa2-11eb-9a9b-775da073752c
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG assigning package data stream to agent...
[2021-07-22T04:09:45.037Z] 2021/07/22 04:09:44 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84/reassign
[2021-07-22T04:09:46.421Z] 2021/07/22 04:09:46 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:46.421Z] 2021/07/22 04:09:46 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:48.327Z] 2021/07/22 04:09:48 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:48.328Z] 2021/07/22 04:09:48 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:50.232Z] 2021/07/22 04:09:50 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:50.232Z] 2021/07/22 04:09:50 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:52.771Z] 2021/07/22 04:09:52 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:52.771Z] 2021/07/22 04:09:52 DEBUG Wait until the policy (ID: a2746270-eaa2-11eb-9a9b-775da073752c, revision: 2) is assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG GET http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG Policy revision assigned to the agent (ID: f4265efc-f8e0-48f0-8cbf-25ac983eac84)...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG checking for expected data in data stream...
[2021-07-22T04:09:54.677Z] 2021/07/22 04:09:54 DEBUG found 13 hits in logs-cisco.ios-ep data stream
[2021-07-22T04:09:55.246Z] 2021/07/22 04:09:55 DEBUG reassigning original policy back to agent...
[2021-07-22T04:09:55.246Z] 2021/07/22 04:09:55 DEBUG PUT http://127.0.0.1:5601/api/fleet/agents/f4265efc-f8e0-48f0-8cbf-25ac983eac84/reassign
[2021-07-22T04:09:55.246Z] Error: error running package system tests: could not complete test run: failed to tear down runner: error reassigning original policy to agent: could not assign policy to agent; API status code = 500; response body = {"statusCode":500,"error":"Internal Server Error","message":"version_conflict_engine_exception: [version_conflict_engine_exception] Reason: [f4265efc-f8e0-48f0-8cbf-25ac983eac84]: version conflict, required seqNo [67], primary term [1]. current document has seqNo [68] and primary term [1]"}
script returned exit code 1
As you can see, there are no double calls to /reassign. Did you spot anything on the Kibana side? It happened again for the elastic/integrations master (CI run).
Hi @mtojek, I've been having a look at picking this up, but I'm struggling to find which tests are causing the issue; I can't seem to find them in the integrations repo. The builds have since been deleted, so it's hard to find more clues.
Fresh sample from today, Integrations repo, zeek tests:
(thinking out loud 😄) Looking at the latest failure above, I'm struggling to understand exactly what is causing this; I believe the reassign call is the only modification to the agent document during the test. Here is a breakdown of the calls to /reassign before the error, taken from the server logs (kibana.log). It doesn't look to be under much stress:
Log timestamp | Response code | Response time |
---|---|---|
2021-07-30T08:10:50 | 200 | 1876 ms |
2021-07-30T08:11:05 | 200 | 780 ms |
2021-07-30T08:11:28 | 200 | 1867 ms |
2021-07-30T08:11:40 | 200 | 1845 ms |
2021-07-30T08:12:04 | 200 | 1883 ms |
2021-07-30T08:12:19 | 200 | 1590 ms |
2021-07-30T08:12:43 | 200 | 1860 ms |
2021-07-30T08:12:53 | 200 | 1801 ms |
2021-07-30T08:13:17 | 200 | 857 ms |
2021-07-30T08:13:30 | 200 | 840 ms |
2021-07-30T08:13:53 | 200 | 845 ms |
2021-07-30T08:14:05 | 200 | 1572 ms |
2021-07-30T08:14:27 | 200 | 837 ms |
2021-07-30T08:14:40 | 200 | 852 ms |
2021-07-30T08:15:03 | 200 | 837 ms |
2021-07-30T08:15:15 | 500 | 84 ms |
I can see from the update code that we use wait_for refresh.
The test logs do not show any other requests in flight at the time. I am obviously missing something here; potentially a non-API update to the agent document?
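For context on where the version conflict comes from: the agent document update is an optimistic-concurrency write, so any change that lands between the read and the update is rejected by Elasticsearch. A rough sketch of the pattern against Elasticsearch directly (the real update happens inside Kibana's Fleet plugin; the host, missing auth, index name, and document fields here are assumptions):

```go
// conflict_sketch.go: demonstrate how an if_seq_no/if_primary_term update fails
// with version_conflict_engine_exception when the document changes in between.
// Host, lack of auth, the ".fleet-agents" index name and field names are assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	const es = "http://127.0.0.1:9200"
	const agentID = "f4265efc-f8e0-48f0-8cbf-25ac983eac84"

	// 1. Read the document and remember its seq_no / primary_term.
	resp, err := http.Get(es + "/.fleet-agents/_doc/" + agentID)
	if err != nil {
		panic(err)
	}
	var got struct {
		SeqNo       int `json:"_seq_no"`
		PrimaryTerm int `json:"_primary_term"`
	}
	json.NewDecoder(resp.Body).Decode(&got)
	resp.Body.Close()

	// 2. Update with the observed seq_no/primary_term and refresh=wait_for.
	//    If anything else (e.g. Fleet Server checking in the agent) wrote the
	//    document after step 1, Elasticsearch rejects this with a 409
	//    version_conflict_engine_exception, which Kibana surfaces as a 500.
	url := fmt.Sprintf("%s/.fleet-agents/_update/%s?if_seq_no=%d&if_primary_term=%d&refresh=wait_for",
		es, agentID, got.SeqNo, got.PrimaryTerm)
	body := bytes.NewBufferString(`{"doc":{"policy_id":"<new-policy-id>"}}`)
	req, err := http.NewRequest(http.MethodPost, url, body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer res.Body.Close()
	out, _ := io.ReadAll(res.Body)
	fmt.Println(res.StatusCode, string(out))
}
```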
Another sample today: https://beats-ci.elastic.co/blue/organizations/jenkins/Ingest-manager%2Fintegrations/detail/master/637/pipeline
We don't do any extra calls "under the table". Every call is logged.
Just to update where I am with this: I've been running local tests in the background for the past couple of days which constantly reassign the agent policy (the same as the Jenkins tests, but without the data ingestion aspect), and I haven't been able to reproduce it. Something about the CI environment obviously exacerbates this issue.
I am running out of ideas; I may need pointers from someone else on @elastic/fleet with a bit more experience about what to look into.
@hop-dev Do you think you could add additional debug log lines to the Kibana source to bring us closer to the root cause?
@hop-dev
The test logs do not show any other requests in flight at the time. I am obviously missing something here; potentially a non-API update to the agent document?
The Fleet agent document is also updated by Fleet Server; I guess the conflict is happening because we are trying to write it at the same time as Fleet Server.
@nchaulet thanks for that! I guess we have the option for Kibana to suppress the error and retry on the caller's behalf, or to continue to pass the error back to the client and have them retry (in this case, have the integration tests retry on 409). Do you have an opinion either way? I can't see many places in Kibana where we retry on 409.
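If we go the client-retry route, here is a hedged sketch of what a retry wrapper in the test client might look like (retryReassign and doReassign are hypothetical names; today the conflict surfaces as a 500 with version_conflict_engine_exception in the body, so the sketch checks for both that and a plain 409):

```go
// retryReassign is a hypothetical helper for the elastic-package test client:
// it retries the reassign call when the response indicates a version conflict.
// doReassign is assumed to perform the PUT /api/fleet/agents/{id}/reassign
// request and return the status code and response body.
package fleettest

import (
	"fmt"
	"strings"
	"time"
)

func retryReassign(doReassign func() (int, string, error)) error {
	const maxAttempts = 5
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status, body, err := doReassign()
		if err != nil {
			return err
		}
		if status < 300 {
			return nil // reassigned successfully
		}
		conflict := status == 409 ||
			(status == 500 && strings.Contains(body, "version_conflict_engine_exception"))
		if !conflict {
			return fmt.Errorf("could not assign policy to agent; API status code = %d; response body = %s", status, body)
		}
		// Back off briefly so Fleet Server's own update of the agent document can land.
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return fmt.Errorf("could not assign policy to agent after %d attempts (version conflict)", maxAttempts)
}
```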
At @mtojek's suggestion, I'm adding links to recent build failures where this error showed up:
Stale. Closing.
Spotted in: https://github.com/elastic/integrations/pull/987
I found this error after a failed CI test run:
Source: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/PR-987/4/artifact/build/elastic-stack-dump/latest/zeek/logs/kibana.log
Elasticsearch: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/PR-987/4/artifact/build/elastic-stack-dump/latest/zeek/logs/elasticsearch.log