elastic/kibana
https://www.elastic.co/products/kibana

Failing test: X-Pack Detection Engine API Integration Tests.x-pack/test/detection_engine_api_integration/security_and_spaces/tests/exception_operators_data_types/ip·ts - detection engine api security and spaces enabled Detection exceptions data types and operators Rule exception operators for data type ip "is in list" operator will return 2 results if we have a list that includes 2 ips #84724

Closed: kibanamachine closed this issue 3 years ago

kibanamachine commented 3 years ago

A test failed on a tracked branch

Error: timed out waiting for function condition to be true within waitForRuleSuccess
    at /dev/shm/workspace/parallel/14/kibana/x-pack/test/detection_engine_api_integration/utils.ts:708:9

First failure: Jenkins Build

FrankHassanabad commented 3 years ago

Looking at the error message I see this:

[00:33:04]                   │ proc [kibana]   log   [02:56:49.580] [info][eventLog][plugins] event logged: {"@timestamp":"2020-12-02T02:56:49.385Z","event":{"provider":"alerting","action":"execute","start":"2020-12-02T02:56:49.385Z","outcome":"failure","end":"2020-12-02T02:56:49.579Z","duration":194000000,"reason":"execute"},"kibana":{"saved_objects":[{"rel":"primary","type":"alert","id":"70b24216-2d74-44bf-8e51-22fdf27a3cdf"}],"alerting":{"status":"error"},"server_uuid":"5b2de169-2785-441b-ae8c-186a1936b17d"},"message":"alert execution failure: siem.signals:70b24216-2d74-44bf-8e51-22fdf27a3cdf: 'Signal Testing Query'","error":{"message":"[siem-detection-engine-rule-status:05cf0650-344a-11eb-9427-d156bf25f2fc]: version conflict, required seqNo [8374], primary term [1]. current document has seqNo [8375] and primary term [1]: version_conflict_engine_exception"},"ecs":{"version":"1.6.0"}}
[00:33:05]                   │ proc [kibana]   log   [02:56:50.569] [error][alerting][alerts][plugins][plugins] Executing Alert "70b24216-2d74-44bf-8e51-22fdf27a3cdf" has resulted in Error: [siem-detection-engine-rule-status:05cf0650-344a-11eb-9427-d156bf25f2fc]: version conflict, required seqNo [8374], primary term [1]. current document has seqNo [8375] and primary term [1]: version_conflict_engine_exception
[00:33:38]                   └- ✖ fail: detection engine api security and spaces enabled Detection exceptions data types and operators Rule exception operators for data type ip "is in list" operator will return 2 results if we have a list that includes 2 ips
[00:33:38]                   │      Error: timed out waiting for function condition to be true within waitForRuleSuccess
[00:33:38]                   │       at /dev/shm/workspace/parallel/14/kibana/x-pack/test/detection_engine_api_integration/utils.ts:708:9
[00:33:38]                   │ 
[00:33:38]                   │

Expanding and formatting that error from the alerting engine, I see the following. It looks like the error is being thrown out of the executor, and the alerting framework is the one catching it:

{
   "@timestamp":"2020-12-02T02:56:49.385Z",
   "event":{
      "provider":"alerting",
      "action":"execute",
      "start":"2020-12-02T02:56:49.385Z",
      "outcome":"failure",
      "end":"2020-12-02T02:56:49.579Z",
      "duration":194000000,
      "reason":"execute"
   },
   "kibana":{
      "saved_objects":[
         {
            "rel":"primary",
            "type":"alert",
            "id":"70b24216-2d74-44bf-8e51-22fdf27a3cdf"
         }
      ],
      "alerting":{
         "status":"error"
      },
      "server_uuid":"5b2de169-2785-441b-ae8c-186a1936b17d"
   },
   "message":"alert execution failure: siem.signals:70b24216-2d74-44bf-8e51-22fdf27a3cdf: 'Signal Testing Query'",
   "error":{
      "message":"[siem-detection-engine-rule-status:05cf0650-344a-11eb-9427-d156bf25f2fc]: version conflict, required seqNo [8374], primary term [1]. current document has seqNo [8375] and primary term [1]: version_conflict_engine_exception"
   },
   "ecs":{
      "version":"1.6.0"
   }
}
FrankHassanabad commented 3 years ago

If this is outside of our try/catch block and the alerting framework is the one catching it, it very well could be from a line such as this one: https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts#L163

Where we are going to run:

await ruleStatusService.goingToRun();

Or one of these others: https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts#L185

https://github.com/elastic/kibana/blob/master/x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts#L629

Where we do:

await ruleStatusService.error(gapMessage, { gap: gapString });
await ruleStatusService.error(message, {
  bulkCreateTimeDurations: result.bulkCreateTimes,
  searchAfterTimeDurations: result.searchAfterTimes,
  lastLookBackDate: result.lastLookBackDate?.toISOString(),
});

However, here is what I'm wondering: updating a single rule's status should be linear, but in the tests we are deleting and re-creating the same rule with the same rule_id. That points to a race condition where one rule executor is still running and updating its status even though its rule has actually been deleted, while the second rule is trying to update the same status document, and then we get a conflict.

If so, that is a valid bug: a rule executor can blow up on a conflict because a previously deleted rule with the same rule_id is still running.

The other possibility is that the rule has "updated" and returned from the SO, but in reality the SO is still processing the change, and when it goes to do a second update it hits a conflict. If that is true, it would be a more severe bug, as it could cause rule executions to suddenly stop operating.
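Either hypothesis comes down to Elasticsearch-style optimistic concurrency: whoever writes with a stale seqNo loses. Here is a minimal, self-contained sketch (not Kibana code; `TinyStore` and its API are invented for illustration) of how two racing writers produce exactly the version_conflict_engine_exception seen in the log:

```typescript
// Toy in-memory store that enforces ES-style optimistic concurrency with
// seqNo/primaryTerm. A writer holding a stale seqNo gets a version conflict,
// which is the suspected failure mode: two executors racing to update the
// same rule-status document.
interface StoredDoc {
  seqNo: number;
  primaryTerm: number;
  attributes: Record<string, unknown>;
}

class TinyStore {
  private docs = new Map<string, StoredDoc>();

  index(id: string, attributes: Record<string, unknown>): StoredDoc {
    const doc = { seqNo: 0, primaryTerm: 1, attributes };
    this.docs.set(id, doc);
    return doc;
  }

  // Update succeeds only if the caller's seqNo matches the current one,
  // mirroring the ES if_seq_no / if_primary_term checks.
  update(id: string, attributes: Record<string, unknown>, ifSeqNo: number): StoredDoc {
    const current = this.docs.get(id);
    if (!current) throw new Error(`document [${id}] not found`);
    if (current.seqNo !== ifSeqNo) {
      throw new Error(
        `[${id}]: version conflict, required seqNo [${ifSeqNo}], ` +
          `current document has seqNo [${current.seqNo}]: version_conflict_engine_exception`
      );
    }
    const next = { seqNo: current.seqNo + 1, primaryTerm: current.primaryTerm, attributes };
    this.docs.set(id, next);
    return next;
  }
}

// Two "executors" read the same seqNo, then both try to update: the second
// write is rejected just like the failure above.
const store = new TinyStore();
const initial = store.index('rule-status-1', { status: 'going to run' });
store.update('rule-status-1', { status: 'succeeded' }, initial.seqNo); // wins
let conflict = '';
try {
  store.update('rule-status-1', { status: 'failed' }, initial.seqNo); // stale seqNo
} catch (e) {
  conflict = (e as Error).message;
}
```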

It looks like we don't specify concurrency control options in the SO updates, but there does look to be a possibility of a document's version being used regardless: https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/service/lib/repository.ts#L1808-L1809

But when I put console.log() statements around there, I did not see concurrency control being activated, which leaves the mystery of how that version conflict is happening if we aren't sending it. It might be another code path I'm not seeing (or more recent code I don't have).

If we keep seeing this and it's still a mystery, we could probably reduce the chances of it occurring in the tests by removing the shared rule_id and using an auto-generated one here: https://github.com/elastic/kibana/blob/master/x-pack/test/detection_engine_api_integration/utils.ts#L105

So we could just change that line and make a commit, and the chances of conflicts would hopefully drop.
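As a sketch of that test-side change (hedged: `makeSimpleRule` is a hypothetical stand-in for the real helper in utils.ts, not the actual code), generating the rule_id per call instead of sharing a constant would look like:

```typescript
// Each test gets its own rule_id unless it explicitly asks for one, so a
// still-running executor from a deleted rule cannot collide with a fresh rule
// on the same status documents.
import { randomUUID } from 'node:crypto';

interface SimpleRule {
  rule_id: string;
  name: string;
  query: string;
}

const makeSimpleRule = (ruleId?: string): SimpleRule => ({
  rule_id: ruleId ?? randomUUID(), // auto-generate instead of a shared constant
  name: 'Signal Testing Query',
  query: '*:*',
});

const a = makeSimpleRule();
const b = makeSimpleRule();
```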

FrankHassanabad commented 3 years ago

Looking closer at the code, we also don't send down a wait_for, which the SO's do support. So even though we are not explicitly asking for OCC, we could potentially have race conditions where, within the same rule, the status updates happen out of order.
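If the concern is same-rule updates landing out of order, one common pattern (a sketch only, not what the alerting framework or ruleStatusService actually does) is to serialize status writes per rule id on a promise queue:

```typescript
// Serialize saved-object status updates per key by chaining them on a promise
// queue, so "going to run" can never land after "succeeded" even if callers
// don't await each other.
type Task<T> = () => Promise<T>;

class PerKeyQueue {
  private tails = new Map<string, Promise<unknown>>();

  enqueue<T>(key: string, task: Task<T>): Promise<T> {
    const tail = this.tails.get(key) ?? Promise.resolve();
    // Run this task only after every previously enqueued task for the key,
    // whether the previous one resolved or rejected.
    const next = tail.then(task, task);
    this.tails.set(key, next);
    return next;
  }
}

// Demo: kick off two updates without awaiting the first; they still apply in order.
const applied: string[] = [];
const queue = new PerKeyQueue();
const delay = (ms: number) => new Promise((r) => setTimeout(r, ms));

const demo = async (): Promise<string[]> => {
  const first = queue.enqueue('rule-1', async () => {
    await delay(20); // slow write
    applied.push('going to run');
  });
  const second = queue.enqueue('rule-1', async () => {
    applied.push('succeeded'); // fast write, but queued behind the first
  });
  await Promise.all([first, second]);
  return applied;
};
```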

FrankHassanabad commented 3 years ago

Ok, so this does look like a race condition of some sort, and it looks like I can reproduce it now by adding more of a delay around the saved objects.

To reproduce it, you have to modify this code block: https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/service/lib/repository.ts#L1020

with this ugly little delay. Note that TypeScript won't be happy with it, but that's fine since this is only to make the problem reproducible.

Change it like so:

  // ... other code ...
  async update<T = unknown>(
    type: string,
    id: string,
    attributes: Partial<T>,
    options: SavedObjectsUpdateOptions = {}
  ): Promise<SavedObjectsUpdateResponse<T>> {
    // HACK (reproduction only): delay the first 'going to run' status update by
    // 5 seconds so it lands after later updates, widening the race window.
    // Neither `attributes.status` nor `this.yield` is typed, hence the TS errors.
    if (attributes.status === 'going to run' && this.yield !== true) {
      console.log('FOUND GOING TO YIELD FOR 5 seconds');
      setTimeout(() => {
        this.yield = true;
        this.update(type, id, attributes, options);
      }, 5000);
    } else if (attributes.status === 'going to run' && this.yield === true) {
      console.log('YIELDED ONCE');
    }
  // ... other code ...

Then run our tests like so:

node scripts/functional_tests --config x-pack/test/detection_engine_api_integration/security_and_spaces/config.ts --include x-pack/test/detection_engine_api_integration/security_and_spaces/tests/exception_operators_data_types/ip.ts

And eventually you should see a promise rejection like so:

"}] in policy [.items-default]
                   │ info [o.e.x.i.IndexLifecycleTransition] [MacBook-Pro.local] moving index [.items-default-000001] from [{"phase":"hot","action":"unfollow","name":"wait-for-indexing-complete"}] to [{"phase":"hot","action":"unfollow","name":"wait-for-follow-shard-tasks"}] in policy [.items-default]
                   │ info [rule_exceptions/ip] Loading "mappings.json"
                   │ info [rule_exceptions/ip] Loading "data.json"
                   │ info [o.e.c.m.MetadataCreateIndexService] [MacBook-Pro.local] [ip] creating index, cause [api], templates [], shards [1]/[1]
                   │ info [rule_exceptions/ip] Created index "ip"
                   │ info [rule_exceptions/ip] Indexed 4 docs into "ip"
                 │ proc [kibana] FOUND GOING TO YIELD FOR 5 seconds
                 │ proc [kibana] FOUND GOING TO YIELD FOR 5 seconds
                 │ proc [kibana]   log   [11:57:13.067] [info][eventLog][plugins] event logged: {"@timestamp":"2020-12-02T18:57:10.622Z","event":{"provider":"alerting","action":"execute","start":"2020-12-02T18:57:10.622Z","outcome":"success","end":"2020-12-02T18:57:13.066Z","duration":2444000000},"kibana":{"saved_objects":[{"rel":"primary","type":"alert","id":"53d99201-5c2c-445a-9537-d317fd98547a"}],"alerting":{"status":"ok"},"server_uuid":"5b2de169-2785-441b-ae8c-186a1936b17d"},"message":"alert executed: siem.signals:53d99201-5c2c-445a-9537-d317fd98547a: 'Signal Testing Query'","ecs":{"version":"1.6.0"}}
                 └- ✓ pass  (10.3s) "detection engine api security and spaces enabled Detection exceptions data types and operators Rule exception operators for data type ip "is" operator should filter 4 ips if all are set as exceptions"
               └-> "after each" hook
                 │ info [o.e.c.m.MetadataDeleteIndexService] [MacBook-Pro.local] [.siem-signals-default-000001/8yhO8SZqSmqB7nukmCUbwA] deleting index
                 │ info [o.e.c.m.MetadataIndexTemplateService] [MacBook-Pro.local] removing template [.siem-signals-default]
                 │ proc [kibana] YIELDED ONCE
                 │ proc [kibana] YIELDED ONCE
                 │ proc [kibana]   log   [11:57:16.038] [error][data][elasticsearch] [version_conflict_engine_exception]: [siem-detection-engine-rule-status:2eb83c40-34d0-11eb-8593-3912766b9ed7]: version conflict, required seqNo [106], primary term [1]. current document has seqNo [111] and primary term [1]
                 │ proc [kibana] Unhandled Promise rejection detected:
                 │ proc [kibana] 
                 │ proc [kibana] ResponseError: [siem-detection-engine-rule-status:2eb83c40-34d0-11eb-8593-3912766b9ed7]: version conflict, required seqNo [106], primary term [1]. current document has seqNo [111] and primary term [1]: version_conflict_engine_exception
                 │ proc [kibana]     at IncomingMessage.<anonymous> (/Users/frankhassanabad/projects/kibana/node_modules/@elastic/elasticsearch/lib/Transport.js:272:25)
                 │ proc [kibana]     at IncomingMessage.emit (events.js:326:22)
                 │ proc [kibana]     at endReadableNT (_stream_readable.js:1223:12)
                 │ proc [kibana]     at processTicksAndRejections (internal/process/task_queues.js:84:21) {
                 │ proc [kibana]   meta: {
                 │ proc [kibana]     body: { error: [Object], status: 409 },
                 │ proc [kibana]     statusCode: 409,
                 │ proc [kibana]     headers: {
                 │ proc [kibana]       'content-type': 'application/json; charset=UTF-8',
                 │ proc [kibana]       'content-length': '661'
                 │ proc [kibana]     },
                 │ proc [kibana]     meta: {
                 │ proc [kibana]       context: null,
                 │ proc [kibana]       request: [Object],
                 │ proc [kibana]       name: 'elasticsearch-js',
                 │ proc [kibana]       connection: [Object],
                 │ proc [kibana]       attempts: 0,
                 │ proc [kibana]       aborted: false
                 │ proc [kibana]     }
                 │ proc [kibana]   },
                 │ proc [kibana]   isBoom: true,
                 │ proc [kibana]   isServer: false,
                 │ proc [kibana]   data: null,
                 │ proc [kibana]   output: {
                 │ proc [kibana]     statusCode: 409,
                 │ proc [kibana]     payload: {
                 │ proc [kibana]       message: '[siem-detection-engine-rule-status:2eb83c40-34d0-11eb-8593-3912766b9ed7]: version conflict, required seqNo [106], primary term [1]. current document has seqNo [111] and primary term [1]: version_conflict_engine_exception',
                 │ proc [kibana]       statusCode: 409,
                 │ proc [kibana]       error: 'Conflict'
                 │ proc [kibana]     },
                 │ proc [kibana]     headers: {}
                 │ proc [kibana]   },
                 │ proc [kibana]   [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/conflict'
                 │ proc [kibana] }
                 │ proc [kibana] 
                 │ proc [kibana] Terminating process...
rudolf commented 3 years ago

> We don't specify concurrency control options within the SO's it looks like for the updates but there does look like a possibility of a document being used regardless: https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/service/lib/repository.ts#L1808-L1809
>
> But when I put console.log() statements around there I did not see concurrency control being activated which leads to the mystery of how that version conflict is happening if we aren't sending it. Might be another code path I'm not seeing (or more recent code I don't have).

This sounds like you're hitting a version conflict error that happens internally within Elasticsearch: https://github.com/elastic/kibana/issues/82719#issuecomment-724643442

FrankHassanabad commented 3 years ago

> This sounds like you're hitting a version conflict error which happens internally to Elasticsearch #82719 (comment)

I thought so too, and added console.log statements around there, but I didn't see it returning the version. I could be missing it from an earlier rule run, or something else could be "getting in the way", so to speak. I was pretty thorough about it, as I really thought that was the issue as well when looking through the code.

From #82719 (comment):

> This makes me think retry_on_conflict should be the default behaviour when using SavedObjectsClient#update without a version parameter.

If we had a retry flag to push down, or that default behavior, it would fix this for sure and we wouldn't have to do a workaround on our side to prevent it from happening. I am ++ on that!
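For illustration, the retry-on-conflict behavior being proposed could look roughly like this (a hedged sketch; `updateWithRetry` and the error shape are invented here, mirroring the 409 statusCode on the elasticsearch-js ResponseError in the log above):

```typescript
// Wrap an update call and retry a bounded number of times when it fails with
// a 409 version conflict; any other error propagates immediately.
interface ConflictError extends Error {
  statusCode?: number;
}

const isConflict = (e: unknown): boolean =>
  e instanceof Error && (e as ConflictError).statusCode === 409;

async function updateWithRetry<T>(doUpdate: () => Promise<T>, maxRetries = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await doUpdate();
    } catch (e) {
      if (!isConflict(e)) throw e; // only conflicts are retryable
      lastError = e;
    }
  }
  throw lastError;
}

// Demo: an update that conflicts twice before succeeding.
let attempts = 0;
const flaky = async (): Promise<string> => {
  attempts++;
  if (attempts < 3) {
    const err: ConflictError = new Error('version_conflict_engine_exception');
    err.statusCode = 409;
    throw err;
  }
  return 'updated';
};
```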