elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Duplicated Google Workspace log entries by Filebeat #39859

Open rlevytskyi opened 3 weeks ago

rlevytskyi commented 3 weeks ago

Many months ago, we noticed that some Google Workspace logs received by Filebeat were getting duplicated.

I searched the internet for a possible cause and found a similar issue on Elastic Discuss, "Google Workspace module using wrong field to avoid duplicates", which says that "json.id.time", "json.id.uniqueQualifier", "json.id.applicationName", and "json.id.customerId" are used to generate the _id.
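As I understand it, the mechanism is roughly the following (a hypothetical ingest pipeline sketch, not necessarily the module's actual one): the four fields are hashed into the document _id, so a re-delivered copy of an event writes to the same _id instead of creating a duplicate:

{
  "processors": [
    {
      "fingerprint": {
        "fields": [
          "json.id.time",
          "json.id.uniqueQualifier",
          "json.id.applicationName",
          "json.id.customerId"
        ],
        "target_field": "_id",
        "method": "SHA-1"
      }
    }
  ]
}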

After updating Filebeat to the most recent version at the time (8.10.2, run from docker.elastic.co/beats/filebeat:8.10.2), I found that the issue persists: the same event still appears with different _id values.

I also posted this issue on the forum: https://discuss.elastic.co/t/duplicated-google-workspace-log-entries-by-filebeat/344374

Several days ago I upgraded Filebeat to 8.14.0, and it didn't help.

Here are two examples. First:

{
  "_index": "google_ws-2023.10.04",
  "_id": "4rrh-YoBiE7xzynem1Cm",
  "_source": { 
    "json": {
      "id": {
        "time": "2023-10-04T08:50:13.677Z"
      },
      "etag": "\"rQ3qpTrpjMqlOD9Fi6ZCgnpo6zAdUtM4Y4wU0J6c8Yw/UiNqGB-f4anaOLIVD9ya9Z-pAP0\"",
      "events": {},
      "actor": {}
    },
    "event": {
      "id": "-8909398197392254316",
      "created": "2023-10-04T08:50:25.347Z",
      "original": "{\"id\":{\"applicationName\":\"drive\",\"customerId\":\"C00hvn0vt\",\"time\":\"2023-10-04T08:50:13.677Z\",\"uniqueQualifier\":\"-8909398197392254316\"}"}"
    },
    "@timestamp": "2023-10-04T08:50:13.677Z",
  },
}

Second:

{
  "_index": "google_ws-2023.10.04",
  "_id": "QePm-YoBq7bjVLXLMFU_",
  "_source": {
    "json": {
      "id": {
        "time": "2023-10-04T08:50:13.677Z"
      },
      "etag": "\"rQ3qpTrpjMqlOD9Fi6ZCgnpo6zAdUtM4Y4wU0J6c8Yw/UiNqGB-f4anaOLIVD9ya9Z-pAP0\"",
      "events": {},
      "actor": {}
    },
    "event": {
      "created": "2023-10-04T08:55:25.376Z",
      "original": "{\"id\":{\"applicationName\":\"drive\",\"customerId\":\"C00hvn0vt\",\"time\":\"2023-10-04T08:50:13.677Z\",\"uniqueQualifier\":\"-8909398197392254316\"}"}"

    },
    "@timestamp": "2023-10-04T08:50:13.677Z",
  }
}

That is, there are no uniqueQualifier, applicationName, or customerId fields under the "json.id" key, as there are supposed to be, while they all still exist inside the "event.original" string.
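For reference, duplicate candidates can be surfaced with a terms aggregation on a field both copies share, e.g. json.id.time (just a sketch; it assumes the field is aggregatable in your mapping, and documents that merely share a timestamp still need manual inspection):

GET google_ws-2023.10.04/_search
{
  "size": 0,
  "aggs": {
    "duplicate_candidates": {
      "terms": {
        "field": "json.id.time",
        "min_doc_count": 2
      }
    }
  }
}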

So could you please tell how this can be fixed?

elasticmachine commented 3 weeks ago

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

ShourieG commented 1 week ago

Hi @rlevytskyi, the deduplication fix for this was merged quite a while back in this PR, based on this feedback. The unique ID for deduplication is no longer in the json.id object; it is a fingerprint that lives in the document's _id field. The main issue here seems to be that unique _ids are being generated for duplicate events, which is odd. Could you confirm whether this is the case?
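To illustrate the expected behavior (with a placeholder fingerprint value): when both copies of an event hash to the same fingerprint, the second write targets the same _id and is rejected rather than indexed as a new document:

PUT google_ws-2023.10.04/_create/<fingerprint>
{ ...event... }
-> 201 Created

PUT google_ws-2023.10.04/_create/<fingerprint>
{ ...event... }
-> 409 version_conflict_engine_exception

Different _ids for the same event mean the fingerprint inputs must have differed between the two deliveries.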

rlevytskyi commented 1 week ago

Yes, I can confirm that's the case: different _id values for the same event.

ShourieG commented 19 hours ago

After investigating this and similar issues, we've observed the following:

  1. Duplication issues were significantly reduced, and for the most part fixed, after this PR was merged.

  2. Duplication issues seem to be more prevalent with the workspace module than with the workspace integration.

  3. The Google Workspace module uses a fingerprint processor that does not canonically order the event object's keys; this was recently fixed in this PR and should help reduce duplication going forward (see the illustration after this list).

  4. The duplication described in this current issue seems to stem from factors outside our control: the involvement of Logstash, or some problem that is preventing the ingest pipeline from working as expected, as the contents of the "_source" object in the resulting documents suggest that the pipeline did not clean them up correctly. Also, fields inside "_source" are missing (for instance, the second example above lacks event.id), which leads to different fingerprints for the same document.
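To illustrate point 3 with hypothetical values: a fingerprint computed over the raw serialization of an object depends on key order, so these two logically identical objects hash differently unless their keys are canonically sorted before hashing:

{"id": {"applicationName": "drive", "uniqueQualifier": "-8909398197392254316"}}

{"id": {"uniqueQualifier": "-8909398197392254316", "applicationName": "drive"}}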

We are keeping this issue open to see if the duplication persists after the recent PR fix, and we will also soon introduce an enhancement that adds conditional canonical sorting of keys to the fingerprint processor. cc: @narph

rlevytskyi commented 18 hours ago

Thank you! I'll test once it's merged into a new version.