elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Fingerprint Processor Unexpected Results #98339

Open neu5ron opened 1 year ago

neu5ron commented 1 year ago

Elasticsearch Version

8.9.0, tested also on 8.5 and 8.6

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

When using the fingerprint processor, the output does not match the hash that the selected method produces for the same input. For example, using the method MD5 and the value a:

Expected:
hex: 0cc175b9c0f1b6a831c399e269772661
base64: DMF1ucDxtqgxw5niaXcmYQ==

Fingerprint Processor:
hex: 7687355dbc955b0074758acb4d5f9a
base64: dg91NXbylVsAdHWKy01fpg==
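For anyone who wants to check the expected values independently, here is a minimal Python sketch using only the standard library. The helper names `md5_hex_and_base64` and `base64_to_hex` are hypothetical, and UTF-8 encoding of the input is an assumption. Since the processor returns base64, the second helper converts a base64 digest back to hex for comparison.

```python
import base64
import hashlib
from typing import Tuple


def md5_hex_and_base64(value: str) -> Tuple[str, str]:
    """Return the plain MD5 digest of `value` as (hex, base64).

    Assumption: the input is encoded as UTF-8 before hashing.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return digest.hex(), base64.b64encode(digest).decode("ascii")


def base64_to_hex(encoded: str) -> str:
    """Convert a base64-encoded digest to its hex representation."""
    return base64.b64decode(encoded).hex()


hex_digest, b64_digest = md5_hex_and_base64("a")
print(hex_digest)                 # plain MD5 of "a" in hex
print(b64_digest)                 # the same digest in base64
print(base64_to_hex(b64_digest))  # round trip back to hex
```

The same `base64_to_hex` helper can be applied to the `test` field returned by `_simulate` to compare it with a hex digest from another tool.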

Steps to Reproduce

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "fingerprint": {
          "fields": ["a"],
          "method": "MD5",
          "target_field": "test"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
          "a": "a"
      }
    }
  ]
}

Logs (if relevant)

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "test": "dg91NXbylVsAdHWKy01fpg==",
          "a": "a"
        },
        "_ingest": {
          "timestamp": "2023-08-10T06:19:10.716599155Z"
        }
      }
    }
  ]
}

dreamquster commented 1 year ago

The response of ES is right. It does not simply calculate the MD5 of 'a'; it concatenates all values of 'fields' with a delimiter byte of '0'. So the result is more like this function: Base64(MD5(join(0, values of fields)))
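As a rough illustration of why the two digests differ, the sketch below feeds a zero byte into the hash before each value and base64-encodes the result. This is only a sketch under stated assumptions, not the actual Elasticsearch implementation: the function name `delimited_md5_base64` is hypothetical, placing the delimiter before every value (including a single one) is inferred from this thread, and the string encoding is a guess. It may therefore not reproduce the exact value returned by the processor.

```python
import base64
import hashlib
from typing import Iterable

# Assumption: the delimiter is a single zero byte, as described in this thread.
DELIMITER = b"\x00"


def delimited_md5_base64(values: Iterable[str], encoding: str = "utf-8") -> str:
    """Base64(MD5(0x00 + v1 + 0x00 + v2 + ...)) for the given values.

    Assumption: `encoding` is a guess; the thread does not say how the
    processor turns strings into bytes.
    """
    hasher = hashlib.md5()
    for value in values:
        hasher.update(DELIMITER)               # delimiter precedes each value
        hasher.update(value.encode(encoding))  # then the value's bytes
    return base64.b64encode(hasher.digest()).decode("ascii")


plain = base64.b64encode(hashlib.md5(b"a").digest()).decode("ascii")
delimited = delimited_md5_base64(["a"])
print(plain)               # digest of the bare value
print(delimited)           # digest once a delimiter byte is included
print(plain == delimited)  # the extra byte changes the digest
```

Even with a single field, the extra byte is part of the hashed input, so the result cannot match a plain MD5 of the value computed by another tool.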

neu5ron commented 1 year ago

OK, is there a possibility to add an option to change this? We have years of data with fingerprints/hashes, and after moving everything to ingest pipelines the fingerprints do not match those from Logstash or the previous ETL provided by Elastic.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

neu5ron commented 1 year ago

it would be great to have consistent hashes over the years. thank you!

neu5ron commented 1 year ago

or at least make it not add a null byte if hashing a single field.

g0tr3wt commented 1 year ago

Bump 🥶

neu5ron commented 5 months ago

Hi, I wanted to follow up on this issue. I know this may be the expected result, since it was built for the Elasticsearch fingerprint processor. However, this is not how it works in Logstash or Filebeat. It also makes things difficult in a field like cyber security, where it is necessary to share hashes across communities and environments using all sorts of technology. If those of us using Elastic share inconsistent hashes with the community, it puts us in a difficult position. I continue to see the fingerprint processor used (as recently as 2 days ago) in Elastic ingest pipelines for ECS, and I know this issue will only continue to grow in the future.

Personally, I have solved this: I found an undocumented hashing technique outside of a processor by using Painless. However, I don't want the majority of the community using Elastic to remain separated from everyone else by sharing incorrect intel.