Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Ability to dedupe log messages #466

Open ghost opened 10 years ago

ghost commented 10 years ago

It would be nice to have the ability to dedupe identical log messages. This would be particularly useful for folks who use the raw TCP method for sending logs, for example "cat /var/log/messages | nc graylog2-server 5555".

lennartkoopmann commented 10 years ago

Thank you! Great idea.

kroepke commented 10 years ago

Do you mean dedup of identical, consecutive messages, the way some syslog systems behave?

ghost commented 10 years ago

The dedupe I'm talking about would be in the sense of feeding Graylog the same, identical '/var/log/messages' file twice but only keeping one copy of the messages. Put another way: feeding Graylog a /var/log/messages file where you are only interested in logging unique lines that have not been entered into Graylog yet. For example, at 12:00 noon you pipe /var/log/messages from host foo to a TCP port on Graylog, and again at 6:00 PM you pipe /var/log/messages from the same host. I'd like to record only the differences.


kroepke commented 10 years ago

Unfortunately that is very expensive to do, because it would mean searching for every message that comes in and deciding whether it has already been seen, at least with an approach like netcat.

The only way I could see it working would be to register and fingerprint files (or rather portions of files), and then be able to re-sync into them. What kind of data are you looking to treat like this?
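As a rough sketch of what that fingerprint/re-sync idea could look like on the sending side (purely illustrative, not an existing Graylog feature; the state file and paths are made up):

```python
import hashlib
import json
import os

STATE_FILE = "shipper_state.json"  # hypothetical local state: fingerprint -> bytes already sent
HEADER_BYTES = 4096                # fingerprint only the beginning of the file


def fingerprint(path):
    """Identify a file by hashing its first few KB, so a re-sent copy is recognized."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(HEADER_BYTES)).hexdigest()


def new_lines(path):
    """Yield only the lines that were not shipped on a previous run."""
    state = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
    fp = fingerprint(path)
    offset = state.get(fp, 0)
    with open(path, "rb") as f:
        f.seek(offset)                 # re-sync into the already-registered file
        for line in f:
            yield line.decode("utf-8", errors="replace")
        state[fp] = f.tell()           # remember how far we got
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)


if __name__ == "__main__":
    for line in new_lines("/var/log/messages"):
        print(line, end="")            # in practice: send to the Graylog raw TCP input
```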

deepybee commented 10 years ago

This is a sorely missing feature for me, and one of the things I miss most from having used Splunk previously.

In my current use case I have to hit an API once per second, and the responses have a massive variance profile: most of the time the values are identical to the previous poll, but if, for example, a client is doing a bulk import, the responses vary on very short timescales.

I can see how it would be an expensive process, so perhaps it could be enabled in the config file like other resource heavy processes, such as match highlighting?

I see there's no stated timeline on any of the milestones, including Gemini. Is there at least an order of magnitude in mind? Months / quarters / years?

kroepke commented 10 years ago

Gemini etc. will be empty soon; we are going back to version-based milestone planning, but won't do more than one or two at a time. Long-term planning proved too volatile.

As to deduplication, I guess we could do it when displaying, since the results are already in order. As an implementation note, it might be possible to treat it the same way we were planning to treat the multi-index case (such as per-stream indices with overlapping stream contents): by keeping a bloom filter for each result set, we could efficiently filter out the duplicates, on the assumption that dupes are rare. To keep the requested limit of messages we could issue additional queries should we filter out too many.
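To illustrate the idea (a tiny hand-rolled bloom filter over message texts; the sizing and the paging logic are only a sketch, not Graylog code):

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: k hash functions over a fixed-size bit array."""

    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def dedupe_results(pages, limit):
    """Merge already-ordered result pages, skipping messages the filter has (probably) seen.
    If too many messages get filtered out, the caller would fetch further pages to refill the limit."""
    seen = BloomFilter()
    out = []
    for page in pages:                 # pages: iterables of message strings, assumed already sorted
        for message in page:
            if seen.might_contain(message):
                continue               # probable duplicate; cheap to skip, rare by assumption
            seen.add(message)
            out.append(message)
            if len(out) >= limit:
                return out
    return out
```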

brandongalbraith commented 9 years ago

@kroepke Could this be implemented as an optional background task instead of performed on-the-fly?

lindonm commented 5 years ago

What about something that ran periodically (user configured) against the indexes? https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch

In my use case, similar to the API comment above, I will end up with a lot of "overlap": the messages will be unique (identified by a couple of columns) but they will be imported twice.

Having something that could query Elasticsearch for duplicates and then remove them would be great. My coding sucks, but I envision something like:

    Select * from index | Distinct | Where Count > 1 | For Each (delete all except first)
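Roughly along the lines of the linked Elastic blog post, a cleanup script might look like this (connection, index name, and the key fields are placeholders; assumes the Python elasticsearch client):

```python
import hashlib
from collections import defaultdict

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])      # placeholder connection
INDEX = "graylog_0"                                # placeholder index name
KEY_FIELDS = ["source", "message", "timestamp"]    # the columns that identify a unique message


def find_duplicates():
    """Group document IDs by a fingerprint of the key fields."""
    groups = defaultdict(list)
    body = {"query": {"match_all": {}}, "_source": KEY_FIELDS}
    for hit in helpers.scan(es, index=INDEX, query=body):
        key = "|".join(str(hit["_source"].get(f, "")) for f in KEY_FIELDS)
        groups[hashlib.sha1(key.encode()).hexdigest()].append(hit["_id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}


def delete_duplicates():
    """For each duplicate group, keep the first document and bulk-delete the rest."""
    actions = (
        {"_op_type": "delete", "_index": INDEX, "_id": doc_id}
        for ids in find_duplicates().values()
        for doc_id in ids[1:]
    )
    helpers.bulk(es, actions)


if __name__ == "__main__":
    delete_duplicates()
```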

ag-michael commented 5 years ago

This is a sorely missing feature. Perhaps a rule function that can do this would be useful. I can see it being made less expensive using bloom filters or similar algorithms.

For my use case, I'd like to be able to define values (possibly pulled via a data adapter that caches/queries a stream), look them up against incoming stream messages, and then act on those messages, e.g. drop them for the purpose of de-duplication.
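As a rough illustration of that mechanism (not the actual lookup-table or pipeline API, just the general idea of a time-limited fingerprint cache):

```python
import hashlib
import time


class RecentlySeen:
    """Time-limited cache of message fingerprints; answers 'have we seen this lately?'."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.seen = {}  # fingerprint -> last time it was observed

    def is_duplicate(self, message):
        now = time.time()
        # drop expired entries so the cache does not grow without bound
        self.seen = {fp: t for fp, t in self.seen.items() if now - t < self.ttl}
        fp = hashlib.sha1(message.encode()).hexdigest()
        duplicate = fp in self.seen
        self.seen[fp] = now
        return duplicate


cache = RecentlySeen(ttl_seconds=60)
for msg in ["login failed for bob", "login failed for bob", "disk full on /var"]:
    if cache.is_duplicate(msg):
        continue          # this is where a pipeline rule would drop the message
    print("keep:", msg)
```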

I've also wanted to use similar functionality to perform correlations. Example: a network traffic stream shows failed login attempts for a user account, and an end-host monitoring stream alerts when suspicious commands are executed for that user after a threshold of failed login events is seen in the first stream. I've tried to use the stream lookup (slookup) plugin to no avail, but it would be excellent if we could use other streams as data adapters in lookup tables, with results cached to avoid re-running searches too frequently.

konvergence commented 5 years ago

Hi all,

When you use Logstash directly with ES, you can force the document_id with a fingerprint filter. That way Elasticsearch keeps only one copy: a resend with the same fingerprint simply overwrites the document with the existing _id instead of creating a new one.

But I know that Graylog manages the _id itself, so it may not be possible to force it with a fingerprint function.

Example in the filter:

    fingerprint {
      source => "message"
      target => "fingerprint"
      method => "SHA1"
      key => "shuttle $ audit "
      base64encode => true
    }

Example in the output:

    elasticsearch {
      ssl => ${EL_SSL}
      ssl_certificate_verification => ${EL_SSL_VERIF_CERT}
      hosts => "${EL_HOST}:${EL_PORT}"
      index => "${EL_INDEX}"
      document_id => "%{fingerprint}"
      user => ${EL_USER}
      password => #{ES_PASSWORD}
    }

konvergence commented 5 years ago

An approach would be to do it on a stream when you route it to another index.

That way Graylog continues to receive the message into the default index, but for the other index you could compute the _id (gl2_document_id) with a fingerprint function and force the Timestamp field to an existing date field in the message.

hafeidejiangyou commented 4 years ago

I want to dedupe logs because I use HTTP to send them to Graylog: to avoid losing messages when the network is disconnected, the client resends the log, but Graylog may have already received it because of the characteristics of TCP.

I think it is worth adding bloom filters or similar algorithms.

JoeHsu092015 commented 4 years ago

An approach would be to do it on a stream when you route it to another index. That way Graylog continues to receive the message into the default index, but for the other index you could compute the _id (gl2_document_id) with a fingerprint function and force the Timestamp field to an existing date field in the message.

I added a rule to the pipeline, but it didn't work:

rule "document_id" when has_field("full_message") then let document_id = sha1(to_string($message.full_message)); set_field("_id", document_id); set_field("gl2_message_id", document_id); end