Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Pipeline does not recognize streams are from different indices #5016

Open philippri opened 5 years ago

philippri commented 5 years ago

I tried to create different views of the same log data by creating two streams assigned to two different index sets. When manipulating one of these streams using a processing pipeline, data in the other stream is being manipulated, too. The pipeline seems to ignore that it is connected to a single stream and processes all versions of a message in any available stream. This bug report was written as advised over at: https://community.graylog.org/t/anonymized-and-raw-views-of-same-logs-in-different-streams-possible/

Expected Behavior

I expected the processing pipeline to affect only the stream it is connected to, especially when that stream is in a separate index set.

Current Behavior

Instead of only manipulating the log data in the stream the pipeline is connected to, it affects all copies of the events in all index sets.

Steps to Reproduce

  1. create two new index sets
  2. create two new streams, each attached to one of the new index sets
  3. write identical stream rules for the new streams so that the data they contain is identical
  4. create a new processing pipeline connected to only one of the new streams
  5. in this processing pipeline, add a rule that causes a visible change to the log data
  6. compare the data in the new streams; unexpectedly, the pipeline affects both of them as well as the default index set

Context

I am trying to create two views of the log data to set up a system that is GDPR compliant. The view of the logs meant for day-to-day use should be anonymized, while the raw data remains available separately in case tracking down an attacker or similar measures become necessary.

Your Environment

Message Processor Configuration:

  1. Message Filter Chain (active)
  2. Pipeline Processor (active)
  3. AWS Instance Name Lookup (disabled)
  4. GeoIP Resolver (active)

DerPhlipsi commented 5 years ago

I think this is due to "the stream being a child of the message" in the data flow.

Messages are assigned to streams through a meta field in each message that holds the stream IDs; when only a specific stream should be returned, Graylog simply filters on these IDs. As a result, changes made by any pipeline function are applied to the message itself, identified by its message ID, and therefore show up in every stream the message belongs to.
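To illustrate the idea, here is a small toy model in Python. It is not Graylog code: the `Message` class, the `search` and `anonymize` helpers, and the stream names "raw" and "anonymized" are all made up for this sketch. It only assumes what is described above, namely that a message carries a set of stream IDs and that a stream view is a filter on that set.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical message: one object that lists the streams it belongs to."""
    id: str
    fields: dict
    streams: set = field(default_factory=set)


def search(store, stream_id):
    """Return every stored message whose stream-ID set contains stream_id."""
    return [m for m in store.values() if stream_id in m.streams]


def anonymize(message):
    """Hypothetical pipeline rule: overwrite the source IP on the message."""
    message.fields["source_ip"] = "0.0.0.0"


# One message that matched the rules of both streams.
store = {}
msg = Message(id="m1", fields={"source_ip": "192.0.2.10"},
              streams={"raw", "anonymized"})
store[msg.id] = msg

# The "pipeline" is connected to the "anonymized" stream only ...
for m in search(store, "anonymized"):
    anonymize(m)

# ... but the "raw" view returns the very same message object.
raw_view = search(store, "raw")
print(raw_view[0].fields["source_ip"])  # prints 0.0.0.0
```

In this model there is only one message, so a change made through one stream is visible through the other as well, which matches the behaviour reported in this issue.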

The correct way to decouple changes made by different streams/pipelines would be to use the clone_message([message: Message]) function to create a new message, so that the changes are made on separate messages for their respective streams.
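The effect of cloning can be sketched with a toy model in Python. Again, this is not Graylog code: `Message` and `clone_for_stream` are hypothetical names used only for illustration, and the helper is just loosely modelled on what clone_message is expected to achieve (a second, independent message). The stream names are made up as well.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical message: one object that lists the streams it belongs to."""
    id: str
    fields: dict
    streams: set = field(default_factory=set)


def clone_for_stream(message, new_id, stream_id):
    """Hypothetical helper: deep-copy the fields into a new message with its
    own ID, assigned only to the given stream."""
    return Message(id=new_id,
                   fields=copy.deepcopy(message.fields),
                   streams={stream_id})


# The original stays in the "raw" stream only.
original = Message("m1", {"source_ip": "192.0.2.10"}, {"raw"})

# The clone goes to the "anonymized" stream and is the only one modified.
clone = clone_for_stream(original, "m1-clone", "anonymized")
clone.fields["source_ip"] = "0.0.0.0"

print(original.fields["source_ip"])  # prints 192.0.2.10
print(clone.fields["source_ip"])     # prints 0.0.0.0
```

Because the clone is a separate message with its own ID, changing it leaves the original untouched. The cost is that the data is stored twice.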

Feel free to correct me if I'm wrong. But I think this is the issue at hand here.

Greetings, Philipp

philippri commented 5 years ago

I managed to implement different views of the same data using the mechanisms @DerPhlipsi suggested, thanks again for that. If interested, please have a look at https://community.graylog.org/t/anonymized-and-raw-views-of-same-logs-in-different-streams-possible/ for details. I am not closing this issue, though, as what @jalogisch told me over at the community page makes me think the behaviour described in the above issue is not intended.

hulkk commented 5 years ago

This bug is still valid (with 2.4.6, and because the issue is still open, it might be valid in 2.5.x and 3.x).

Pipelines have a global impact instead of a stream-specific impact.