Azure / logstash-output-kusto

Logstash output for Kusto
Apache License 2.0

Save files based on processing time and not event time #44

Open KhaoticMind opened 2 years ago

KhaoticMind commented 2 years ago

Currently the output plugin saves temporary files (the ones that will be sent to ADX) based on the @timestamp field of the events.

When working with a large environment we may have cases where devices aren't fully time-synced and send events close together in real time but with very different @timestamp values. In this case the plugin ends up creating many small files to send to ADX, which can increase the load on the cluster (many small files being ingested) and the cost of the service (each small file triggers its own write operation, and those operations add up on the total bill).
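
To make the effect concrete, here is a minimal plain-Ruby sketch (not plugin code; the per-minute bucket key is a hypothetical stand-in for the plugin's time-based file naming). Three events that arrive within the same minute land in three different files when keyed by event time, but in a single file when keyed by processing time:

   # Hypothetical per-minute bucket key, mimicking time-based file naming
   def bucket_key(t)
     t.strftime('%Y-%m-%d %H:%M')
   end

   arrival = Time.now
   events = [
     { '@timestamp' => arrival },        # device clock in sync
     { '@timestamp' => arrival - 3600 }, # device an hour behind
     { '@timestamp' => arrival + 120 },  # device slightly ahead
   ]

   puts events.group_by { |e| bucket_key(e['@timestamp']) }.size # => 3 files
   puts events.group_by { |_| bucket_key(arrival) }.size         # => 1 file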

We were facing this issue in our environment, and by customizing the filter stage in Logstash as shown below we forced all events to be written to files based on processing time rather than event time. This cut our write-operation costs by 95% (yes, ninety-five percent!).

   # Preserve the original event time before we overwrite @timestamp
   mutate {
      copy => { "@timestamp" => "event_timestamp" }
   }
   # Record the time Logstash processed the event
   ruby {
      code => "event.set('logstash_processed_at', Time.now());"
   }
   # Overwrite @timestamp with the processing time
   mutate {
      copy => { "logstash_processed_at" => "@timestamp" }
   }
   # Drop the temporary field
   mutate {
      remove_field => ["logstash_processed_at"]
   }
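
The four steps above could likely be collapsed into a single ruby filter; this is an untested sketch that assumes LogStash::Timestamp is available inside the ruby filter (it is in recent Logstash versions):

   ruby {
      code => "
         # keep the original event time so it can still be queried in ADX
         event.set('event_timestamp', event.get('@timestamp'))
         # stamp the event with processing time so the output plugin
         # buckets temporary files by when Logstash handled the event
         event.set('@timestamp', LogStash::Timestamp.now)
      "
   }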

Before the change we saw many small files (a few kilobytes each) being ingested every minute; now we have just one 100-200 MB file per minute.

While we know that we need to fix the clocks on the devices sending the data, this issue can also happen because of buffering and send delays (network disconnects and the like). @avneraa and @ag-ramachandran are aware of our situation. Unfortunately, we didn't have the time to change the plugin code ourselves and contribute a PR.