elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Warning Message when insufficient space to hold multiple persistent queues on a file system could be clearer #14839

Open robbavey opened 1 year ago

robbavey commented 1 year ago

logstash version

Logstash 7.x >= 7.17.5, Logstash 8.x >= 8.3.0

Steps to reproduce:

  1. Configure multiple pipelines where the total PQ size requested is greater than the number of bytes remaining on the allocated disk
  2. Start logstash

When starting up, Logstash checks the total amount of space required for the PQs on a given file system against the amount of disk space left on that file system, and logs a warning when the total required exceeds what is available.

However, the warning message emitted is difficult to follow and does not make the correct remediating action clear:

I set up a config on my laptop where I have 312Gi free on my local drive, and configured two pipelines, each with

queue.max_bytes: 300gb

configured and started up logstash.

I received the following warning message:

[2023-01-11T17:43:50,148][WARN ][logstash.persistedqueueconfigvalidator] The persistent queue on path "/Users/robbavey/logstash-8.5.0/data/queue/test" won't fit in file system "/dev/disk1s2" when full. Please free or allocate 643171352576 more bytes. The persistent queue on path "/Users/robbavey/logstash-8.5.0/data/queue/test2" won't fit in file system "/dev/disk1s2" when full. Please free or allocate 643171352576 more bytes.

This number - Please free or allocate 643171352576 more bytes. - feels a little confusing, as I actually need fewer bytes than that for the PQs to operate successfully.

The number appears to be (total size of required disk across all PQs) - (disk used across all PQs)

But the disk may not be dedicated to PQ and the number may be misleading.
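The arithmetic can be checked with a short Ruby sketch. It assumes that `gb` in `queue.max_bytes` means 2^30 bytes and that the 312Gi of free space is exact. The roughly 1 GiB of existing queue usage is inferred from the reported figure; it does not appear in the log.

```ruby
GIB = 1024**3

max_bytes_per_queue = 300 * GIB          # queue.max_bytes: 300gb (assumed to mean GiB)
queue_count         = 2
free_bytes          = 312 * GIB          # free space on the drive, as described above
reported_shortfall  = 643_171_352_576    # number taken from the warning message

total_required = max_bytes_per_queue * queue_count

# Usage implied by the formula (total required) - (disk used across all PQs)
implied_pq_usage = total_required - reported_shortfall

# Space that actually has to be found for both queues to fit when full
actual_shortfall = total_required - free_bytes

puts "total required:   #{total_required} bytes"
puts "implied PQ usage: #{implied_pq_usage} bytes"
puts "actual shortfall: #{actual_shortfall} bytes (#{actual_shortfall / GIB} GiB)"
```

Under these assumptions the warning asks for about 599 GiB, while freeing 288 GiB would be enough.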

It may be more useful to report

It may also be worth strengthening the warning to state that Logstash may fail to start if this is not resolved.

donoghuc commented 1 week ago

As I'm still learning these concepts, can I extend your example to check my understanding of this proposed improvement, so you can review it before I move forward with an implementation?

In your example we configure two pipelines with persistent queues. The relevant config settings are:

PQ1

queue.max_bytes: 300gb
queue.path: /Users/robbavey/logstash-8.5.0/data/queue/test1

PQ2

queue.max_bytes: 300gb
queue.path: /Users/robbavey/logstash-8.5.0/data/queue/test2

The LogStash::PersistedQueueConfigValidator#check https://github.com/elastic/logstash/blob/046ea1f5a8a5e26f82d1495bb2ed9a7a3fe96332/logstash-core/lib/logstash/persisted_queue_config_validator.rb#L36-L66 method is used for computing the resource requirements relevant to this warning.

Currently for each of these configs we read in the max_bytes from the config and use the path to determine what filesystem that queue will occupy. In our case both /Users/robbavey/logstash-8.5.0/data/queue/test1 and /Users/robbavey/logstash-8.5.0/data/queue/test2 will occupy /dev/disk1s2. Given we have set the max_bytes greater than the capacity of /dev/disk1s2 we add to the total for that filesystem https://github.com/elastic/logstash/blob/046ea1f5a8a5e26f82d1495bb2ed9a7a3fe96332/logstash-core/lib/logstash/persisted_queue_config_validator.rb#L63-L65.

There are several shortcomings with this approach:

  1. The required bytes for a filesystem only appear to be updated when one of the configs is computed not to have enough space.
  2. We assume that the filesystem is dedicated to PQ storage, which is probably not a safe assumption.
  3. It is not clear to a consumer of the warning how to act on it.

Proposed improvement: Here is a proposed example warning

The `max_bytes` allocated for persistent queues for filesystem '/dev/disk1s2' exceeds available space.
'/dev/disk1s2' filesystem status:
- Total space required: 600gb
- Currently free space: 312gb
- Current PQ usage: 50gb
- Additional space needed: 288gb

Individual queue requirements on '/dev/disk1s2' filesystem:
  /Users/robbavey/logstash-8.5.0/data/queue/test1:
    Current size: 30gb
    Maximum size: 300gb
  /Users/robbavey/logstash-8.5.0/data/queue/test2:
    Current size: 20gb
    Maximum size: 300gb

Please either:
1. Free up disk space
2. Reduce queue.max_bytes in your pipeline configurations
3. Move PQ storage to a filesystem with more available space
Note: Logstash may fail to start if this is not resolved.

What this would involve is passing through all the configs to build up all the required max_bytes and find all the paths. Once we have this information we can partition by the file systems and build a warning (in our example it is just a single filesystem, but there could be several if paths are configured on multiple filesystems).
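A minimal sketch of that pass, to make the grouping step concrete. `QueueInfo` and `pq_space_reports` are hypothetical names for illustration, not part of the existing validator. The sketch assumes the filesystem, current size and free space for each queue have already been looked up.

```ruby
# Hypothetical record for one persistent queue. In Logstash these values
# would come from pipeline settings and file system lookups.
QueueInfo = Struct.new(:path, :filesystem, :max_bytes, :current_bytes)

# Groups queues by filesystem and returns one report per filesystem whose
# queues cannot all grow to their configured max_bytes.
def pq_space_reports(queues, free_bytes_by_fs)
  reports = queues.group_by(&:filesystem).map do |fs, fs_queues|
    required = fs_queues.sum(&:max_bytes)
    used     = fs_queues.sum(&:current_bytes)
    free     = free_bytes_by_fs.fetch(fs)
    # Mirrors the example warning: required space minus currently free space.
    needed   = required - free
    next if needed <= 0

    { filesystem: fs, required: required, used: used,
      free: free, needed: needed, queues: fs_queues }
  end
  reports.compact
end

GB = 1024**3
queues = [
  QueueInfo.new("/Users/robbavey/logstash-8.5.0/data/queue/test1", "/dev/disk1s2", 300 * GB, 30 * GB),
  QueueInfo.new("/Users/robbavey/logstash-8.5.0/data/queue/test2", "/dev/disk1s2", 300 * GB, 20 * GB)
]
reports = pq_space_reports(queues, { "/dev/disk1s2" => 312 * GB })
reports.each { |r| puts "#{r[:filesystem]}: #{r[:needed] / GB}gb additional space needed" }
```

Whether "additional space needed" should also subtract the space the queues already occupy is a design choice. The sketch follows the figures in the example warning and does not subtract it.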

Implementation notes: From reviewing the existing methods in that class, I think we could compute the space available on a file system as well as the space used under a given queue path. This should allow us to distinguish between what we are explicitly using for Logstash queue storage and what may be used by other entities on the system. I also see this util module https://github.com/elastic/logstash/blob/046ea1f5a8a5e26f82d1495bb2ed9a7a3fe96332/logstash-core/lib/logstash/util/byte_value.rb#L57-L73 which I think would be nicer for printing human-readable byte values.
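For illustration, this is the kind of formatting meant by human-readable bytes. It is a standalone sketch that deliberately does not use the linked util module, whose API is not reproduced here. `human_bytes` and `BYTE_UNITS` are hypothetical names, and units are assumed to be powers of 1024.

```ruby
BYTE_UNITS = %w[b kb mb gb tb pb].freeze

# Formats a byte count using the largest unit that keeps the value below 1024.
def human_bytes(bytes)
  value = bytes.to_f
  unit = BYTE_UNITS.first
  BYTE_UNITS.each do |u|
    unit = u
    break if value < 1024 || u == BYTE_UNITS.last
    value /= 1024
  end
  # Show whole numbers without a decimal part, otherwise one decimal place.
  formatted = value == value.round ? value.round.to_s : format("%.1f", value)
  "#{formatted}#{unit}"
end

puts human_bytes(643_171_352_576)  # the figure from the original warning
puts human_bytes(288 * 1024**3)
```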