elastic / logstash-devutils

An assortment of tooling/libraries to make Logstash core and plugin development and releasing a bit easier.
Apache License 2.0

LogStash configuration is compiled for every sample #33

Open ggrossetie opened 9 years ago

ggrossetie commented 9 years ago

We have quite a complex configuration file (approximately 1500 lines), and I believe this line is costly: https://github.com/elastic/logstash/blob/master/lib/logstash/pipeline.rb#L26

If I understand this code correctly, a new pipeline is created for every sample() call, meaning that the configuration is compiled every time.

This means the following test will load and compile the same configuration 3 times:

require "test_utils"

describe "apache common log format" do
  extend LogStash::RSpec

  config %q{
    # Your logstash filter config here
  }

  sample "apache log in debug" do
    insist { subject["level"] } == "DEBUG"
  end
  sample "apache log in info" do
    insist { subject["level"] } == "INFO"
  end
  sample "apache log with stack trace" do
    insist { subject["stacktrace"] } == "oups!"
  end
end

Would it be possible to configure this behavior? We have 300+ samples and running our test suite is really slow. Having said that, I think this behavior should be the default. If I want to use a different configuration, I can call config again with the new configuration, no?

ggrossetie commented 9 years ago

Each time pipeline.instance_eval is called, LogStash::Pipeline.new(config) is invoked. I've modified logstash_helpers.rb to add timings:

      let(:results) do
        puts "do results"
        results = []
        start = Time.now
        pipeline.instance_eval { @filters.each(&:register) }
        puts "pipeline.instance_eval in #{Time.now - start} s"

        start = Time.now
        event.each do |e|
          # filter calls the block on all filtered events, including new events added by the filter
          pipeline.filter(e) { |filtered_event| results << filtered_event }
        end
        puts "pipeline.filter in #{Time.now - start} s"

        start = Time.now
        # flush makes sure to empty any buffered events in the filter
        pipeline.flush_filters(:final => true) { |flushed_event| results << flushed_event }
        puts "pipeline.flush_filters in #{Time.now - start} s"

        start = Time.now
        # keep the filtered results so the let block returns them (not the nil from puts)
        selected = results.select { |e| !e.cancelled? }
        puts "results.select in #{Time.now - start} s"
        selected
      end

And here is the output:

do results
pipeline.instance_eval in 19.147 s
pipeline.filter in 0.018 s
pipeline.flush_filters in 0.001 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.688 s
pipeline.filter in 0.003 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.342 s
pipeline.filter in 0.002 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results
ggrossetie commented 9 years ago

Actually the value is cached across multiple calls within the same sample, but not across samples. See: https://www.relishapp.com/rspec/rspec-core/v/3-3/docs/helper-methods/let-and-let
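
To illustrate why, here is a minimal, self-contained sketch of `let`'s memoization semantics in plain Ruby (no RSpec or Logstash required): `let` memoizes on the example instance, and RSpec creates a fresh instance per example, so the expensive build runs once per sample. `Example` and `compile_count` are illustrative names, not part of any real API.

```ruby
# Plain-Ruby stand-in for RSpec's per-example memoization.
class Example
  def self.compile_count
    @compile_count ||= 0
  end

  def self.compile_count=(n)
    @compile_count = n
  end

  # Stands in for `let(:pipeline) { LogStash::Pipeline.new(config) }`
  def pipeline
    @pipeline ||= begin
      self.class.compile_count += 1 # simulate the expensive compile
      "compiled pipeline"
    end
  end
end

a = Example.new
a.pipeline                  # first access: compiles
a.pipeline                  # memoized: no second compile within the same example
b = Example.new             # a fresh instance, as RSpec creates per example
b.pipeline                  # compiles again
puts Example.compile_count  # => 2
```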

guyboertje commented 9 years ago

@Mogztter - would it help if the sample method allowed you to pass in a pipeline object?

  def sample(sample_event, user_pipeline = nil, &block)
    name = sample_event.is_a?(String) ? sample_event : LogStash::Json.dump(sample_event)
    name = name[0..50] + "..." if name.length > 50

    describe "\"#{name}\"" do
      let(:pipeline) { user_pipeline || LogStash::Pipeline.new(config) }
      ...
    end
  end

However, I would be worried about the pipeline having some in-flight events internally. The pipeline object does not have a reset method.

Obviously you want a clean setup for RSpec.

ggrossetie commented 9 years ago

@guyboertje Yes, this is one way of solving the problem, but I think the "end user" should not need to be aware of LogStash internals (the pipeline).

It would be nice if we were able to execute multiple samples against the same configuration using an RSpec context. I think it makes sense to state: in this context (with this configuration) I want to execute samples A and B. Behind the scenes, a pipeline would be instantiated and configured only once to execute samples A and B.

If I want to execute sample A with another configuration, I can declare another context with another configuration.

guyboertje commented 9 years ago

@Mogztter - in this case the "end user" has the role of test case developer, and with this role comes some degree of responsibility for the objects under test. Presently, the sample method hides the fact that the 'object under test' is really a (new) pipeline instance in an RSpec let block (preserving the clean-state contract for each it block). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.

I know what you want to achieve, but I think there is a strong chance of 'not clean slate' tests.

Also, FWIW, this issue is the correct place to generally discuss this (not your PR, as it is only one possible solution to this issue).

ph commented 9 years ago

I am okay with having a way to execute multiple tests with a single pipeline instance; I understand the cost of bringing one up for every test. But the problem with the #sample method is the context: it is used by plugin authors, core developers, and config developers, and they don't have the same requirements.

By requirements, I mean they expect a clean state on every call of sample, which is the current behavior. This is usually what you want when you are developing the system: you want to make sure that every call is independent of the others, and one way to do that is to have a new pipeline under evaluation.

On the other hand, you as the devops/user want fast feedback against the same configuration, and the sample method doesn't provide that out of the box.

ph commented 9 years ago

@guyboertje @Mogztter maybe having a subset of specialized helpers targeted at config developers would be better?

jsvd commented 9 years ago

@Mogztter everything you're fighting here are RSpec mechanisms that ensure isolation in tests. The reason you're fighting them is that Logstash is slow when setting up the filters :(

I wonder if we can profile the filter register calls and see which ones are making this delay so big.

ph commented 9 years ago

@Mogztter the problem with changing sample's behavior is that it can impact other plugins or core tests.

guyboertje commented 9 years ago

@ph - Note: @Mogztter is also (in the PR) proposing to change the config helper method behaviour.

I was thinking of this:

let(:config) { }
samples do |p|
  # or let(:config) { }
  sample("apache log in debug", p) do
    #expectation
  end
  sample("other", p) do
    # expectation
  end
end

where the samples method provides the 'clean' pipeline. Also, by injecting it into sample here and not providing it elsewhere we retain backwards compatibility.

There is a way to hide the (reused) pipeline provided by samples, but it means more work in sample.

All of the above is dependent on a way to reset or light-weight clone a pipeline instance when receiving it in sample.
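
A minimal, self-contained sketch of the injection idea above, in plain Ruby: the `samples` block builds the expensive object once and passes it into each `sample`. `FakePipeline` stands in for LogStash::Pipeline, and these `samples`/`sample` bodies are simplified illustrations, not the real helper implementations.

```ruby
# Stand-in for LogStash::Pipeline; counts how often the expensive compile runs.
class FakePipeline
  @@compiles = 0

  def initialize(config)
    @@compiles += 1 # simulate the expensive compile step
    @config = config
  end

  def self.compiles
    @@compiles
  end
end

# Build one shared pipeline for the whole block and yield it to the samples.
def samples(config)
  shared = FakePipeline.new(config)
  yield shared
end

# Each sample reuses the injected pipeline instead of building its own.
def sample(name, pipeline)
  yield pipeline
end

samples("filter { mutate { } }") do |p|
  sample("apache log in debug", p) { |pl| pl }
  sample("other", p) { |pl| pl }
end

puts FakePipeline.compiles # => 1
```

The caveat from the comment above still applies: without a reset or lightweight clone, state left in the shared pipeline by one sample can leak into the next.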

guyboertje commented 9 years ago

@ph - yes, perhaps tailored helpers are a good idea.

ggrossetie commented 9 years ago

in this case the "end user" has the role of test case developer, and with this role comes some degree of responsibility for the objects under test. Presently, the sample method hides the fact that the 'object under test' is really a (new) pipeline instance in an RSpec let block (preserving the clean-state contract for each it block). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.

I agree, but I would prefer to use a utility method rather than manually instantiating and configuring a Pipeline object.

Also, FWIW, this issue is the correct place to generally discuss this (not your PR, as it is only one possible solution to this issue).

The idea is not to merge this pull request but to highlight which parts of the code need to be enhanced.

I am okay with having a way to execute multiples test with a single pipeline instance, I understand the cost of bringing one up on every tests. But the problem with the #sample method is the context, it is used by plugin author, core developers and config developers and they don't have the same requirements.

I understand

maybe having a subset of specialized helpers targeted at config developers would be better?

Sounds great.

The reason you're fighting it is because logstash is slow when setting up the filters

The most costly part of the code is LogStash::Pipeline.new(configstr). In my output, pipeline.instance_eval took 19 seconds because the Pipeline was lazily instantiated before the instance_eval invocation.

I wonder if we can profile the filter register calls and see which ones are making this delay so big.

That would be nice and would also make LogStash start faster.


A simple enhancement could be to create a new method, static_config. This method would instantiate and configure a Pipeline once and reuse that pipeline for subsequent sample calls.

The behavior of sample with config would remain the same.
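
A minimal, self-contained sketch of what such a static_config helper could look like, in plain Ruby. `FakePipeline`, `PIPELINE_CACHE`, and the helper body are assumptions for illustration, not the real API: the idea is simply to memoize the compiled pipeline by config string.

```ruby
# Stand-in for LogStash::Pipeline; counts how often the expensive compile runs.
class FakePipeline
  @@compiles = 0

  def initialize(configstr)
    @@compiles += 1 # simulate the expensive compile
    @configstr = configstr
  end

  def self.compiles
    @@compiles
  end
end

PIPELINE_CACHE = {}

# Memoize by config string: the same config yields the same pipeline instance,
# so subsequent sample calls skip the compile entirely.
def static_config(configstr)
  PIPELINE_CACHE[configstr] ||= FakePipeline.new(configstr)
end

p1 = static_config("filter { grok { } }")
p2 = static_config("filter { grok { } }") # cache hit, no recompile
puts p1.equal?(p2)          # => true
puts FakePipeline.compiles  # => 1
```

The trade-off is the same one discussed above: a cached pipeline is not a clean slate, so any in-flight or buffered events carry over between samples.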

purbon commented 9 years ago

This is actually a pain: the way sample works and is used is very slow, but for now it is used in tons of places in Logstash. I see several open questions here.

And in the end we have a full pipeline test, which I think is what people who aim to test a configuration really need. This would certainly also benefit from a sample-like method, although I would most probably change its implementation in a direction closer to how the input helper works.

What do you think?

I wonder, @Mogztter, what is your main use case for writing tests? Do you want to test plugins? A given configuration for Logstash?

[1] input helper usage example: https://github.com/logstash-plugins/logstash-input-relp/blob/master/spec/inputs/relp_spec.rb#L28-L56

ggrossetie commented 9 years ago

I wonder, @Mogztter, what is your main use case for writing tests? Do you want to test plugins? A given configuration for Logstash?

Our main use case is to catch regressions in our custom patterns and/or in our LogStash filters configuration. We also use our test suites to add new messages in order to validate that LogStash will properly extract and transform the data. Finally, since we have a fairly complex configuration, we also catch regressions in LogStash itself (plugins) :smile:

So basically we are doing integration tests to make sure that everything works.

Our filters configuration is split by message type but we also have common filters. So we could create smaller configuration files for each message type but it's faster to load/compile the full configuration and start JRuby only once.

purbon commented 9 years ago

Hi, today I started, well, restarted poking around this kind of issue. I wanted to share today's experiments.

In https://github.com/purbon/logstash-helpers I started creating a gem that aims to be a collection of helpful helpers and matchers for Logstash configuration; you can see, for example, an input helper and a grok/patterns matcher. This aims to ease and organize your testing experience in the project.

In https://github.com/purbon/logstash-input-syslog/blob/feature/use-logstash-helpers/spec/inputs/syslog_spec.rb, I did a small example/experiment on having a pipeline helper compile the configuration only once (it could still be improved a bit). This is useful until the configuration changes, at which point it is necessary to redo the compilation.

Some numbers from a run on my laptop with the syslog input:

This means a speedup of approximately 47%, worth pursuing!

I will keep you up to date with the evolution of this; it would be super nice to keep in touch to know how you are using the tests in more detail. From your last comment I see:

Thanks for your feedback @Mogztter !

purbon commented 9 years ago

@Mogztter I was wondering if you would be able to share your config, so I might be able to do some work on improving the experience of people like you doing this kind of testing. Does that make sense to you?

/cheers

ggrossetie commented 9 years ago

Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

I really just want to assert that when a message is received, the data is correctly extracted/transformed. I'm testing the filter part of the configuration, not the input or the output.

From an API point of view, the config and sample methods are perfect, because I'm just providing a configuration file (containing filters) and messages (samples), and then I can write assertions.

ggrossetie commented 9 years ago

Validate custom patterns. Do you know about the patterns spec helpers? You will also find similar matchers in the project I shared with you earlier.

Yes, but I can validate both the custom patterns and the filter configuration at once. I do agree, though, that this is not unit testing.

Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

I really just want to test the filter part of the configuration; I don't care about the input or the output. When a message (sample) is received, I want to assert that the data is extracted/transformed correctly.

From an API point of view, the config and sample methods are perfect. I'm providing a configuration file and a sample, and then I can write my assertions.