Each time `pipeline.instance_eval` is called, `LogStash::Pipeline.new(config)` is invoked. I've modified `logstash_helpers.rb` to add timings:
```ruby
let(:results) do
  puts "do results"
  results = []

  start = Time.now
  pipeline.instance_eval { @filters.each(&:register) }
  puts "pipeline.instance_eval in #{Time.now - start} s"

  start = Time.now
  event.each do |e|
    # filter calls the block on all filtered events, including new events added by the filter
    pipeline.filter(e) { |filtered_event| results << filtered_event }
  end
  puts "pipeline.filter in #{Time.now - start} s"

  start = Time.now
  # flush makes sure to empty any buffered events in the filter
  pipeline.flush_filters(:final => true) { |flushed_event| results << flushed_event }
  puts "pipeline.flush_filters in #{Time.now - start} s"

  start = Time.now
  # capture the selection before the trailing puts, so the let block still
  # returns the results (puts would otherwise make the block return nil)
  selected = results.select { |e| !e.cancelled? }
  puts "results.select in #{Time.now - start} s"
  selected
end
```
And here is the output:
```
do results
pipeline.instance_eval in 19.147 s
pipeline.filter in 0.018 s
pipeline.flush_filters in 0.001 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.688 s
pipeline.filter in 0.003 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results
pipeline.instance_eval in 17.342 s
pipeline.filter in 0.002 s
pipeline.flush_filters in 0.0 s
results.select in 0.0 s
do results
```
Actually the value is cached across multiple calls in the same sample but not across samples. See: https://www.relishapp.com/rspec/rspec-core/v/3-3/docs/helper-methods/let-and-let
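A minimal illustration of that `let` semantics (a hypothetical spec, not from the helpers):

```ruby
describe "let memoization" do
  let(:value) { Object.new }

  it "memoizes within a single example" do
    expect(value).to equal(value) # the second call returns the cached object
  end

  it "re-evaluates the block for each new example" do
    expect(value).to be_an(Object) # a fresh object, built again for this example
  end
end
```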
@Mogztter - would it help if the sample method allowed you to pass in a pipeline object?
```ruby
def sample(sample_event, user_pipeline = nil, &block)
  name = sample_event.is_a?(String) ? sample_event : LogStash::Json.dump(sample_event)
  name = name[0..50] + "..." if name.length > 50

  describe "\"#{name}\"" do
    let(:pipeline) { user_pipeline || LogStash::Pipeline.new(config) }
    # ...
  end
end
```
However I would be worried about the pipeline having some in-flight events internally. The pipeline object does not have a reset method.
Obviously you want a clean setup for RSpec.
@guyboertje Yes, this is one way of solving the problem, but I think the "end user" should not have to be aware of LogStash internals (the pipeline).
It would be nice if we could execute multiple samples against the same configuration using an RSpec `context`.
I think it makes sense to state: in this context (with this configuration) I want to execute samples A and B. Behind the scenes, a pipeline would be instantiated and configured only once to execute both samples A and B.
If I want to execute sample A against another configuration, I can declare another context with that configuration.
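Plain RSpec already offers a once-per-group hook; a minimal sketch of the desired behavior using `before(:all)` (not the current helpers, and the config path is made up):

```ruby
context "with this configuration" do
  before(:all) do
    # Compile and register the pipeline once for all examples in this context.
    @pipeline = LogStash::Pipeline.new(File.read("filters.conf"))
    @pipeline.instance_eval { @filters.each(&:register) }
  end

  it "extracts fields from sample A" do
    results = []
    @pipeline.filter(LogStash::Event.new("message" => "sample A")) { |e| results << e }
    # assertions on results...
  end

  it "extracts fields from sample B" do
    results = []
    @pipeline.filter(LogStash::Event.new("message" => "sample B")) { |e| results << e }
    # assertions on results...
  end
end
```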
@Mogztter - in this case the "end user" has the role of test case developer, and with this role comes some degree of responsibility for the objects under test. Presently, the `sample` method hides the fact that the 'object under test' is really a (new) pipeline instance in an RSpec `let` block (preserving the clean-state contract for each `it` block). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.
I know what you want to achieve, but I think there is a strong chance of 'not clean slate' tests.
Also, FWIW, this issue is the correct place to discuss this in general (not your PR, as it is only one possible solution to this issue).
I am okay with having a way to execute multiple tests with a single pipeline instance; I understand the cost of bringing one up on every test. But the problem with the `#sample` method is the context: it is used by plugin authors, core developers, and config developers, and they don't have the same requirements.
By requirements, I mean they expect a clean state on every call of `sample`, which is the current behavior. This is usually what you want when you are developing the system: you want to make sure that every call is independent of the others, and one way to do that is to have a new pipeline under evaluation.
On the other hand, as the devops/user, you want fast feedback against the same configuration, and the `sample` method doesn't provide that out of the box.
@guyboertje @Mogztter maybe having a subset of specialized helpers targeted at config developers would be better?
@Mogztter all the things you're fighting are RSpec mechanisms to ensure isolation in tests. The reason you're fighting them is that logstash is slow when setting up the filters :(
I wonder if we can profile the filter `register` calls and see which ones are making this delay so big.
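A quick way to do that from a spec, reusing the `instance_eval` trick from the timings above (one-off instrumentation, nothing official):

```ruby
# Time each filter's register call individually to find the slow ones.
pipeline.instance_eval do
  @filters.each do |filter|
    start = Time.now
    filter.register
    puts "#{filter.class.name}#register took #{Time.now - start} s"
  end
end
```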
@Mogztter the problem is that changing the `sample` behavior can impact other plugin or core tests.
@ph - Note: @Mogztter is also (in the PR) proposing to change the `config` helper method behaviour.
I was thinking of this:
```ruby
let(:config) { }

samples do |p|
  # or let(:config) { }
  sample("apache log in debug", p) do
    # expectation
  end

  sample("other", p) do
    # expectation
  end
end
```
where the `samples` method provides the 'clean' pipeline. Also, by injecting it into `sample` here and not providing it elsewhere, we retain backwards compatibility. There is a way to hide the (reused) pipeline provided by `samples`, but it means more work in `sample`.
All of the above depends on a way to reset or light-weight clone a pipeline instance when receiving it in `sample`.
@ph - yes, perhaps tailored helpers are a good idea.
> in this case the "end user" has the role of test case developer, and with this role comes some degree of responsibility for the objects under test. Presently, the `sample` method hides the fact that the 'object under test' is really a (new) pipeline instance in an RSpec `let` block (preserving the clean-state contract for each `it` block). As soon as that is deviated from, the test case developer must take responsibility for the 'object under test'.

I agree, but I would prefer to use a utility method rather than instantiating and configuring a Pipeline object manually.
> Also, FWIW, this issue is the correct place to discuss this in general (not your PR, as it is only one possible solution to this issue).

The idea is not to merge this pull request but to highlight which parts of the code need to be enhanced.
> I am okay with having a way to execute multiple tests with a single pipeline instance; I understand the cost of bringing one up on every test. But the problem with the `#sample` method is the context: it is used by plugin authors, core developers, and config developers, and they don't have the same requirements.

I understand.
> maybe having a subset of specialized helpers targeted at config developers would be better?

Sounds great.
> The reason you're fighting them is that logstash is slow when setting up the filters

The most costly part of the code is `LogStash::Pipeline.new(configstr)`. In my output, `pipeline.instance_eval` took 19 seconds because the pipeline was lazily instantiated just before the `instance_eval` invocation.
> I wonder if we can profile the filter `register` calls and see which ones are making this delay so big.

That would be nice, and it would also make LogStash start faster.
A simple enhancement could be to create a new method, `static_config`. This method would instantiate and configure a pipeline once and reuse it for the subsequent `sample` calls. The behavior of `sample` with `config` would remain the same.
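A sketch of what that hypothetical `static_config` helper might look like (the name and behavior are the proposal above, not an existing API):

```ruby
# Hypothetical helper: compile and register the pipeline once per example
# group, then let every subsequent sample reuse the shared instance.
def static_config(configstr)
  before(:all) do
    @static_pipeline = LogStash::Pipeline.new(configstr)
    @static_pipeline.instance_eval { @filters.each(&:register) }
  end

  # NOTE: reuse means one sample can see in-flight state left by another,
  # which is exactly the clean-slate concern raised above.
  let(:pipeline) { @static_pipeline }
end
```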
This is actually a pain: the way `sample` works and is used is very slow, but for now it is used in tons of places in logstash. I see several open questions here.
In the end we would have a full pipeline test, which I think is what people who aim to test a configuration really need. This would certainly also benefit from a `sample`-like method, although I would most probably change its implementation in a direction closer to how the input helper works [1].
What do you think?
I wonder @Mogztter, what is the main use case for which you write tests? Are you aiming to test plugins, or a given configuration for logstash?
[1] input helper usage example: https://github.com/logstash-plugins/logstash-input-relp/blob/master/spec/inputs/relp_spec.rb#L28-L56
> I wonder @Mogztter, what is the main use case for which you write tests? Are you aiming to test plugins, or a given configuration for logstash?
Our main use case is to catch regressions in our custom patterns and/or in our LogStash filters configuration. We also use our test suites to add new messages in order to validate that LogStash will properly extract and transform the data. Finally, since we have a fairly complex configuration, we also catch regressions in LogStash itself (plugins) :smile:
So basically we are doing integration tests to make sure that everything works.
Our filters configuration is split by message type, but we also have common filters. We could create smaller configuration files for each message type, but it's faster to load/compile the full configuration and start JRuby only once.
Hi, today I restarted poking around this kind of issue, and I wanted to share my experiments.
In https://github.com/purbon/logstash-helpers I started creating a gem that basically aims to be a collection of helpful helpers and matchers for logstash configuration; you can see, for example, an input helper and a grok/patterns matcher. This aims to ease and organize the testing experience in the project.
In https://github.com/purbon/logstash-input-syslog/blob/feature/use-logstash-helpers/spec/inputs/syslog_spec.rb, I did a small example, experimenting with having a pipeline helper compile the configuration only once (it could still be improved a bit). This is useful until the configuration changes, at which point it is necessary to redo the process.
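purbon's helper itself isn't reproduced here; a minimal sketch of the general idea, caching one compiled pipeline per distinct configuration string (all names hypothetical):

```ruby
# Hypothetical cache: one compiled, registered pipeline per configuration
# string, so repeated specs against the same config reuse it.
PIPELINES = {}

def pipeline_for(configstr)
  PIPELINES[configstr] ||= begin
    pipeline = LogStash::Pipeline.new(configstr)
    pipeline.instance_eval { @filters.each(&:register) }
    pipeline
  end
end
```

As noted above, this kind of reuse only stays safe as long as specs don't leave in-flight events in the shared pipeline.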
Some numbers from a run on my laptop with the syslog input:
This means a speedup of approximately 47%, worth pursuing!
I will keep you up to date with the evolution of this. It would be super nice to keep in touch to know how you are using the tests in more detail. From your last comment I see:
Thanks for your feedback @Mogztter !
@Mogztter I was wondering if you would be able to share your config, so I might do some work on meeting the expectations of people like you doing this kind of test. Does that make sense to you?
/cheers
> Configuration validation? Do you test the whole pipeline here? Do you start testing with stdin as input, run the set of filters, and output to stdout?

I really just want to assert that when a message is received, the data is correctly extracted/transformed. I'm testing the filter part of the configuration, not the input nor the output.

From an API point of view, the `config` and `sample` methods are perfect because I'm just providing a configuration file (containing filters) and messages (samples), and then I can write assertions.
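For reference, this is roughly what such a spec looks like with the logstash spec helpers of that era (the filter config and sample line here are made up):

```ruby
require "logstash/devutils/rspec/spec_helper"

describe "apache access log filters" do
  config <<-CONFIG
    filter {
      grok { match => [ "message", "%{COMBINEDAPACHELOG}" ] }
    }
  CONFIG

  # sample builds a pipeline from the config above, feeds the event through
  # the filters, and exposes the result as subject.
  sample('127.0.0.1 - - [11/Dec/2013:00:01:45 -0800] "GET / HTTP/1.1" 200 3638 "-" "Mozilla/5.0"') do
    insist { subject["clientip"] } == "127.0.0.1"
    insist { subject["response"] } == "200"
  end
end
```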
> Validate custom patterns. Do you know about the patterns spec helpers? You will also find similar matchers in the project I shared with you earlier.

Yes, but I can validate both the custom patterns and the filter configuration at once. I do agree that this is not unit testing.
We have quite a complex configuration file (approximately 1500 lines), and I believe that this line is costly: https://github.com/elastic/logstash/blob/master/lib/logstash/pipeline.rb#L26
If I understand this code correctly, a new pipeline is created for every `sample()` call, meaning that the configuration is compiled every time. That means the following test will load and compile the same configuration 3 times:
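The snippet that followed isn't in the extracted thread; a hypothetical example of the shape meant (path and messages made up):

```ruby
require "logstash/devutils/rspec/spec_helper"

describe "my filters" do
  # filters.conf stands in for our ~1500-line configuration
  config File.read("filters.conf")

  # Each sample block builds its own LogStash::Pipeline, so filters.conf
  # is loaded and compiled three separate times here.
  sample("message A") { insist { subject["message"] } == "message A" }
  sample("message B") { insist { subject["message"] } == "message B" }
  sample("message C") { insist { subject["message"] } == "message C" }
end
```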
Would it be possible to configure this behavior? We have 300+ samples, and running our test suite is really slow. Having said that, I think reusing the pipeline should be the default behavior. If I do want to use a different configuration, I can call `config` again with the new configuration, no?