goncalotomas opened this issue 4 years ago
Yikes! It's definitely possible to write that benchmark in a way that doesn't cause this problem, but it also shouldn't be up to the user to find the right way to do that. I'll definitely take a peek at this, as it's not a good user experience and absolutely needs to be fixed.
Thanks for the great report with all the repro steps!
:wave:
Thanks for the report!
Also, I wouldn't trust whatever output you received. Once the system starts swapping, performance will be abysmal wherever swapping occurred.
I'm surprised that it generated so much data, though... but I suppose it might. I don't think it's related to the 50 inputs, though. You should see the same behavior (or so I believe) with just one input run for 100 seconds (or just one function with a 200-second run time). These functions are extremely fast, and we record every single measurement, never throwing any away.
What's more, in our post-processing/statistics we sort all of the run time values, which creates a full copy of the list.
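For a sense of scale, a quick back-of-envelope sketch; the ~1 µs per invocation and the ~100-scenario count here are illustrative assumptions, not measured values:

```elixir
# Rough estimate only -- per-invocation time and scenario count are assumed.
invocations_per_second = 1_000_000          # ~1 µs per call
total_measured_seconds = 2 * 100            # 2 s of measurement x ~100 scenarios
samples = invocations_per_second * total_measured_seconds

# On a 64-bit VM a list cell is two words (16 bytes); small integers live
# directly in the cell, so this is a lower bound.
bytes = samples * 16
IO.puts("#{samples} samples ≈ #{Float.round(bytes / 1.0e9, 1)} GB, before sorting copies the list")
```

Two live copies of such a list (the original plus the sorted one) already put you in multi-gigabyte territory.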
I'm not sure we can viably do anything better here other than warning you, based on some metric, that this might get out of hand. We could implement dumping each job to disk and reloading it, but that would/should be an option and a separate thing. It's also a lot of complexity and performance degradation for the "normal" use case.
With functions this fast, I'd vote for massively reduced run times, which should still give you an ample sample size. Like... 0.1 seconds or something.
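Concretely, that would be something like this (a sketch; the exact values are just illustrative):

```elixir
# Much shorter measurement windows for very fast functions.
Benchee.run(
  %{"fast function" => fn -> Enum.sum(1..10) end},
  warmup: 0.1, # seconds
  time: 0.1    # seconds of measurement per scenario
)
```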
> I wouldn't trust whatever output you received. Once the system starts swapping, performance will be abysmal wherever swapping occurred.
I'm pretty sure that this high memory usage only starts after the benchmarking completes, so I guess the measurements might remain OK.
> You should see the same behavior (or so I believe) with just one input run for 100 seconds (or just one function with a 200-second run time).
That makes sense given what I've seen. I think I tried it, but did not record the end result.
> I'm not sure we can viably do anything better here other than warning you, based on some metric, that this might get out of hand.
I think it's a valid question for any benchmarking tool. Perhaps a warning is appropriate, but I suspect it would lead users to just work around it with something like :timer.sleep/1 or similar, which might ruin the results or hide performance differences.
> These functions are extremely fast, and we record every single measurement, never throwing any away.
Implementation details aside, perhaps the best-effort solution would be to reduce the sample rate when these very fast functions are detected. For example, if the number of iterations exceeds a certain threshold and we know the amount of data generated in each iteration, you could keep executing the function and incrementing the iteration count, but only sample the rest of the metrics every N-th execution.
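A sketch of that idea, purely hypothetical and not how Benchee currently works; `sample_every` and the collector shape are assumptions:

```elixir
defmodule SampledCollector do
  @moduledoc "Hypothetical: count every iteration, record only every N-th run time."

  def collect(fun, deadline_ms, sample_every) do
    do_collect(fun, deadline_ms, sample_every, 0, [])
  end

  defp do_collect(fun, deadline_ms, sample_every, iteration, samples) do
    if System.monotonic_time(:millisecond) >= deadline_ms do
      {iteration, Enum.reverse(samples)}
    else
      {time, _result} = :timer.tc(fun)

      samples =
        if rem(iteration, sample_every) == 0, do: [time | samples], else: samples

      do_collect(fun, deadline_ms, sample_every, iteration + 1, samples)
    end
  end
end
```

The iteration count stays exact (so iterations per second is still correct) while the recorded run times shrink by a factor of `sample_every`.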
Regarding the log-to-disk feature, you can look at Basho Bench and how it logs to disk. It logs one file per operation (equivalent to one Benchee input, I think) and a final summary file that aggregates data from multiple operations. I've attached an example of both in case it's of interest.
Attachments: basho-bench-single-operation-log.txt, basho-bench-summary-log.txt
@goncalotomas That the time explosion comes after the benchmarks finish is interesting behavior; maybe I misread the initial post :thinking: That points even more towards the statistics/sorting part. In that case we could offer simplified statistics or something.
Sampling the executions is a no-go for me; that just means we're executing the function for no real benefit. We might as well turn the time down and execute fewer of them.
As for logging to disk: as I said, yes, that would work, and we even have the serialization/deserialization in place. It's a non-trivial amount of work with potential problems, though. For instance, if I'm not mistaken, benchee can currently run without ever needing write permission to disk (if you use the console output), which is nice. Plus, if the problem truly is our statistics computations, then writing to a file might not even change that much :|
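For reference, that existing serialization/deserialization machinery is, if I remember the option shapes right, what already powers save/load:

```elixir
# Persist a run to disk, then load it back later (normally used for comparisons).
Benchee.run(%{"my job" => fn -> :ok end}, save: [path: "run_1.benchee", tag: "first"])
Benchee.run(%{"my job" => fn -> :ok end}, load: "run_1.benchee")
```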
I'd have to take a deeper look to be certain. Starting to emit warnings at some arbitrary (or somewhat determined) count of samples per scenario would be a good and doable first step, though.
So I'm finding that there's a pretty steady increase in memory usage during just the collection phase (up to ~6.4 GB), and then we see the huge jump.
I'm thinking that in this case we've got two options, really:

1. Warn users when the number of collected samples per scenario passes some threshold.
2. Stop collecting early once we've hit a high enough measure of statistical confidence.
By point 2, I'm thinking mostly about this old issue: https://github.com/bencheeorg/benchee/issues/9
That would be a huge optimization for folks, since it would decrease the runtime dramatically, especially for benchmarks like this where there is little deviation and we'd hit a really high measure of confidence very quickly. It would also have the side effect of using less memory for collecting the data and calculating the statistics.
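A hypothetical sketch of what that dynamic stopping could look like; the 95% z-value, the 1% precision target, and the check interval are all illustrative assumptions rather than a proposed Benchee API:

```elixir
defmodule ConfidentCollector do
  @z 1.96  # 95% confidence

  def collect(fun, max_samples \\ 1_000_000, check_every \\ 10_000) do
    do_collect(fun, max_samples, check_every, 0, [])
  end

  defp do_collect(_fun, remaining, _check_every, _n, samples) when remaining <= 0,
    do: samples

  defp do_collect(fun, remaining, check_every, n, samples) do
    {time, _result} = :timer.tc(fun)
    n = n + 1
    samples = [time | samples]

    # Only check confidence periodically -- checking after every invocation
    # would dominate the run time for fast functions.
    if rem(n, check_every) == 0 and confident?(samples, n) do
      samples
    else
      do_collect(fun, remaining - 1, check_every, n, samples)
    end
  end

  # Stop once the 95% confidence interval of the mean is within 1% of the mean.
  defp confident?(samples, n) do
    mean = Enum.sum(samples) / n

    variance =
      Enum.reduce(samples, 0.0, fn s, acc -> acc + (s - mean) * (s - mean) end) / n

    @z * :math.sqrt(variance / n) < 0.01 * mean
  end
end
```

How often `confident?/2` runs is exactly the fine-tuning question raised a couple of comments down.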
I'm also thinking that since we calculate statistics in parallel, that might be contributing to the problem here a bit. In theory we could make that an option, but I'd prefer to avoid that sort of really low-level configuration.
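If the parallel statistics computation does turn out to be the culprit, one way to bound it without a user-facing option could look like this; `scenarios` and `compute_statistics/1` are stand-in names, not Benchee internals:

```elixir
# Limit how many scenarios have their statistics (and thus sorted copies of
# their run time lists) in flight at once, instead of all of them at once.
scenarios
|> Task.async_stream(&compute_statistics/1,
  max_concurrency: System.schedulers_online(),
  timeout: :infinity
)
|> Enum.map(fn {:ok, stats} -> stats end)
```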
What do y'all think?
Ugh, good point about us doing that in parallel :D That means there are 100 sorts running in parallel here, which really explains this.
Imo both 1 and 2 would be good to implement. I'm not sure how to come up with a good threshold for the warning in 1; 2 is of course one of my most wished-for features. Iirc benchmark.js does (or can do) this, and I want us to be clearly the best :angel:
It has some interesting implementation considerations, though, which we might discuss in the other issue. For instance, how often do we check the confidence? After every invocation seems a bit much. My gut feeling is that it should depend on the run time and the last computed confidence value. There's lots of fine-tuning there, and maybe an option to expose. We might also be able to just cheat and see what benchmark.js or others do :)
Ah, btw: great thought on fixing this by offering 2, @devonestes - it never occurred to me, but it seems great :rocket: :star: :green_heart:
I tried out Benchee to test the performance difference between pattern matching in function clauses and a more traditional approach. I built a dumb module to test this, which you can see here.
Here was my run script:
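Roughly the following; this is a reconstruction, and the module and function names are placeholders rather than the actual code:

```elixir
# Reconstructed sketch -- MyModule and its function names are placeholders.
inputs = for n <- 1..50, into: %{}, do: {"input #{n}", n}

Benchee.run(
  %{
    "function clauses"     => fn n -> MyModule.with_pattern_matching(n) end,
    "traditional approach" => fn n -> MyModule.traditional(n) end
  },
  inputs: inputs,
  warmup: 0,
  time: 2
)
```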
I understand that I could have just called both functions with random inputs, but I'd have had to ensure that I used the same seed for the PRNG. This was the dumbest (and quickest) way I found to set this up.
This effectively creates 50 inputs for each function, and even with just 2 seconds of testing and zero warmup it still generated a lot of data. The problem is that since the data isn't (as far as I can see) written to disk, this culminates in huge RAM usage after the last input is tested. The program hangs for a long time without output, memory gets hammered, and the OS itself becomes a little jittery. Eventually, after waiting for about an hour, I do get the output and Benchee terminates peacefully. Here are some screenshots of what I saw on macOS Catalina:
I understand I could have written the script differently, but I have a feeling this might not be expected behavior.