sasa1977 opened this issue 7 years ago
Haha, I thought this was @sebastian posting, because you reproduced almost verbatim what I talked about with him some time ago on slack :D
One thing that I think would be very useful is graphing/tracking these numbers over time on master. That way you don't need to worry about performance with every change, but you can track a performance degradation back to a certain change when needed.
Yeah, graphing would be very cool, and also helpful for eliminating false positives (an accidental perf degradation caused by unrelated activity on the test machine).
Excellent suggestions.
Do we really need the ability to compare multiple versions with a single call? It sounds like that is going to add a boatload of complexity? How would you actually want to solve it – checkout and rebuild in the background (without affecting the test running machinery)? Although maybe it could be achieved through pointing the test at another folder on disk containing the version against which it should be compared?
All the same, it would be useful to have the ability to only run the test against the current version as well. If you are comparing against some baseline numbers, then these are likely to change while you make tweaks to some experimental code. Running repeated tests on the baseline version would therefore be a waste?
Memory readings could be collected from inside BEAM? That would then discount the memory used by other system processes, which in fact seems desirable.
Also I am not quite clear on where you would want to collect the memory stats for historical analysis? You mean we check in past runtimes to the repository so we have historical comparison data to check against? Or that we store times in some external system for fine grained historic graphs? (Both?)
Do we really need the ability to compare multiple versions with a single call?
It's true that if we run tests once on some commit, we don't need to repeat that run anymore, as long as the results are saved. However, with repeated runs, we can expand our tests and get additional info for the past version(s). Meaning, I can add some extra queries in the future, and then get the perf difference between the current and a past version. Perhaps we could do that in phase 2, though.
How would you actually want to solve it – checkout and rebuild in the background (without affecting the test running machinery)? Although maybe it could be achieved through pointing the test at another folder on disk containing the version against which it should be compared?
Yes, I was thinking about the latter approach. We git clone into another folder (say /tmp/xyz), check out the commit there, build a release, start it, and run some queries.
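A minimal sketch of that flow, as a hypothetical Elixir helper (the repository URL, scratch directory, and the make release build target are placeholders that would need to match our actual setup):

```elixir
defmodule PerfTest.Checkout do
  @moduledoc """
  Hypothetical helper: prepares a reference version in a scratch directory,
  without touching the working copy we are benchmarking.
  """

  # Placeholders: adjust to the real repository URL and build target.
  @repo_url "git@github.com:example/project.git"
  @scratch_dir "/tmp/perf_reference"

  # Clones the repo, checks out the given branch/commit, and builds a release.
  def prepare(ref) do
    File.rm_rf!(@scratch_dir)
    run!("git", ["clone", @repo_url, @scratch_dir])
    run!("git", ["checkout", ref], cd: @scratch_dir)
    run!("make", ["release"], cd: @scratch_dir)
    @scratch_dir
  end

  defp run!(cmd, args, opts \\ []) do
    {output, status} = System.cmd(cmd, args, [stderr_to_stdout: true] ++ opts)
    if status != 0, do: raise("#{cmd} #{Enum.join(args, " ")} failed:\n#{output}")
    output
  end
end
```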
All the same, it would be useful to have the ability to only run the test against the current version as well. If you are comparing against some baseline numbers, then these are likely to change while you make tweaks to some experimental code. Running repeated tests on the baseline version would therefore be a waste?
When I talked about comparing, I mostly meant comparing the state of the HEAD to the current release. The goal is to have some more reliable idea about our performance trends, so that when we release the next version, we can report if there are radical improvements (or degradation).
Memory readings could be collected from inside BEAM? That would then discount the memory used by other system processes, which in fact seems desirable.
I concur :-)
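For reference, the VM exposes per-category allocation counters directly via :erlang.memory/0, so a reading from inside the node could be as small as this sketch (when and how the test harness samples it is left open):

```elixir
defmodule PerfTest.Memory do
  # :erlang.memory/0 reports bytes allocated by the VM itself, broken down
  # by category, so memory used by other OS processes is not counted.
  def snapshot do
    mem = :erlang.memory()

    %{
      total_mib: div(mem[:total], 1024 * 1024),
      processes_mib: div(mem[:processes], 1024 * 1024),
      ets_mib: div(mem[:ets], 1024 * 1024)
    }
  end
end
```

Taking a snapshot before and after a query run (or polling a maximum during it) would give per-query memory numbers without any OS-level tooling.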
Also I am not quite clear on where you would want to collect the memory stats for historical analysis? You mean we check in past runtimes to the repository so we have historical comparison data to check against? Or that we store times in some external system for fine grained historic graphs? (Both?)
I didn't think much about such operational details :-) I'd suggest having something very lightweight in the beginning, say a folder with files (one per measurement). And maybe a script which produces a graph (or graphs) from these files, and maybe mails it to everyone, say once a week. Later on, if we're happy with this data, we can reach for something more mature to hold our time series.
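As a very rough sketch of that lightweight variant (the directory, the file format, and the shape of the measurement map are all invented for illustration):

```elixir
defmodule PerfTest.Store do
  # Hypothetical location and format: one Erlang-term file per measurement.
  @dir "perf_results"

  # `result` is assumed to be a map such as
  # %{commit: "abc123", query: "join_mongodb", time_ms: 42, memory_mib: 310}
  def save(result) do
    File.mkdir_p!(@dir)
    name = DateTime.utc_now() |> DateTime.to_iso8601() |> String.replace(":", "-")
    path = Path.join(@dir, name <> ".etf")
    File.write!(path, :erlang.term_to_binary(result))
    path
  end

  # Loads every stored measurement, e.g. as input for a weekly graphing script.
  def load_all do
    @dir
    |> Path.join("*.etf")
    |> Path.wildcard()
    |> Enum.map(fn path -> path |> File.read!() |> :erlang.binary_to_term() end)
  end
end
```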
I didn't think much about such operational details :-) I'd suggest having something very lightweight in the beginning, say a folder with files (one per measurement). And maybe a script which produces a graph (or graphs) from these files, and maybe mails it to everyone, say once a week. Later on, if we're happy with this data, we can reach for something more mature to hold our time series.
Ok, the crucial bit here being that you want this to be automatically run, say as part of the nightly build?
So there are two separate uses here then: an on-demand comparison of HEAD against a reference version, and an automated (say nightly) run whose results get tracked over time.
acatlas1 has been set aside for these performance tests.
I think I'm not going to be doing this for the near future, as I'm going to be working on anonymization.
I think I'm not going to be doing this for the near future, as I'm going to be working on anonymization.
What do you mean by that? We have a month now of tuning and testing and doing exactly that kind of work!?
Right! But anyway - I'm doing bugs now. I'll reassign to myself if and when I start working on this. Shouldn't this be in M4 if you think it's worth doing now?
Shouldn't this be in M4 if you think it's worth doing now?
I guess it could be in the sense that it has priority. However it's not something I see as needing to be completed for M4 to be considered complete. That's the justification I am giving for not adding it.
Bugs definitely take priority.
And I suppose this task can be done in a piecemeal fashion over time.
@sebastian @cristianberneanu @obrok
Inspired by #1488, I'm starting a discussion on how to assess performance trends in our system. Ideally, it should be simple to compare query times and memory usage for various types of queries and data sources between the current HEAD and some reference point (e.g. the previous release). The current make perftest is a good starting point, but I think we need to extend it to get more detailed and reliable numbers. Here are some ideas based on a cursory scan of the perf test code.

Preferably, we should be able to run something like ./compare_perf release_170200. The script would compare the performance of the current local branch to the desired branch (or commit). The output would be performance numbers (times and memory usage, before/after and relative difference) for each data source (e.g. MongoDB, PostgreSQL, emulated queries) and each feature (simple aggregates, splitters, joins, subqueries). We might also output the numbers for each feature per data source (e.g. joins in a MongoDB data source).

Obviously, it might happen that some queries are not testable with the previous version (for example, if the test query uses some newly supported feature), but this can easily be handled by the test script, which can output something like N/A or error for such cases.

Feel free to add your thoughts or other ideas.
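To make the intended output a bit more concrete, here is a rough sketch of the comparison/reporting step, assuming both versions have already produced measurements as a map keyed by {data_source, feature} (the structure and field names are invented for illustration):

```elixir
defmodule PerfTest.Compare do
  @moduledoc """
  Sketch: diffs two measurement sets of the form
  %{{data_source, feature} => %{time_ms: ..., memory_mib: ...}}.
  Queries missing from the reference run are reported as N/A.
  """

  def report(current, reference) do
    Enum.each(current, fn {{source, feature}, now} ->
      case Map.fetch(reference, {source, feature}) do
        {:ok, base} ->
          IO.puts(
            "#{source}/#{feature}: " <>
              "time #{base.time_ms}ms -> #{now.time_ms}ms (#{diff(base.time_ms, now.time_ms)}), " <>
              "memory #{base.memory_mib}MiB -> #{now.memory_mib}MiB (#{diff(base.memory_mib, now.memory_mib)})"
          )

        :error ->
          # The reference version cannot run this query (e.g. a new feature).
          IO.puts("#{source}/#{feature}: N/A (not testable on the reference version)")
      end
    end)
  end

  defp diff(base, now) when base > 0 do
    change = Float.round((now - base) / base * 100, 1)
    if change >= 0, do: "+#{change}%", else: "#{change}%"
  end

  defp diff(_base, _now), do: "N/A"
end
```

Calling PerfTest.Compare.report(current, reference) would then print one line per data source/feature combination, which is roughly the table described above.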