MiniProfiler / rack-mini-profiler

Profiler for your development and production Ruby rack apps.
MIT License
3.71k stars 404 forks source link

PERF: Improve snapshots page performance #518

Closed OsamaSayegh closed 2 years ago

OsamaSayegh commented 2 years ago

Currently, the time that the snapshots page (/mini-profiler-resources/snapshots) takes to load grows linearly as the number of snapshots grows. By default we limit the number of snapshots to 1000, which is a reasonable limit, but when we reach 1000 snapshots the performance gets really bad that the page takes 4 seconds to load when using the Redis storage backend (everything said here applies to the Redis backend).

The reason for the bad performance is how snapshots are stored -- we keep all snapshots in a single redis hash where snapshot IDs map to marshalled snapshot objects, and when the snapshots page is requested we have to fetch everything from that redis hash and convert the data back to ruby objects to be able to group them and sort them etc. This means that we need to spend a lot time doing IO to transfer all the snapshots data from the redis server to the web server. A snapshot is on average ~280 KBs and since we allow 1000 snapshots by default, we have to fetch ~280 MBs worth of data everytime the snapshots page is requested.

To fix this performance problem, this PR changes the storage model so that every snapshot group is stored separately in its own redis hash, and we keep track of all group names and their worst score in an "overview" redis sorted set. This storage model allows us to respond to the snapshots page without having to fetch all the snapshots data; we can simply read the overview redis sorted set which is very small.

The downside of this new approach/storage model is that we have to do a little more work when saving a new snapshot. Specifically, we have to update the group's score in the overview set if necessary when a new snapshot is added to a group. We also have new limits on the number of groups and the number of snapshots per group that we need to respect when saving a new snapshot. That said, this extra overhead should be negligible and I think the tradeoff is completely worth it.

Note that this change doesn't migrate existing snapshots to the new storage scheme, so when this change is deployed the old snapshots won't show up in the snapshots page (but they'll still be retained in redis).

SamSaffron commented 2 years ago

I think avoiding a migration is fine, are you certain retention rules are good here and that data will expire automatically as time passes by?

OsamaSayegh commented 2 years ago

There are no retention rules, but we do limit the number of snapshots that are kept around. In the old storage model, the limit was 1000 snapshots by default. And now with this new storage model, we keep 50 groups by default (a group being all snapshots that are for the same route e.g. GET topics#create) and each group can keep 15 snapshots by default.

When the limit is reached and we get a new snapshot/group, we add it and evict the snapshot/group with the lowest score. The score of a snapshot is the duration of the request that generated the snapshot, so the slower the request is the higher the score becomes. Groups get their scores from the highest scoring snapshot in them.

Do you think we should add time retention rules on top of the limits we have? We can't expire individual snapshots because they're stored in a hash and redis doesn't support expiring individual hash keys, so we'd have to expire entire groups.

SamSaffron commented 2 years ago

Do you think we should add time retention rules on top of the limits we have?

I guess given this is 100% opt-in ... no retention rules besides limits are ok. As long as we have some way of properly limiting the bloat in Redis and clearing it up.

OsamaSayegh commented 2 years ago

Hmm this made me realize there is a small issue with the limits where if someone lowers the limits after they've been reached, then mini profiler wouldn't adapt to the new limits, i.e. it wouldn't evict the existing snapshots/groups that were beyond the new limits. I'll push a change to handle this case before merging this.

codecov-commenter commented 2 years ago

Codecov Report

Merging #518 (61bd01f) into master (2f79d56) will increase coverage by 0.11%. The diff coverage is 93.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #518      +/-   ##
==========================================
+ Coverage   87.36%   87.48%   +0.11%     
==========================================
  Files          18       18              
  Lines        1282     1286       +4     
==========================================
+ Hits         1120     1125       +5     
+ Misses        162      161       -1     
Impacted Files Coverage Δ
lib/mini_profiler/profiler.rb 83.71% <91.66%> (+0.31%) :arrow_up:
lib/mini_profiler/config.rb 88.46% <100.00%> (+0.14%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2f79d56...61bd01f. Read the comment docs.

OsamaSayegh commented 2 years ago

if someone lowers the limits after they've been reached, then mini profiler wouldn't adapt to the new limits, i.e. it wouldn't evict the existing snapshots/groups that were beyond the new limits. I'll push a change to handle this case before merging this.

This is done.

katelovescode commented 2 years ago

Just for clarity, this is the breaking change mentioned in the changelog, correct?

Note that this change doesn't migrate existing snapshots to the new storage scheme, so when this change is deployed the old snapshots won't show up in the snapshots page (but they'll still be retained in redis).

Basically we won't be able to see old data after updating our dependency, is that the only part that you'd consider a breaking change?

OsamaSayegh commented 2 years ago

Yes, this is the PR that introduces the breaking change and it only impacts people using the snapshots feature of Mini Profiler. The breaking changes are:

  1. The snapshots_limit config option has been replaced with max_snapshot_groups and max_snapshots_per_group

  2. Technically this won't "break" your application if you upgrade to 3.0.0, but snapshots that were collected before 3.0.0 won't show up in the snapshots page after upgrading to 3.0.0

  3. Snapshots-related methods in AbstractStore (and its subclasses RedisStore and MemoryStore) have been removed, replaced with new methods or had their signature changed:

    • push_snapshot: signature changed
    • load_snapshot signature changed
    • fetch_snapshots: removed
    • snapshot_groups_overview renamed to snapshots_overview
    • find_snapshots_group renamed to snapshots_group
    • fetch_snapshots_overview new method that subclasses of AbstractStore have to implement if they want to support snapshots
    • fetch_snapshots_group new method that subclasses of AbstractStore have to implement if they want to support snapshots