Performance degredation using RBAC in combination with belongsto filters in user groups

evertmulder commented 3 years ago

Hi,

we have a good sized ManageIQ environment and are using tenants, groups for our RBAC. We noticed significant performance differences in using ManageIQ as an Admin user or as a regular user. This leads us to take a closer look if the rbac implementation can be the cause of the slowdown.

We had a call with @agrare @Fryguy @kbrock @gubbe505 and myself to discus if we can find the root cause for this and to do some tests in our environment to get some numbers on the actual time spend in rbac and to deliver SQL logs.

To get some test data I created a little test tool to measure the impact of rbac on listing VMs. You can find the tool here: https://github.com/evertmulder/miq_rbac_log/blob/main/test_rbac.rb

The resulting log (including sql logs) can also be found in this repo: https://github.com/evertmulder/miq_rbac_log/blob/main/log_ivanchuk4_admin.txt --> Run tool as admin https://github.com/evertmulder/miq_rbac_log/blob/main/log_ivanchuk4_user1.txt --> Run tool as user

Note the test user has entitlements (belongsto_filters) as shown here: https://github.com/evertmulder/miq_rbac_log/blob/main/user_entitlement.txt

As can be seen in the logs the rbac does an additional 300+ sql calls.

For this test I am using ivanchuck-4.

We are more than happy to provide more logs or try out things to improve on this.

Fryguy commented 3 years ago

cool!

cc @NickLaMuro @kbrock cause of performance.

Fryguy commented 3 years ago

not based on anything but reading the code, I still think this is the ultimate problem because it walks the tree in a super-inefficient way (rewalks of the tree, in an inefficient way, multiple times in a loop).

https://github.com/ManageIQ/manageiq/blob/4abf8c6ddd9f1985c1264ca407837423695b1dfd/app/models/miq_filter.rb#L56-L67

Additionally, that's just for one tag. Typically there are a lot of tags, and so that method is called in another loop per tag. For example like here and here and here.

I'm not exactly sure how to correct this though, since the belongsto is based on name and not id, and we don't store the name in the relationships table. At a minimum we could probably find a way to do all of the tags at once so we don't walk the tree multiple times.

chessbyte commented 3 years ago

FYI - IIRC, the join table relating a Tag to a polymorphic object is Tagging.

chessbyte commented 3 years ago

also the hierarchical relationship is specified as a string that is parsed then searched for instead of actual SQL relationships. It has always been my intent to fix that - but not sure if that is causing the slow queries or the tag lookups or both

Fryguy commented 3 years ago

there aren't really any literal Tag instances here as far as I can tell...they are just named "tag" in the code (in previous iterations I think they actually were tags).

also the hierarchical relationship is specified as a string that is parsed then searched for instead of actual SQL relationships.

Correct. Some other example belongsto strings: https://github.com/evertmulder/miq_rbac_log/blob/29b9f4f89b18bf9856abfe06ef060fc959307754/user_entitlement.txt#L11-L35

The code I referenced (and the outer loop outside this code) roughly does the following psuedo-code:

belongsto.each do |str|
  parts = parse(str)

  # find the root obj from the first part
  obj = find_in_db(parts.shift)

  # walk the tree with the rest of the parts
  parts.each do |part|
    obj = obj.children.detect(part)
  end
end

So for a belongsto string, it parses it, finds the root object, then walks the tree step by step. Tree walking in this way is incredibly expensive because it has to go from the object, to the relationships table, down to the child relationships records, then load the objects from the actual table, then check the objects by name. That roughly ~6 queries per level in the string hierarchy. Then do it again for the next belongsto string.

So some things I think we can do are:

Preload the entire relationship tree in a single query, then walk that in-memory (might incur a lot of memory though with a huge hierarchy)
Load the children relationship records only (i.e. use .child_rels instead of .children), but don't resolve to the actual object, then filter that list before resolving the actual objects.
Only pull back the names of the records and not the entire actual object

Find a way to "mix" belongs to filters that are similar so they can be processed in one pass. For example:

"/belongsto/ExtManagementSystem|VMWare_VC7",
"/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX",
"/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Templates"]},

all references the same VC, and even a huge part of the same folder structure. If we could walk the tree and apply the filters simultaneously, you remove lot of redundant tree walking.

Medium-term, change how belongsto filters as persisted by not being by name but instead by id. Having it by id would mean we wouldn't have to resolve the relationship record to an actual object at all, and all filtering could be done in the relationship records themselves. I'm not sure this would have problems in the future though if record ids change
- Alternatively, we could store the object name in the relationships table, but I really don't like this idea because you'd pay a penalty on refresh of every record.
- Alternatively2, we could change belongsto to work directly on the relationships tree. That is instead of storing the part to the object, just store the object itself, and/or the relationships record id/ids directly.

chessbyte commented 3 years ago

I think that these belongsto strings should be converted to polymorphic relationships in a SQL table. I have never understood why we store them by name via a hacky string parsing mechanism to later parse them and look them up via SQL queries. Maybe @gtanzillo remembers why they were implemented that way in the first place, but my recollection is that tags were implemented first via the string mechanism and then the hierarchy stuff was added as a feature leveraging the same technique for expediency.

Fryguy commented 3 years ago

Yeah I don't understand why we need the full tree path is stored as opposed to just a reference to the leaf node. This is where my personal understanding of this feature starts to break down.

If we only have the leaf node, we can always reconstruct the path to the root in a much more efficient way (see object2belongsto method) if that full path is needed.

Fryguy commented 3 years ago

I think I figured out a quick PR that might just provide a lot of benefit...putting it together now.

gubbe505 commented 3 years ago

@Fryguy sorry, I tested your PR, but isn't working. We have EmsFolder with the same name on multiple VMware providers and by using the find_object_by_name will return a random EmsFolder with that name, but belonging to a different provider.

A few examples.

{"@timestamp":"2021-05-28T19:43:53.888081 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|NSX-T VC Rotterdam/EmsFolder|Datacenters/EmsFolder|CPXVC-SDN101/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX\" parts=[\"ExtManagementSystem|NSX-T VC Rotterdam\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-SDN101\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"]"}
{"@timestamp":"2021-05-28T19:43:53.912502 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|NSX-T VC Rotterdam/EmsFolder|Datacenters/EmsFolder|CPXVC-SDN101/EmsFolder|vm/EmsFolder|Templates\" parts=[\"ExtManagementSystem|NSX-T VC Rotterdam\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-SDN101\", \"EmsFolder|vm\", \"EmsFolder|Templates\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Templates\"]"}
{"@timestamp":"2021-05-28T19:43:54.476128 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|Oracle/EmsFolder|Datacenters/EmsFolder|CPXVC-VN105a/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX\" parts=[\"ExtManagementSystem|Oracle\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN105a\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"]"}
{"@timestamp":"2021-05-28T19:43:54.507692 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|Oracle/EmsFolder|Datacenters/EmsFolder|CPXVC-VN105a/EmsFolder|vm/EmsFolder|Templates\" parts=[\"ExtManagementSystem|Oracle\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN105a\", \"EmsFolder|vm\", \"EmsFolder|Templates\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Templates\"]"}

Next to that, it the difference isn't a great as expected: User apivnx. options done in 3.43 sec. rbac done in 7.09 sec. (vs 7.39 sec)

Fryguy commented 3 years ago

Thanks @gubbe505 . Yeah, I was concerned about duplicate names which is why it's still wip... I'm working on that next

Fryguy commented 3 years ago

@gubbe505 what does "options done in" and "rbac done in" refer to? What lines did you bookend to count that?

gubbe505 commented 3 years ago

@Fryguy See the first post, there has a script (rbac_test.rb) and 2 log files. The options done is the time of the query Vm.all, and the rbac done is the time of RBAC filtering of the result of the Vm.all.

Fryguy commented 3 years ago

oh i see now - it's in the tool itself (I was looking in the code): https://github.com/evertmulder/miq_rbac_log/blob/main/test_rbac.rb#L8-L16

The options done is the time of the query Vm.all, and the rbac done is the time of RBAC filtering of the result of the Vm.all.

That's not actually accurate. Vm.all there is a no-op because it just returns a relation (Rails lazy queries) and isn't actually resolved until later inside of the rbac timing section. I think the time penalty you are paying in options is actually loading the Vm constant itself (because it lazy loads the code), and I think you are paying a further penalty in the rbac section because it lazy loads all of the descendant classes.

Compare these two things:

> VmOrTemplate.descendants  # Preload the VmOrTemplate class and its descendant classes into memory
> Benchmark.realtime { Vm.all; nil }  # The ; nil prevents irb from executing the query itself
=> 0.06779100000858307
> Benchmark.realtime { Vm.all.to_a; nil } # The to_a is forcing the execution
=> 1.2748610000126064

Can we change your tool slightly? Try this instead:

+ # Warm up the class names
+ Relationship.pluck(:resource_type).uniq.each { |c| c.constantize.descendants }
+
  options = {}
  time_options = Benchmark.realtime {
  options = {
-     :targets => Vm.all,
+     :targets => Vm.all.to_a,
      :user => user,
      :skip_counts => true,
      :offset => 0,
      :limit => 1000
    }
  }

I have a feeling after that the numbers will shift around.

gubbe505 commented 3 years ago

New baseline tests (3x) options done in 3.31/3.30/3.25 sec (3.3 average) rbac done in 3.07/4.23/3.46 sec. (3.5 average)

New PR tests (3x) options done in 3.10/3.51/3.11 sec. (3.3 average) rbac done in 2.36/2.18/.254 sec. (2.4 average)

Still the same warnings because of the find_object_by_name, but an improvement of about 30%. Sound like a good start!

Fryguy commented 3 years ago

I updated the PR with I hope handling for duplicate names as well as being under the correct EMS. It will possibly introduce 2 more queries, but they are smaller ones.

gubbe505 commented 3 years ago

@Fryguy, @kbrock, @chessbyte, @evertmulder New tests after update of PR. This time around I included tests on different datacenters to see the impact of database latency.

Baseline/Pod and DB in same datacenter

Number of VMs 26 options done in 3.25/3.00/3.01 sec. (3.1 average) rbac done in 2.34/2.43/2.49 sec. (2.4 average) Lines in SQL logfile: 641

PR changes/Pod and DB in same datacenter

Number of VMs 26 options done in 3.33/3.14/3.13 sec. (3.2 average) rbac done in 2.16/2.20/2.30 sec. (2.2.average) Lines in SQL logfile: 329

Baseline/Pod and DB in different datacenter

Number of VMs 26 options done in 3.26/3.14/3.13 sec. (3.2 average) rbac done in 3.67/3.31/3.55 sec. (3.5 average)

PR changes/Pod and DB in different datacenter

Number of VMs 26 options done in 3.19/3.04/3.01 sec. (3.2 average) rbac done in 2.41/2.89/3.01 sec. (2.8 average)

Conclusion

The performance gain of the changes are less (10% same datacenter/20% different datacenter) that the first attempt, but now this working correctly. Also the impact of the database latency can be seen and is less a factor on a higher latency,

gubbe505 commented 3 years ago

Test is on a larger Tenant

Baseline/Pod and DB in different datacenter, large tenant

Number of VMs options done in 3.26/3.18/3.11 sec. (3.2 average) rbac done in 3.55/3.42/3.58 sec. (3.5 average)

PR changes/Pod and DB in different datacenter, large tenant

Number of VMs 475 options done in 3.06/3.29/3.26 sec. (3.2 average) rbac done in 2.88/2.78/2.88 sec. (2.8 average)

Conclusion

No bigger difference when filtering a larger number of VM's using the same number of belong to filters.

Fryguy commented 3 years ago

@gubbe505 When you have a chance can you give #21258 a try - this is a really simple refactoring that might have a huge impact, but maybe not on the pages you are looking at, so click all around.

miq-bot commented 1 year ago

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

miq-bot commented 1 year ago

This issue has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this issue if this issue is still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

ManageIQ / manageiq