cool!
cc @NickLaMuro @kbrock cause of performance.
not based on anything but reading the code, I still think this is the ultimate problem because it walks the tree in a super-inefficient way (re-walking the tree multiple times in a loop).
Additionally, that's just for one tag. Typically there are a lot of tags, so that method is called in another loop, once per tag. For example here, here, and here.
I'm not exactly sure how to correct this though, since the belongsto is based on name and not id, and we don't store the name in the relationships table. At a minimum we could probably find a way to do all of the tags at once so we don't walk the tree multiple times.
also, the hierarchical relationship is specified as a string that is parsed and then searched for, instead of as actual SQL relationships. It has always been my intent to fix that - but I'm not sure whether the slow queries are caused by that, by the tag lookups, or by both.
there aren't really any literal Tag instances here as far as I can tell...they are just named "tag" in the code (in previous iterations I think they actually were tags).
also the hierarchical relationship is specified as a string that is parsed then searched for instead of actual SQL relationships.
Correct. Some other example belongsto strings: https://github.com/evertmulder/miq_rbac_log/blob/29b9f4f89b18bf9856abfe06ef060fc959307754/user_entitlement.txt#L11-L35
The code I referenced (and the outer loop outside this code) roughly does the following pseudo-code:

belongsto.each do |str|
  parts = parse(str)
  # find the root object from the first part
  obj = find_in_db(parts.shift)
  # walk the tree with the rest of the parts
  parts.each do |part|
    obj = obj.children.detect { |child| child.name == part }
  end
end
So for each belongsto string, it parses it, finds the root object, then walks the tree step by step. Tree walking in this way is incredibly expensive because it has to go from the object, to the relationships table, down to the child relationship records, then load the objects from the actual tables, then check the objects by name. That's roughly 6 queries per level in the string hierarchy. Then it does it all again for the next belongsto string.
So some things I think we can do are:
- Preload the entire relationship tree in a single query, then walk it in memory (this might incur a lot of memory with a huge hierarchy, though).
- Load only the child relationship records (i.e. use .child_rels instead of .children), but don't resolve them to the actual objects; filter that list before resolving.
- Pull back only the names of the records, not the entire objects.
- Find a way to "mix" belongsto filters that are similar so they can be processed in one pass. For example:
"/belongsto/ExtManagementSystem|VMWare_VC7",
"/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX",
"/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Templates"
All of these reference the same VC, and even share a huge part of the same folder structure. If we could walk the tree and apply the filters simultaneously, we'd remove a lot of redundant tree walking.
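As a sketch of that last idea (hypothetical code, not from the ManageIQ codebase): parse all of a user's belongsto strings up front into a nested-hash trie, so shared path prefixes collapse into a single node and each shared prefix only has to be walked once.

```ruby
# Hypothetical sketch: merge belongsto strings into a trie so that shared
# prefixes (same VC, same folder chain) are represented, and later walked,
# only once. All names here are illustrative.
belongsto_strings = [
  "/belongsto/ExtManagementSystem|VMWare_VC7",
  "/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX",
  "/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|CLUSTER5/EmsFolder|vm/EmsFolder|Templates"
]

# Build a nested Hash keyed by "Klass|name" path parts
trie = {}
belongsto_strings.each do |str|
  parts = str.delete_prefix("/belongsto/").split("/")
  parts.reduce(trie) { |node, part| node[part] ||= {} }
end

# All three filters now share a single root node for the VC, and the two
# longer filters share the Datacenters/CLUSTER5/vm prefix.
```

A single depth-first walk of this trie against the real tree would then apply all three filters in one pass instead of three.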
Medium-term, change how belongsto filters are persisted: store them by id instead of by name. Having them by id would mean we wouldn't have to resolve the relationship record to an actual object at all, and all filtering could be done in the relationship records themselves. I'm not sure whether this would cause problems in the future if record ids change, though.
I think that these belongsto strings should be converted to polymorphic relationships in a SQL table. I have never understood why we store them by name via a hacky string-parsing mechanism, only to later parse them and look them up via SQL queries. Maybe @gtanzillo remembers why they were implemented that way in the first place, but my recollection is that tags were implemented first via the string mechanism, and then the hierarchy stuff was added as a feature leveraging the same technique for expediency.
Yeah, I don't understand why the full tree path is stored as opposed to just a reference to the leaf node. This is where my personal understanding of this feature starts to break down.
If we only have the leaf node, we can always reconstruct the path to the root in a much more efficient way (see object2belongsto method) if that full path is needed.
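As a minimal sketch of that reconstruction (a hypothetical Struct-based stand-in, not the real object2belongsto or the Relationship model): walk parents from the leaf upward and join the parts.

```ruby
# Hypothetical stand-in for tree nodes; the real code would walk
# Relationship records (see object2belongsto) instead.
Node = Struct.new(:klass, :name, :parent)

# Rebuild the belongsto path string from a leaf node by walking upward.
def path_to_root(node)
  parts = []
  while node
    parts.unshift("#{node.klass}|#{node.name}")
    node = node.parent
  end
  "/belongsto/" + parts.join("/")
end

ems    = Node.new("ExtManagementSystem", "VMWare_VC7", nil)
dc     = Node.new("EmsFolder", "Datacenters", ems)
folder = Node.new("EmsFolder", "vm", dc)

path_to_root(folder)
# => "/belongsto/ExtManagementSystem|VMWare_VC7/EmsFolder|Datacenters/EmsFolder|vm"
```

Walking upward is one parent lookup per level, with no name-based searching among siblings, which is why it's so much cheaper than the downward walk.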
I think I figured out a quick PR that might just provide a lot of benefit...putting it together now.
@Fryguy sorry, I tested your PR, but it isn't working. We have EmsFolders with the same name on multiple VMware providers, and using find_object_by_name will return a random EmsFolder with that name, possibly one belonging to a different provider.
A few examples.
{"@timestamp":"2021-05-28T19:43:53.888081 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|NSX-T VC Rotterdam/EmsFolder|Datacenters/EmsFolder|CPXVC-SDN101/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX\" parts=[\"ExtManagementSystem|NSX-T VC Rotterdam\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-SDN101\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"]"}
{"@timestamp":"2021-05-28T19:43:53.912502 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|NSX-T VC Rotterdam/EmsFolder|Datacenters/EmsFolder|CPXVC-SDN101/EmsFolder|vm/EmsFolder|Templates\" parts=[\"ExtManagementSystem|NSX-T VC Rotterdam\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-SDN101\", \"EmsFolder|vm\", \"EmsFolder|Templates\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Templates\"]"}
{"@timestamp":"2021-05-28T19:43:54.476128 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|Oracle/EmsFolder|Datacenters/EmsFolder|CPXVC-VN105a/EmsFolder|vm/EmsFolder|Tenants/EmsFolder|VNX\" parts=[\"ExtManagementSystem|Oracle\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN105a\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Tenants\", \"EmsFolder|VNX\"]"}
{"@timestamp":"2021-05-28T19:43:54.507692 ","hostname":"manageiq-1","pid":92993,"tid":"892600","level":"warning","message":"MIQ(MiqFilter.belongsto2object_list) lookup failed for tag=\"/belongsto/ExtManagementSystem|Oracle/EmsFolder|Datacenters/EmsFolder|CPXVC-VN105a/EmsFolder|vm/EmsFolder|Templates\" parts=[\"ExtManagementSystem|Oracle\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN105a\", \"EmsFolder|vm\", \"EmsFolder|Templates\"] result_parts=[\"ExtManagementSystem|VMWare_Arnhem\", \"EmsFolder|Datacenters\", \"EmsFolder|CPXVC-VN602A\", \"EmsFolder|vm\", \"EmsFolder|Templates\"]"}
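A tiny illustration of the ambiguity (a hypothetical Struct stand-in, not the real find_object_by_name): with duplicate folder names across providers, a name-only lookup is order-dependent, while scoping the lookup to the parent currently being walked disambiguates it.

```ruby
# Hypothetical stand-in: two "VNX" folders under different providers.
Folder = Struct.new(:name, :parent)

vc_rotterdam  = Folder.new("NSX-T VC Rotterdam", nil)
vc_arnhem     = Folder.new("VMWare_Arnhem", nil)
vnx_rotterdam = Folder.new("VNX", vc_rotterdam)
vnx_arnhem    = Folder.new("VNX", vc_arnhem)
all_folders   = [vnx_arnhem, vnx_rotterdam]

# Name-only lookup: returns whichever matching folder happens to come
# first -- here, the one under the wrong provider.
by_name_only = all_folders.detect { |f| f.name == "VNX" }

# Lookup scoped to the parent we are currently walking under:
scoped = all_folders.detect { |f| f.name == "VNX" && f.parent.equal?(vc_rotterdam) }
```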
Next to that, the difference isn't as great as expected: User apivnx. options done in 3.43 sec. rbac done in 7.09 sec. (vs 7.39 sec)
Thanks @gubbe505. Yeah, I was concerned about duplicate names, which is why it's still WIP... I'm working on that next.
@gubbe505 what does "options done in" and "rbac done in" refer to? What lines did you bookend to count that?
@Fryguy See the first post; there is a script (test_rbac.rb) and 2 log files. The options done is the time of the Vm.all query, and the rbac done is the time of RBAC filtering of the result of Vm.all.
oh i see now - it's in the tool itself (I was looking in the code): https://github.com/evertmulder/miq_rbac_log/blob/main/test_rbac.rb#L8-L16
The options done is the time of the query Vm.all, and the rbac done is the time of RBAC filtering of the result of Vm.all.
That's not actually accurate. Vm.all there is a no-op because it just returns a relation (Rails queries are lazy) and isn't actually resolved until later, inside the rbac timing section. I think the time penalty you are paying in options is actually loading the Vm constant itself (because the code is lazy-loaded), and I think you are paying a further penalty in the rbac section because it lazy-loads all of the descendant classes.
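The relation laziness can be mimicked with plain Ruby's Enumerator::Lazy (an analogy only, no ActiveRecord involved): building the pipeline costs nothing; the work happens only when the result is forced.

```ruby
# Analogy only: Enumerator::Lazy behaves like an ActiveRecord::Relation in
# that building the chain does no work until the result is forced.
work_done = 0

# Building the lazy pipeline executes nothing yet -- like Vm.all returning
# a relation without hitting the database.
pipeline = (1..Float::INFINITY).lazy.map { |i| work_done += 1; i * 2 }

# Forcing the result (like calling .to_a on a relation) runs the work,
# and only as much of it as is needed.
result = pipeline.first(3)
# result == [2, 4, 6]; work_done == 3
```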
Compare these two things:
> VmOrTemplate.descendants # Preload the VmOrTemplate class and its descendant classes into memory
> Benchmark.realtime { Vm.all; nil } # The ; nil prevents irb from executing the query itself
=> 0.06779100000858307
> Benchmark.realtime { Vm.all.to_a; nil } # The to_a is forcing the execution
=> 1.2748610000126064
Can we change your tool slightly? Try this instead:
+ # Warm up the class names
+ Relationship.pluck(:resource_type).uniq.each { |c| c.constantize.descendants }
+
  options = {}
  time_options = Benchmark.realtime {
    options = {
-     :targets => Vm.all,
+     :targets => Vm.all.to_a,
      :user => user,
      :skip_counts => true,
      :offset => 0,
      :limit => 1000
    }
  }
I have a feeling after that the numbers will shift around.
New baseline tests (3x) options done in 3.31/3.30/3.25 sec (3.3 average) rbac done in 3.07/4.23/3.46 sec. (3.5 average)
New PR tests (3x) options done in 3.10/3.51/3.11 sec. (3.3 average) rbac done in 2.36/2.18/2.54 sec. (2.4 average)
Still the same warnings because of find_object_by_name, but an improvement of about 30%. Sounds like a good start!
I updated the PR with, I hope, handling for duplicate names as well as ensuring the object is under the correct EMS. It will possibly introduce 2 more queries, but they are smaller ones.
@Fryguy, @kbrock, @chessbyte, @evertmulder New tests after update of PR. This time around I included tests on different datacenters to see the impact of database latency.
Number of VMs 26 options done in 3.25/3.00/3.01 sec. (3.1 average) rbac done in 2.34/2.43/2.49 sec. (2.4 average) Lines in SQL logfile: 641
Number of VMs 26 options done in 3.33/3.14/3.13 sec. (3.2 average) rbac done in 2.16/2.20/2.30 sec. (2.2 average) Lines in SQL logfile: 329
Number of VMs 26 options done in 3.26/3.14/3.13 sec. (3.2 average) rbac done in 3.67/3.31/3.55 sec. (3.5 average)
Number of VMs 26 options done in 3.19/3.04/3.01 sec. (3.2 average) rbac done in 2.41/2.89/3.01 sec. (2.8 average)
The performance gain of the changes is smaller (10% same datacenter / 20% different datacenter) than the first attempt, but now it is working correctly. Also, the impact of database latency can be seen, and it is less of a factor at higher latency.
Test is on a larger Tenant
Number of VMs options done in 3.26/3.18/3.11 sec. (3.2 average) rbac done in 3.55/3.42/3.58 sec. (3.5 average)
Number of VMs 475 options done in 3.06/3.29/3.26 sec. (3.2 average) rbac done in 2.88/2.78/2.88 sec. (2.8 average)
No big difference when filtering a larger number of VMs using the same number of belongsto filters.
@gubbe505 When you have a chance, can you give #21258 a try? This is a really simple refactoring that might have a huge impact, but maybe not on the pages you are looking at, so click all around.
This issue has been automatically marked as stale because it has not been updated for at least 3 months.
If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.
This issue has been automatically closed because it has not been updated for at least 3 months.
Feel free to reopen this issue if it is still valid.
Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.
Hi,
we have a good-sized ManageIQ environment and are using tenants and groups for our RBAC. We noticed significant performance differences between using ManageIQ as an admin user and as a regular user. This led us to take a closer look at whether the rbac implementation could be the cause of the slowdown.
We had a call with @agrare @Fryguy @kbrock @gubbe505 and myself to discuss whether we can find the root cause for this, to do some tests in our environment to get some numbers on the actual time spent in rbac, and to deliver SQL logs.
To get some test data I created a little test tool to measure the impact of rbac on listing VMs. You can find the tool here: https://github.com/evertmulder/miq_rbac_log/blob/main/test_rbac.rb
The resulting log (including sql logs) can also be found in this repo: https://github.com/evertmulder/miq_rbac_log/blob/main/log_ivanchuk4_admin.txt --> Run tool as admin https://github.com/evertmulder/miq_rbac_log/blob/main/log_ivanchuk4_user1.txt --> Run tool as user
Note the test user has entitlements (belongsto_filters) as shown here: https://github.com/evertmulder/miq_rbac_log/blob/main/user_entitlement.txt
As can be seen in the logs, rbac does an additional 300+ SQL calls.
For this test I am using ivanchuk-4.
We are more than happy to provide more logs or try out things to improve on this.