TireSwingSoftware / openassign-server

OpenAssign server intended for use by a separate client via RPC
BSD 3-Clause "New" or "Revised" License

Implement caching for OrgRole checks #135

Closed jc0n closed 12 years ago

jc0n commented 12 years ago

I understand that performance is not a top priority; however, I believe there is a significant advantage in doing this before the remaining authorizer roles (which depend on these checks) are implemented. I also don't consider this premature optimization, since the implications are well known (and described below).

There are multiple roles which all utilize the same authorizer orgrole checks. It would be wise to implement a specific caching mechanism for orgrole check methods for several reasons.

I did some profiling when I wrote the checks initially and wasn't pleased with the performance, despite each check being very lightweight. The checks were sufficient for one or two roles, but with the addition of more roles I think they are going to become cumbersome, as nearly every check requires at least one database query. With multiple roles, there will be multiple checks per object per operation for every user (including those without the roles).

Another problem that I did not anticipate (with the Organization Admin and Owner Manager roles) was the requirement to recognize the ancestry relationship for an organization. In the case of #132 and #133:

"the scope is limited to an organization and all child orgs".

If the orgrole checks are changed to check the children as well, this will introduce at least 1 additional query per check, and potentially N-1 additional queries where N is the depth of the Organization hierarchy. I believe this use case further supports the need for caching.

This would be relatively easy to implement without introducing any significant overhead in terms of memory. The caching would be able to achieve a significant reduction in database activity from these methods. Nearly all of the methods will be able to work solely based on the cache[1]. The extra complexity will be encapsulated in the orgrole checks suite (not the authorizer itself). Django signals can be used to keep the cache coherent when UserOrgRoles are changed.

My proposal for the cache data structure is a 2-dimensional dict which stores a set of organization ids:

ORGROLE_CACHE[role_name][user_id] = set(organization_ids) # includes children organizations

[1] Assuming checked attributes of the actee (such as the organization_id) are already present in the object.
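To make the intended access pattern concrete, here is a minimal sketch of how a check could consult this structure (the actor_has_orgrole helper and its signature are hypothetical, not existing code):

```python
from collections import defaultdict

# ORGROLE_CACHE[role_name][user_id] = set(organization_ids),
# including child organizations, as proposed above.
ORGROLE_CACHE = defaultdict(lambda: defaultdict(set))

def actor_has_orgrole(role_name, user_id, actee_org_ids):
    """Hypothetical check helper: passes if the actor holds role_name in
    an organization (or an ancestor of one) covering any actee org."""
    candidate_org_ids = ORGROLE_CACHE[role_name][user_id]
    # A non-empty intersection means the check passes; for a single
    # actee org this reduces to a plain set membership test.
    return bool(candidate_org_ids & set(actee_org_ids))
```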

mhrivnak commented 12 years ago

It's important to distinguish that we are potentially caching two different things here: a user's actual org memberships, and a user's set of child orgs for any given role.

It may be advantageous to cache these on the cached auth token object. This would allow us to keep the cache current when users are added and removed from organizations. This would require:

It then would be easy to extend this to other authorizer checks, such as actor_member_of_group

As for the cache structure itself, perhaps something like this: Cache the actual UserOrgRole instances for the user. Separately, cache a map of hierarchical membership decisions per-role. For example, auth_token._cache.HAS_ROLE_IN_PARENT_ORG[role_id][org_id] = True

mhrivnak commented 12 years ago

As for actually determining if org A is a parent of org B, there is at least one elegant but heavy-handed way of doing so: http://django-mptt.github.com/django-mptt/

However, the downfall of that approach is update performance. Particularly when importing lots of orgs from CSV, django-mptt could thrash the database while rebuilding the tree data on every insert.

My understanding is that this could be a drop-in replacement that would not change the model's current API, but would add some nice features. It could be worth a touch of testing to see what the performance impact is of creating a few hundred records rapid-fire. If it's not bad, we should probably use it.
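For reference, a rough sketch of what that drop-in might look like, assuming the existing Organization model already has a self-referencing parent field (the names here are illustrative, not the project's actual model):

```python
from django.db import models
from mptt.models import MPTTModel, TreeForeignKey

class Organization(MPTTModel):
    name = models.CharField(max_length=255)
    # django-mptt maintains the tree metadata (lft/rght/level) behind
    # this self-referencing relationship.
    parent = TreeForeignKey('self', null=True, blank=True,
                            related_name='children',
                            on_delete=models.CASCADE)

# Ancestry questions then become simple comparisons on the stored
# tree values instead of one query per level, e.g.:
#   org_b.is_descendant_of(org_a)
#   org_a.get_descendants(include_self=True)
```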

mhrivnak commented 12 years ago

Discussion of django-mptt's bulk update performance issue: https://github.com/django-mptt/django-mptt/issues/69

jc0n commented 12 years ago

It's important to distinguish that we are potentially caching two different things here: a user's actual org memberships, and a user's set of child orgs for any given role.

If these do need to be separate (maybe because not all callers (roles) of the orgrole checks want this functionality), I would propose the same data structure as above, with two sets instead of one: the first would be a set of direct organization_ids and the second a set of all descendant organizations. The checks could then be parameterized to determine which organization_ids are checked.
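A small sketch of that variant (the helper name and the include_descendants flag are illustrative):

```python
# ORGROLE_CACHE[role_name][user_id] = (direct_org_ids, descendant_org_ids)
ORGROLE_CACHE = {}

def actor_has_orgrole(role_name, user_id, actee_org_ids,
                      include_descendants=True):
    """Hypothetical parameterized check over the two-set structure."""
    direct, descendants = ORGROLE_CACHE.get(role_name, {}).get(
        user_id, (set(), set()))
    candidates = direct | descendants if include_descendants else direct
    return bool(candidates & set(actee_org_ids))
```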

It may be advantageous to cache these on the cached auth token object. This would allow us to keep the cache current when users are added and removed from organizations.

The cache coherency does not depend on the auth token. We can achieve the same by watching signals on UserOrgRole objects and adding/removing the organization from the ORGROLE_CACHE[org_role_name][user_id] set. The performance gain is achieved because this is also a highly specific cache for the UserOrgRole checks. The point was to streamline the access pattern for those checks.
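A sketch of that coherence mechanism, assuming the single-set ORGROLE_CACHE structure above and a UserOrgRole model whose relations expose the role name, the user, and the organization (the field names and import path below are assumptions about the schema):

```python
from django.db.models.signals import post_save, post_delete
from django.dispatch import receiver

# UserOrgRole is the project's model; its import path and the field
# names used here (role, owner_id, organization_id) are assumptions.

@receiver(post_save, sender=UserOrgRole)
def _orgrole_cache_add(sender, instance, **kwargs):
    # Keep the actor's cached set of organization ids current when a
    # role assignment is created or modified.
    ORGROLE_CACHE[instance.role.name][instance.owner_id].add(
        instance.organization_id)

@receiver(post_delete, sender=UserOrgRole)
def _orgrole_cache_remove(sender, instance, **kwargs):
    ORGROLE_CACHE[instance.role.name][instance.owner_id].discard(
        instance.organization_id)
```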

It then would be easy to extend this to other authorizer checks, such as actor_member_of_group

You may be confusing Group membership with OrgRole/UserOrgRole membership. That said, caching groups is of similar benefit; actor_member_of_group is one of the more common checks.

As for the cache structure itself, perhaps something like this: Cache the actual UserOrgRole instances for the user. Separately, cache a map of hierarchical membership decisions per-role. For example, auth_token._cache.HAS_ROLE_IN_PARENT_ORG[role_id][org_id] = True

The problem here is that it does not optimize the most common access pattern of the checks. It would require a lookup of the role id given the name (which happens in every check), and it doesn't account for the children orgs (which may be the larger problem).

Each OrgRole check function starts with a role_name, an actor, and one or more organizations from the actee that are being verified. The reason I think my proposed data structure is optimal is that, with the role_name and the (actor) auth_token.user_id, you can immediately get a set of candidate organization ids, all of which the user belongs to (directly or through an ancestor) with role role_name. You can then intersect the two sets of organization ids, and if the result is a non-empty subset the check passes. In the case of checking just one organization id, it is just a set membership test.

The data structure itself is also very lightweight, since it only stores a set of integers and not the full-blown OrgRole/UserOrgRole objects, which carry many other fields that are of no interest here.

As for actually determining if org A is a parent of org B, there is at least one elegant but heavy-handed way of doing so: http://django-mptt.github.com/django-mptt/

I have worked with the MPTT algorithm (though not django-mptt) several times before, and I think it is overkill for this particular problem. It is well suited to a hierarchical structure such as a forum with sub-forums and topics where special ordering is required. Unless there are other significant use cases which justify the extra overhead, I don't think it brings much here.

mhrivnak commented 12 years ago

On Sun, Mar 18, 2012 at 2:16 PM, John O'Connor reply@reply.github.com wrote:

It's important to distinguish that we are potentially caching two different things here: a user's actual org memberships, and a user's set of child orgs for any given role.

If these do need to be separate (maybe because not all callers (roles) of the orgrole checks want this functionality), I would propose the same data structure as above, with two sets instead of one: the first would be a set of direct organization_ids and the second a set of all descendant organizations. The checks could then be parameterized to determine which organization_ids are checked.

We might not need both. I was just trying to say that it's very important to be clear about which we are storing.

It may be advantageous to cache these on the cached auth token object. This would allow us to keep the cache current when users are added and removed from organizations.

The cache coherency does not depend on the auth token. We can achieve the same by watching signals on UserOrgRole objects and adding/removing the organization from the ORGROLE_CACHE[org_role_name][user_id] set. The performance gain is achieved because this is also a highly specific cache for the UserOrgRole checks. The point was to streamline the access pattern for those checks.

I suppose I like the implicit expiration when caching a user's authorization-related data on their auth token. Have you thought about expiration from your cache model?

I am assuming (perhaps incorrectly) that it would take advantage of django's caching framework and thus be in a global cache independent of any particular process. With your data structure, I guess that would involve unpickling the whole thing for each use, which may not be desirable. Or perhaps you have some other way in mind of retrieving it once per request? The auth token, however, is already being retrieved from the cache once for each request, so it would be fairly efficient to cache authorization-decision-related data on it. There is also a conceptual harmony to caching authorization decisions on an auth token.

The alternative would be to have independent per-process caches, which would cause concurrency problems.

It then would be easy to extend this to other authorizer checks, such as actor_member_of_group

You may be confusing Group membership with OrgRole/UserOrgRole membership. That said, caching groups is of similar benefit; actor_member_of_group is one of the more common checks.

Nope, I did indeed mean that this approach would be useful for different types of checks, such as group membership.

As for the cache structure itself, perhaps something like this: Cache the actual UserOrgRole instances for the user. Separately, cache a map of hierarchical membership decisions per-role. For example, auth_token._cache.HAS_ROLE_IN_PARENT_ORG[role_id][org_id] = True

The problem here is that it does not optimize the most common access pattern of the checks. It would require a lookup of the role id given the name (which happens in every check), and it doesn't account for the children orgs (which may be the larger problem).

Suffice it to say, you can use any unique key that is appropriate. The point of this structure is this:

  1. You have an actor, an org, and a role.
  2. You answer the question "Does the actor have the role in the org or any parent org?", hopefully using the cached data structure described above.
  3. If there was a cache miss, you cache the answer in the data structure described above.

So I think this does optimize the most common access pattern.
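A sketch of that flow, treating auth_token._cache.HAS_ROLE_IN_PARENT_ORG as the per-token decision map and passing in whatever function actually answers the question on a miss (both are assumptions of this sketch, not existing code):

```python
def has_role_in_org_or_parent(auth_token, role_id, org_id, compute):
    """Answer from the cached decisions when possible; on a miss,
    compute the answer (e.g. by querying the org hierarchy) and
    memoize it on the auth token."""
    decisions = auth_token._cache.HAS_ROLE_IN_PARENT_ORG.setdefault(role_id, {})
    if org_id not in decisions:  # cache miss
        decisions[org_id] = compute(auth_token.user_id, role_id, org_id)
    return decisions[org_id]
```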

mhrivnak commented 12 years ago

Ugh, sorry the formatting looks horrible on that last message. I'm not sure why.

jc0n commented 12 years ago

the formatting looks horrible on that last message.

I think GitHub quoting is broken in general (as you'll see with the numbering from your previous post below) and inconsistent with some mail clients, e.g. Gmail. However, I usually like to write my comments directly on the tracker because I end up reading the issues here.

I suppose I like the implicit expiration when caching a user's authorization-related data on their auth token. Have you thought about expiration from your cache model?

I hadn't considered expiring the entries based on time because the data being cached is not time-sensitive; the only actions within the system that invalidate the cached data are the modification or deletion of a UserOrgRole. However, this could solve another problem I was thinking about, which is how to bound the cache size. In practice, though, this type of cache has a very small memory footprint, so it may not be worth bounding at all.

I am assuming (perhaps incorrectly) that it would take advantage of django's caching framework and thus be in a global cache independent of any particular process.

With regard to the Django cache, whether or not it is per-process depends on the cache backend. For an in-memory cache, which is what I think best serves this purpose, it is not global and is in fact per-process. I was also under the impression that our request model was one process with multiple threads, but correct me if I'm wrong.
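For reference, which behavior you get is just a matter of the configured backend; a sketch of the two settings in question:

```python
# settings.py -- per-process, in-memory cache (not shared):
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
    }
}

# vs. a shared cache that is independent of any particular process:
# CACHES = {
#     'default': {
#         'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
#         'LOCATION': '127.0.0.1:11211',
#     }
# }
```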

The auth token, however, is already being retrieved from the cache once for each request, so it would be fairly efficient to cache authorization-decision-related data on it. There is also a conceptual harmony to caching authorization decisions on an auth token.

I agree about the conceptual harmony. The one caveat is that the cache (or a subset thereof) would have to be rebuilt much more frequently (every time an auth_token is created), and there is a non-trivial cost associated with that. I was hoping to build it once at start-up and make minor mutations/additions when entries are invalidated, which is trivial.

Building the cache once was also one of the ways of avoiding the need to add complexity to the Organization model. There is a non-trivial cost in building the cache (even for just one user) because you have to construct the organization tree. I was hoping to do one query for UserOrgRole (inner join OrgRole) and one query for Organization, then, with the results, construct the org tree and, from that, the cache.
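A rough sketch of that build step; the field names used here (parent_id, role__name, owner_id, organization_id) are assumptions about the models, not the project's confirmed schema:

```python
from collections import defaultdict

def build_orgrole_cache():
    """Build ORGROLE_CACHE with two queries: one for organizations (to
    construct the org tree in memory) and one for role assignments."""
    cache = defaultdict(lambda: defaultdict(set))

    # Query 1: parent -> children map for all organizations.
    children = defaultdict(list)
    for org_id, parent_id in Organization.objects.values_list('id', 'parent_id'):
        children[parent_id].append(org_id)

    def with_descendants(org_id):
        # Walk the in-memory tree; no additional queries are issued.
        stack, found = [org_id], set()
        while stack:
            current = stack.pop()
            found.add(current)
            stack.extend(children[current])
        return found

    # Query 2: role assignments joined to the role for its name.
    rows = UserOrgRole.objects.values_list(
        'role__name', 'owner_id', 'organization_id')
    for role_name, user_id, org_id in rows:
        cache[role_name][user_id] |= with_descendants(org_id)

    return cache
```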

Nope, I did indeed mean that this approach would be useful for different types of checks, such as group membership.

I agree, but I was unclear about how this implementation would serve both purposes. Caching groups is a much different (and easier) problem; I think caching groups within the auth_token makes perfect sense. I'm not convinced it is optimal here.

Suffice it to say, you can use any unique key that is appropriate. The point of this structure is this:

Right.

  1. You have an actor, an org, and a role.

You also have one or more actee orgs.

  1. You answer the question "Does the actor have the role in the org or any parent org?", hopefully using the cached data structure described above.

It's a bit more complicated: it's whether or not the actor has the role in an org (or a parent of an org) to which the actee belongs, with the actee potentially having one or many orgs depending on its type.

As a side note, your structure would be better represented using a set, which is roughly what the dict -> True pattern is; it's just less efficient.

  1. If there was a cache miss, you cache the answer in the data structure described above.

That is one way of doing it. I had planned on building the cache once at the beginning and invalidating entries using the signals (on the infrequent occasions that UserOrgRoles change relative to other objects); otherwise it's completely static.