gerbyzation opened this issue 4 years ago

We've been growing our test data set to about 2.5k rules in the policy set, which is making calls to the enforcer quite slow. Is there something that can be done to speed things up, or did I just pick a terrible model for performance?

model:

example policy:
Hi @gerbyzation, I want to know what kinds of rules account for most of the 2.5k: how many `p` rules and how many `g` rules? I saw a lot of:
g,act.io/user:cde7860e-5459-413c-969a-9536edbb1980,act.io/org:acmecorp/campaign:1/group:admins
g,act.io/user:3ae34b92-a0f8-4b3f-a7bd-877c9ba303a4,act.io/org:acmecorp/campaign:1/group:members
So each user ID will appear as one `g` rule, right? But `g` rules are loaded into memory as an RBAC inheritance tree, so they won't be slow.
I also see bunches of rules like this:
g,act.io/org:acmecorp/campaign:1/group:admins,act.io/org:acmecorp/campaign:1/group:members
p,act.io/org:acmecorp/campaign:1/group:admins,act.io/org:acmecorp/campaign:1/info,edit
p,act.io/org:acmecorp/campaign:1/group:admins,act.io/org:acmecorp/campaign:1/group:members,manage
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1,view
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/submission:*,approve
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,create
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,read
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,update
Will this same rule set be duplicated for campaigns 1, 2, 3, ... up to 100000? If so, why not use a single "pattern" rule set to replace them all? You would end up with a very small set of rules.
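For example, a single wildcard rule along these lines (a hypothetical sketch, assuming the matcher applies a pattern function such as `keyMatch2` to `p.sub` and `p.obj` instead of exact matching; not a rule from the actual data set) could stand in for all the per-campaign copies:

p,act.io/org:*/campaign:*/group:members,act.io/org:*/campaign:*/version:*,read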
@hsluoyz at the moment we have ~2.5k `p` rules and ~1k `g` rules.
Yes, we'd create this kind of rule set for each org and project/campaign that is created. Our plan was to use granular rules so that permissions can be configured for a specific org, project, or team.
Thanks for taking a look!
Will `campaign:1` people be able to access `campaign:2`, or even another org's resources (aka cross-domain access)?
@hsluoyz yes, when given permission of course. We want people to have a single account that can have access to various organizations and/or campaigns.
First, you can merge these into one:
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,create
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,read
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,update
into:
p,act.io/org:acmecorp/campaign:1/group:members,act.io/org:acmecorp/campaign:1/version:*,(create)|(read)|(update)
Of course, the model needs to be updated.
@hsluoyz OK, so that would change the matcher to `m = g(r.sub, p.sub) && keyMatch(r.obj, p.obj) && regexMatch(r.act, p.act)`. How does `regexMatch` compare to `keyMatch` in terms of efficiency?
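For reference, both functions are exposed as plain helpers in `casbin.util`, so their relative cost can be measured directly. A minimal micro-benchmark sketch (hypothetical, not from the thread): `key_match` is simple string/prefix handling, while `regex_match` goes through Python's `re` engine, so it is typically the slower of the two.

import timeit

from casbin.util import key_match, regex_match

obj = "act.io/org:acmecorp/campaign:1/version:42"
pattern = "act.io/org:acmecorp/campaign:1/version:*"

# Time key_match (plain string comparison with a trailing wildcard) ...
print(timeit.timeit(lambda: key_match(obj, pattern), number=100_000))
# ... against regex_match (regular-expression matching on the action).
print(timeit.timeit(lambda: regex_match("read", "(create)|(read)|(update)"), number=100_000))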
So I'll toss my findings in here rather than creating a new ticket. Python is not Go, and performance is approaching 50x worse than the published benchmarks. Just a simple sample implementation:
import time
from pathlib import Path

from casbin import Enforcer

root_dir = Path(__file__).parent
enforcer: Enforcer = Enforcer(f"{str(root_dir)}/rbac_model.conf")

for x in range(11000):
    enforcer.add_policy(f"/user{x}", "/obj", "read")

# Worst case: the subject matches no rule, so every policy is scanned.
s = time.time()
a = enforcer.enforce("/user", "obj", "read")
print(a, (time.time() - s) * 1000)
Output is `False 62.0729923248291`; according to the benchmarks it should be ~2 ms. Now, Python is not Go, so I am not complaining that the code is bad; this is just a fact of our chosen language. I have PRs open to implement filtered policy support, which is a massive win here: my rule set is 50k, so being able to cut it down to a few hundred or a few thousand rules for certain operations is a major plus. See #65; I will add the sqlalchemy adapter support once that is approved.
You should use filtered policies independently of what I describe below, as loading millions of entries takes a long time and there isn't much you can do about that cost.
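To make that concrete, a minimal sketch of filtered loading with pycasbin's filtered file adapter (the file and path names here are placeholders):

import casbin
from casbin.persist.adapters.filtered_adapter import FilteredAdapter, Filter

# The filtered adapter only loads policy lines that pass the filter.
adapter = FilteredAdapter("policy.csv")
e = casbin.Enforcer("rbac_model.conf", adapter)

# Keep only p rules whose first field (sub) equals this group; an empty
# string in a filter slot matches any value for that field.
f = Filter()
f.P = ["act.io/org:acmecorp/campaign:1/group:members"]
e.load_filtered_policy(f)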
The root cause of this performance issue is: https://github.com/casbin/pycasbin/blob/88bcf96eb0586acd5a2cf3d3bd22a7802a0bfb27/casbin/core_enforcer.py#L238

It is an O(n) operation, and as the number of rules scales, performance degrades; there is no real way around this. It comes from how generalized Casbin is: it presents itself as a standard subject/object/action rule system, but it doesn't use any of that knowledge in its engine. You could completely disregard that convention and enforce on a single string comparison if you wanted.
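Conceptually, that line boils down to something like this minimal sketch (not the actual pycasbin code, which also evaluates policy effects):

def enforce_naive(request, policies, matcher):
    # Evaluate the matcher expression against every stored p rule in turn:
    # O(n) in the total number of rules, regardless of their structure.
    for policy in policies:
        if matcher(request, policy):
            return True
    return False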
For me, performance means removing generalization. If you can make assumptions, you can find ways to work some magic. For me, that magic was assuming a standard subject/object/action base rule system: you can add whatever additional fields on the end, but having those three first is what matters.
So this is the request definition I am working with:
[request_definition]
r = sub, obj, act
The important bit now is: how can I shrink `n` so that enforcement stays small and fast? The way I did this was by tackling the model storage. By default the `model` object stores the rules in a list, and each rule is itself a list. But there is nothing stopping us from injecting our own storage engine, so I have made my own model class and my own storage engine for the rules:
from typing import Dict, Optional, Sequence, Set


class FilteredPolicy:
    _cache: Dict[str, Dict[str, Set[Sequence[str]]]]
    _current_filter: Optional[Set[Sequence[str]]]

    def __init__(self):
        self._cache = {}
        self._current_filter = None

    def __iter__(self):
        yield from self.__get_policy()

    def __len__(self):
        return len(list(self.__get_policy()))

    def __contains__(self, item):
        # Rules are bucketed by their second and third fields (obj, act).
        k1 = item[1]
        k2 = item[2]
        if k1 not in self._cache:
            return False
        if k2 not in self._cache[k1]:
            return False
        return tuple(item) in self._cache[k1][k2]

    def append(self, item: Sequence[str]) -> None:
        k1 = item[1]
        k2 = item[2]
        if k1 not in self._cache:
            self._cache[k1] = {}
        if k2 not in self._cache[k1]:
            self._cache[k1][k2] = set()
        self._cache[k1][k2].add(tuple(item))

    def remove(self, policy: Sequence[str]) -> bool:
        k1 = policy[1]
        k2 = policy[2]
        if k1 not in self._cache:
            return True
        if k2 not in self._cache[k1]:
            return True
        # discard() instead of remove() so deleting an absent rule is a
        # no-op rather than a KeyError.
        self._cache[k1][k2].discard(tuple(policy))
        return True

    def __get_policy(self):
        if self._current_filter is not None:
            return (list(x) for x in self._current_filter)
        else:
            return (
                list(v2)
                for v in self._cache.values()
                for v1 in v.values()
                for v2 in v1
            )

    def apply_filter(self, k1, k2):
        # Restrict iteration to the single (obj, act) bucket for this request.
        if k1 not in self._cache:
            self._current_filter = set()
        elif k2 not in self._cache[k1]:
            self._current_filter = set()
        else:
            self._current_filter = self._cache[k1][k2]

    def clear_filter(self) -> None:
        self._current_filter = None
from casbin.model import Model


class FastModel(Model):
    def add_def(self, sec, key, value):
        super().add_def(sec, key, value)
        # Replace the default list storage for p rules with the
        # dictionary-backed FilteredPolicy above.
        if sec == "p" and key == "p":
            self.model[sec][key].policy = FilteredPolicy()
This is where the assumption about the request format matters: I purposely exploit the predefined field order to store my model in layered dictionaries. I also mimic the `list` protocol, so this acts as a drop-in replacement without changing any other code in the code base. On the surface this does nothing until it is combined with one last modification to the enforcer:
class FastEnforcer(Enforcer):
    @staticmethod
    def new_model(path: str = "", text: str = "") -> FastModel:
        """Creates a FastModel instead of the default Model."""
        m = FastModel()
        if len(path) > 0:
            m.load_model(path)
        else:
            m.load_model_from_text(text)
        return m

    def enforce(self, *rvals: str) -> bool:
        # Narrow the p-rule storage to the (obj, act) bucket for this
        # request, run normal enforcement, then restore the full view.
        self.model.model["p"]["p"].policy.apply_filter(rvals[1], rvals[2])
        result = super().enforce(*rvals)
        self.model.model["p"]["p"].policy.clear_filter()
        return result
Putting this all together, I have now shrunk `n` down to only the rules that match both the object and the action; the number of subjects with individual rules matching that pair is the size of `n`. It no longer matters how big my global rule set is, only how many rules match those two unique identifiers. This removes any need to condense rule sets with regular expressions to keep them small. In fact, condensing rule sets is now bad for performance, because regular expressions are slow compared to plain string comparisons.
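To make the wiring concrete, a minimal usage sketch (the model path is a placeholder): `FastEnforcer` builds a `FastModel`, which swaps in `FilteredPolicy` for the p-rule storage.

enforcer = FastEnforcer("rbac_model.conf")
enforcer.add_policy("/user1", "/obj1", "read")

# enforce() applies apply_filter("/obj1", "read") first, so only rules in
# that (obj, act) bucket are scanned instead of the whole rule set.
print(enforcer.enforce("/user1", "/obj1", "read"))  # True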
If I modify my initial experiment above to use 10 million rows, you can see the massive performance gain:
for x in range(100):
    for y in range(100000):
        enforcer.add_policy(f"/user{x}", f"/obj{y}", "read")

s = time.time()
# This is the absolute worst case (the last entry): it would otherwise
# require iterating 10M rows and be very slow.
a = enforcer.enforce("/user99", "/obj99999", "read")
print(a, (time.time() - s) * 1000)
Output: True 0.8349418640136719
So what are the side effects? Speed, speed everywhere. Adds are significantly sped up on large rule sets, deletes get the same treatment, and so do enforcements. You are tied to your model's request format, but honestly you are tied to that format anyway by the need to persist the model. If your model requires additional layers of caching, or your request order does not match the order assumed here, the cache mechanism will blow up.
This is a rather simple fix: if you adjust the `FilteredPolicy` to meet your needs, you can drop it in as a replacement for your current model storage. We have not touched any of the internals of how Casbin works, so any other extensions or plugins someone generates one day should continue to function as well.
Now, I put all of this here hoping it helps someone improve their performance, but also as a question to the devs: is something like this possible to merge into the main package, or does it need to be an external package? I am happy to create a new package that extends the current casbin implementation like this, but I figured it would be best to get feedback here first. I have a bunch of other management improvements that I would likely ship in an external package, but I would rather people get as much benefit as possible without having to find my random package.
I have taken what I created here, generalized it a whole lot, and wrapped it up into a package:

https://pypi.org/project/fastbin/

I will be porting my current production usage from the code above to this package and hope to roll out a v1 stable release next week. The sample code has been running stable for a few months, so there shouldn't be too many issues to sort out with the package.