Question: Filtering collections based on permission AND also on external search params

henkosch commented 2 years ago

We have started implementing a permission system using spicedb, but we are tackling with the following theoretical question. Suppose we have documents and users. The documents are organized in folders and some users have permission to view documents in some folders. The use case in question is to search for documents using a full text search or some other criteria, but taking into account that you would only need to search the ones that the user has view permissions on.

Something like:

SELECT id FROM document
WHERE document.title LIKE search
AND check(document:id#can_view@user:current)

I can imagine 2 strategies:

First perform a spicedb lookup query to find all the documents that the current user has view permission on, stream all the document ids from the permission system and then filter that down more based on the other search criteria.
First perform the full text search locally, then check each item one-by-one in spicedb if the user has the view permission on it or not.

Both of these seem suboptimal and could result in a very expensive operation of potentially moving the whole database between systems. Are there any guidelines, docs or cases studies that we could read? Thank you!

josephschorr commented 2 years ago

Hi @henkosch, your two listed options are the recommendations today: pre and post filtering.

As you mention, however, they can become quite burdensome at much larger scales or where filtering is tied to external metadata. To solve that problem, we're currently working on this proposal: https://github.com/authzed/spicedb/issues/207.

In summary, it will provide an API called LookupWatch which can be used to actively monitor the changes in resources to which a subject has access, and an API called RoaringLookupResources, which will return a cached roaring bitmap representing that access. Once available, the expectation is that a caller can then retrieve the roaring bitmap from the cache, and then pass it as an additional filter to something like ElasticSearch, thereby providing the ability to do an ACL-aware filtered search.

daniel-munch-cko commented 2 years ago

Hi! I've stumbled upon this discussion which gauged my interest pretty quickly - ACL-aware filtered search is the biggest pain point I've seen in most permissions systems so far.

I've particularly liked the idea of the roaring bitmap - Reading up on #207 and roaring bitmaps (which made some fun weekend reads), I've got some questions to know on how you think this might work / not work.

Do you know of any search engines/databases with native roaring bitmap querying support? I've started to research roaring bitmap support for ElasticSearch - While it's easy to find that Lucene is using roaring bitmaps internally, it's been less obvious to find roaring bitmap support for query parameters. It mostly boils down to a custom plugin for ElasticSearch, implementing a skip-list like behaviour - like this one https://luis-sena.medium.com/improve-elasticsearch-filtering-performance-10x-using-this-plugin-8c6485516c1a

I've also spent some time reading up on roaring bitmaps (again a fun weekend read!) and came up with the following potential obstacles - I'm curious if you've put these into consideration for the roaring bitmap support?

207 suggests 64-bit integers for the resource IDs.
- Most implementations of roaring bitmaps are on 32-bit integers
- Only some implementations have 64-bit integer support
- There's no standardized serialization format for 64-bit roaring bitmaps https://github.com/RoaringBitmap/RoaringBitmap/issues/351
- Only the Java implementation has support on the faster implementation with ART data structure; https://github.com/RoaringBitmap/RoaringBitmap/issues/403

Thanks ahead! (And thanks as well for the fabulous work already put into SpiceDb, I've only discovered it a few days ago and it's already been a great joy working with it!)

josephschorr commented 2 years ago

Do you know of any search engines/databases with native roaring bitmap querying support? I've started to research roaring bitmap support for ElasticSearch - While it's easy to find that Lucene is using roaring bitmaps internally, it's been less obvious to find roaring bitmap support for query parameters. It mostly boils down to a custom plugin for ElasticSearch, implementing a skip-list like behaviour - like this one https://luis-sena.medium.com/improve-elasticsearch-filtering-performance-10x-using-this-plugin-8c6485516c1a

Native has been hard to find, unfortunately. There are, however, numerous plugins available, such as this one for Postgres: https://pgxn.org/dist/pg_roaringbitmap/.

I've also spent some time reading up on roaring bitmaps (again a fun weekend read!) and came up with the following potential obstacles - I'm curious if you've put these into consideration for the roaring bitmap support?

Proposal: Lookup Watch API and Tiger Cache for fast ACL-aware filtering #207 suggests 64-bit integers for the resource IDs.

Most implementations of roaring bitmaps are on 32-bit integers

Only some implementations have 64-bit integer support

There's no standardized serialization format for 64-bit roaring bitmaps Define serialisation format for 64 bit bitmaps RoaringBitmap/RoaringBitmap#351

Only the Java implementation has support on the faster implementation with ART data structure; A novel 64bit bitmap implementation RoaringBitmap/RoaringBitmap#403

Yeah, its been a consideration as well. The idea of using the roaring bitmap at all is to provide the most efficient means of sending the full list of accessible resource IDs to the client; we've also had discussions of providing other kinds of similar caches (all built on top of the same lookup watch API), with different data types if necessary. For example, a bloom filter might be another idea, although it has its own concerns.

I decided to start with roaring bitmap on the theory that, at worst, its fairly easy to convert into other forms if necessary

henkosch commented 8 months ago

@jzelinskie Regarding your answer in the LookupWatch API draf implementation PR, here is our use case described at the beginning of the thread. We are still looking for a way to filter our entities based on permissions and other search criteria at the same time. It would require us to stream permission changes to our consumer services so they can do the filtering in their own databases.

josephschorr commented 7 months ago

https://authzed.com/blog/materialize-early-access is now available for Early Access to solve this problem for very large scales

Krustie101 commented 7 months ago

https://authzed.com/blog/materialize-early-access is now available for Early Access to solve this problem for very large scales

Will this be available on prem? When will more details be made available? Not having this kind of solution available is a blocker for us even considering SpiceDB. Does this solution require us to use PostgreSQL as the datastore for the read models of our domain?

vroldanbet commented 7 months ago

@Krustie101 early access is only in the dedicated offering, we will explore on-prem as part of the SpiceDB Enterprise on-prem offering after that. For now, it supports CockroachDB, and PostgreSQL is the next in line. What datastore are you using?

Krustie101 commented 7 months ago

@vroldanbet Thanks for the reply. The datastore for our business domain read and write models is currently SQLServer but PostgreSQL will be our preferred datastore for our new bounded contexts. I assumed that the choice of data store for our business domain entities and read projections (we use CQRS) did not matter, but that we need to choose CockroachDB or PostgreSQL for the SpiceDB datastore.

Does this change if we want to use Authzed Materialize? If we choose PostgreSQL for our SpiceDB datastore, do we also need to choose PostgreSQL to store our business domain read projections. This would not be a blocker, but I am curious to know if and why this is the case. Has it something to do with "roaring bitmap" support in PostgreSQL?

vroldanbet commented 7 months ago

I assumed that the choice of data store for our business domain entities and read projections (we use CQRS) did not matter, but that we need to choose CockroachDB or PostgreSQL for the SpiceDB datastore.

That is correct, specially with CQRS, were you will treat SpiceDB as another projection of your business data.

Does this change if we want to use Authzed Materialize? If we choose PostgreSQL for our SpiceDB datastore, do we also need to choose PostgreSQL to store our business domain read projections. This would not be a blocker, but I am curious to know if and why this is the case. Has it something to do with "roaring bitmap" support in PostgreSQL?

No. Materialize provides streaming APIs you listen to and then write back to any database you choose. We are exploring deeper integrations with various data storage engines in the future, but for now the customer is responsible to ingest the output from Materialize.

authzed / spicedb

Question: Filtering collections based on permission AND also on external search params #280

207 suggests 64-bit integers for the resource IDs.