Potential inefficient use of Schema during entity loading

tpaulus commented 1 month ago

Describe the improvement you'd like to request

When Entities are loaded (either from string, or from an iterable of Entities), a new CoreSchema, which profiling has revealed to be an expensive operation for a large schema.

Describe alternatives you've considered

It appears that the Schema is immutable once constructed, and as such, may be possible to have the CoreSchema instantiated upon creation of the Schema to avoid incurring the CoreSchema instantiation cost each time entities are loaded or have a schema applied to them.

Additional context

Profiling Code:

use cedar_policy::{Entities, Schema};
use std::fs::File;

fn main() {
    let schema_file = File::open("schema.cedarschema").unwrap();
    let schema = Schema::from_cedarschema_file(schema_file).unwrap().0;

    let loaded_entities = {
        let entities_file = File::open("entities.json").unwrap();
        Entities::from_json_file(entities_file, None).expect("Error loading entities")
    };

    let schema_entities = Entities::from_entities(loaded_entities, Some(&schema)).expect("Error loading entities");
    println!("{}", schema_entities.iter().count());
}

Is this something that you'd be interested in working on?

[X] 👋 I may be able to implement this internal improvement
[ ] ⚠️ This feature might incur a breaking change

john-h-kastner-aws commented 4 weeks ago

There might be a way to avoid the expensive step in building a CoreSchema, but, either way, we should make this operation cheaper

john-h-kastner-aws commented 4 weeks ago

Copying comments from slack:

I think the cost is coming from the step where we have to invert the action hierarchy: https://github.com/cedar-policy/cedar/blob/8ab5daeb89fcf80ee2aadf20b685aff7163d2b71/cedar-policy-validator/src/schema.rs#L818-L831 The only reason we need to store the action hierarchy in that orientation (edges going from parent->child rather than child->parent) is to support this query efficiently when typechecking in expressions in policy validation here https://github.com/cedar-policy/cedar/blob/8ab5daeb89fcf80ee2aadf20b685aff7163d2b71/cedar-policy-validator/src/schema.rs#L784-L800 so, the idea is to rework how we validate in so that rather than needing to query for what actions are descendants of a particular action, it instead asks for what actions are ancestors of a particular action.

Though, that's an invasive change to the typechecking code, and it still leaves some less expensive but still not free code for constructing the action entities in the CoreSchema constructor, so I wouldn't object to lifting some of to happen once in the schema constructor. If you want to take a stab at that route, I'd recommend trying to just lift the construction of the actions hash map. The rest of the CoreSchema constructor is just wrapping a reference to the ValidatorSchema object, so it would become basically free. That would also let us optimize the action_entities function at the same time.

tpaulus commented 4 weeks ago

It's better with the changes in #1290, but still not as performant as I'd like...

From 38ms to 7ms for our schema and small entity graph.

john-h-kastner-aws commented 3 weeks ago

It's surprising to see a large block in that chart for clone on entity uids, that should be a fairly cheap operation

tpaulus commented 3 weeks ago

It looks like a bulk of the effort is now spent as a result of this line: https://github.com/cedar-policy/cedar/blob/main/cedar-policy-core/src/entities.rs#L216

However, I can't figure out a clean way to change the entity map to be either a Map of EIDs and Entity Borrows, or Entity ARCs, since a number of downstreams (like the TC Calculator) expect owned references of the entities.

john-h-kastner-aws commented 3 weeks ago

Some more context from discussions: performance issues is likely due to needing to clone a large (>1000) set of action entities where these actions are in many action groups.

tpaulus commented 3 weeks ago

https://github.com/cedar-policy/cedar/pull/1296 reduces the need of Arc::unwrap_or_clone().

cedar-policy / cedar