ekzhang / crepe

Datalog compiler embedded in Rust as a procedural macro
Apache License 2.0
460 stars 16 forks source link

add support for lifetimes in relations #9

Closed hydrolarus closed 3 years ago

hydrolarus commented 3 years ago

closes #2

Input relations, output relations and intermediate relations can all have one or more lifetime parameters.

If an input relation has a lifetime parameter, the whole Crepe type will have a single lifetime parameter and all lifetimes of input relations will use that lifetime (e.g. Type<'a, 'b> will only be used as Type<'a, 'a>).

ekzhang commented 3 years ago

This is interesting, thanks for the really clean implementation. I just added some docs. I have a question about your code - specifically, I understand why the output relations shouldn't have lifetimes if the inputs don't. However, the same for intermediate relations seems less clear since they're not being returned. Do you think something like the following can be ok?

use crepe::crepe;
use maplit::hashset;

crepe! {
    @input
    struct Word(String);

    struct Suffix<'a>(&'a str);

    @output
    struct SuffixAndPrefix(String);

    Suffix(&s) <- Word(s);
    Suffix(&s[1..]) <- Suffix(s), (!s.is_empty());

    SuffixAndPrefix(s.to_string()) <-
        Suffix(s),
        Word(w),
        (w.starts_with(s));
}

fn main() {
    let mut runtime = Crepe::new();
    runtime.extend(&[Word("abracadabra")]);
    let (output,) = runtime.run();
    let expected = hashset![
        SuffixAndPrefix("a".to_string()),
        SuffixAndPrefix("abra".to_string()),
        SuffixAndPrefix("abracadabra".to_string()),
    ];
    assert_eq!(output, expected);
}
hydrolarus commented 3 years ago

Thanks for the feedback!

I would have expected that intermediate relations could refer to the input relations as well, but currently the input relations are stored in a regular HashSet that is also mutated. The mutation only happens once at the start to fill in the input data from the Crepe struct, but the HashSet itself still has the .extend(&updates) part in the code. That prevents intermediates from creating and holding onto references from input relations.

If the input relations are handled in a different way then this could work I think! So if the input relation storages are placed before all the main loops, then it should be possible to keep references around.

For taking the reference to other intermediate relations and not input relations the problem still stands sadly. If a reference to a field in an intermediate relation A is taken and stored in B and then another A is added, the HashSet might move the first relation in memory. So without something like only allowing references when a relation storage reached a fixed point this would be unsafe and not possible.

hydrolarus commented 3 years ago

I implemented binding variables by reference, which allows intermediate relations to refer to input data by reference. To do that the code is iterating over values by ref instead (So for ref __crepe_var in instead of for &__crepe_var in) and creating a copy for expressions expecting copies. When binding a variable with A(ref x) then x will not be a copy but a reference instead.

The code is on my intermediate-lifetimes branch and it passes all the tests, however I would like to do benchmarking first before committing to this change.

This example now compiles and runs as expected (see test):

crepe! {
    @input
    struct Input([usize; 4]);

    struct Value<'a>(&'a usize);

    @output
    struct Output(usize);

    Value(&x[0]) <- Input(ref x);
    Value(&x[2]) <- Input(ref x);

    Output(*x) <- Value(x);
}

I'll set up some benchmarking and then compare the performance.

hydrolarus commented 3 years ago

I am now more confident in the changes I made:

The benchmarking showed me:

ekzhang commented 3 years ago

Great work. Thanks especially for adding the benchmarks! I just made a couple minor (non-functionality) changes and renamed S to CrepeHasher to reduce the chance of name conflicts. I also removed the for ref __crepe_var_ref in ... indirection - don't believe it's necessary? But please correct me if I'm wrong about that.

However, I don't understand how you initialize the input relations now. Can you help me understand why this is necessary? Right now, the code generates empty strata for the input relations and also doesn't initialize the indices properly. For example, changing test_transitive_closure to reverse the order of the clauses to Tc(x, z) <- Tc(y, z), Edge(x, y); leads to a test failure because indices on Edge are not initialized.

---- test_transitive_closure stdout ----
thread 'test_transitive_closure' panicked at 'assertion failed: `(left == right)`
  left: `[(1, 2), (2, 3), (2, 5), (3, 4)]`,
 right: `[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4)]`', tests/test_transitive_closure.rs:42:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is it possible to keep how the inputs were initialized before, i.e., in a stratum with the __crepe_first_iteration and __relation_update semantics? This should initialize the indices properly. If it's not possible because of how lifetimes work, that's okay too - intermediate lifetimes maybe aren't essential given that everything is Copy, and I think correctness is more important here.

hydrolarus commented 3 years ago

Thanks for the review and the corrections!

However, I don't understand how you initialize the input relations now.

It was my understanding so far that input relations are, once set, immutable. So there is no point in updating them. That's why I removed the _update sets for them. I'm not super sure how all that generated code works and why, so it's totally possible I misunderstood and that's an incorrect change to make. In that case it should be reverted of course!

Can you help me understand why this is necessary?

I think this is because if intermediate relations keep references to data inside of the storage of an input relation then that storage must not change any more. If the storage gets modified via .extend(input_updates) then all elements might be re-hashed and all the references would be invalid.

It could be that with the copy/ref split now that's less of an issue, but at least it gave compile errors before, I think because of that. I will investigate that, unless you say references inside intermediates are not worth the trouble, of course.

For example, changing test_transitive_closure to reverse the order of the clauses to Tc(x, z) <- Tc(y, z), Edge(x, y); leads to a test failure because indices on Edge are not initialized.

Ah, yes I probably forgot about the index then and wrongfully removed that :see_no_evil:

Is it possible to keep how the inputs were initialized before, i.e., in a stratum with the __crepe_first_iteration and __relation_update semantics? This should initialize the indices properly. If it's not possible because of how lifetimes work, that's okay too - intermediate lifetimes maybe aren't essential given that everything is Copy, and I think correctness is more important here.

Definitely agree about correctness, I'll add this reverse-order of the test_transitive_closure as a regression test. About reverting the initialisation, I'll give it a spin!

hydrolarus commented 3 years ago

Seems like I just did something wrong in the beginning and the initialisation worked with lifetimes all along. Thanks for pointing that out to me!

I also added a test to make sure swapping the order of clauses won't change the result again.

EDIT: Also just to mention it, with the latest changes the benchmarks show that performance is unchanged (for using run() without Fnv to have a fair comparison) between this change and master

ekzhang commented 3 years ago

Thanks, this looks great!