apple / pkl

A configuration as code language with rich validation and tooling.
https://pkl-lang.org
Apache License 2.0

Simplify lazy type checking of listings/mappings #789

Open translatenix opened 2 weeks ago

translatenix commented 2 weeks ago

I marked this PR as a draft because my initial goal is to start a conversation. For details see commit messages.

translatenix commented 1 week ago

> Using the amends chain affects the semantics of super.

Does it? Note that a typecast results in an object with no members. (In many cases, the extra object could be optimized away, but that's a separate concern.)

> I also don't think this is great for perf. Using the amends chain means that every child receives its own complete set of cached members every time, and also means computing the same value multiple times (which you've seen by the now deleted listing7.pkl).

I think this PR could implement the same optimization that exists in the current code. However, it's not clear to me that this optimization is beneficial in the real world:

I think that my proposed optimization "avoid creating an intermediate untyped listing/mapping", which isn't possible with the current data structure, is more likely to result in real-world gains. Unfortunately, I don't have access to large real-world Pkl programs to test my performance hypotheses. I've considered generating large programs, but such programs may not be indicative of real-world performance.
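To make the proposed optimization concrete, here is a minimal sketch of the pattern it targets, as I read the proposal (the property name is illustrative):

```pkl
// Today, the right-hand side first builds an untyped listing and then
// typecasts it against Listing<Int>. The proposed optimization would
// construct the typed listing directly, with no intermediate object.
ports: Listing<Int> = new {
  8080
  8443
}
```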

bioball commented 1 week ago

If we don't share cached members, every node on m1 gets executed twice, assuming that both m1 and m2 are in the path of evaluation. It also means that we double the number of cached members, and double the allocations (more VmObjects, VmValues, etc.).

m1: Mapping = new {
  ["foo"] {
    bar {
      baz = doSomething()
    }
  }
}

m2: Mapping<String, Dynamic> = m1

Here's a "real-world" ish piece of code that ends up being much more costly (run with pkl eval -m <output_dir> and see two traces):

import "package://pkg.pkl-lang.org/pkl-k8s/k8s@1.1.1#/api/core/v1/Pod.pkl"
import "package://pkg.pkl-lang.org/pkl-k8s/k8s@1.1.1#/K8sResource.pkl"

local pods: Listing<Pod> = new {
  new {
    metadata {
      name = trace("my-cool-pod")
      namespace = "dev"
    }
  }
}

local allPodNames = gatherNames(pods)

function gatherNames(resources: Listing<K8sResource>) =
  resources.toList().map((k) -> k.metadata.name)

output {
  files {
    for (p in pods) {
      ["\(p.metadata.name).yaml"] = p.output
    }
    ["all-pods-names.yaml"] {
      value = allPodNames
      renderer = new YamlRenderer {}
    }
  }
}

I think the current approach probably makes the best trade-off here; we have extra fields for every VmMapping and VmListing, but computing every node twice is much more expensive, and, on average, it's a big perf gain.

bioball commented 1 week ago

BTW, quick note: your proposed changes introduce this regression, so we definitely have some corner cases not caught by our language snippet tests:

Given:

class Person { name: String }

people1: Listing<Person> = new {
  new {
    name = "Sandy"
  }
}

people2: Listing<Person(name.endsWith("Smith"))> = (people1) {
  [[true]] {
    name = super.name + " Smith"
  }
}

The changes in this PR produce:

–– Pkl Error ––
Type constraint `name.endsWith("Smith")` violated.
Value: new Person { name = "Sandy" }

9 | people2: Listing<Person(name.endsWith("Smith"))> = (people1) {
                            ^^^^^^^^^^^^^^^^^^^^^^
at test#people2 (/Users/danielchao/code/apple/pkl/.dan-scripts/test.pkl:9)

4 | new {
    ^^^^^
at test#people1[#1] (/Users/danielchao/code/apple/pkl/.dan-scripts/test.pkl:4)

106 | text = renderer.renderDocument(value)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
at pkl.base#Module.output.text (https://github.com/apple/pkl/blob/6a793b7c5/stdlib/base.pkl#L106)
translatenix commented 1 week ago

> BTW, quick note: your proposed changes introduce this regression, so we definitely have some corner cases not caught by our language snippet tests:

Fixed here: https://github.com/apple/pkl/pull/789/commits/724482dc799dfbaf19ffb992502fd1363d98ff2c

translatenix commented 1 week ago

> I think the current approach probably makes the best trade-off here; we have extra fields for every VmMapping and VmListing, but computing every node twice is much more expensive, and, on average, it's a big perf gain.

Your first example is a downcast caused by partial typing. Intuitively, it feels OK to me that a downcast, which shouldn't be the norm, incurs some overhead. Note that if m2 amends m1, the current optimization won't help.
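A minimal sketch of that amends variant, reusing the shape of the earlier m1/m2 example, where (per the point above) the cached-member sharing optimization does not apply:

```pkl
m1: Mapping = new {
  ["foo"] {
    bar = 1
  }
}

// m2 amends m1 instead of assigning it directly, so the current
// cached-member sharing optimization won't help here.
m2: Mapping<String, Dynamic> = (m1) {
  ["baz"] {
    qux = 2
  }
}
```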

Your second example is an upcast. This case feels more relevant to me because it cannot be easily avoided. Here, a better optimization is to recognize (and cache) the fact that no type cast is required because Listing<Pod> is a subtype of Listing<K8sResource>.
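A sketch of the cast-elision idea with hypothetical classes (in the k8s example above, Pod plays the role of the subclass and K8sResource the superclass):

```pkl
class Animal { name: String }
class Dog extends Animal { breed: String = "mixed" }

dogs: Listing<Dog> = new {
  new { name = "Rex" }
}

// Since Listing<Dog> is a subtype of Listing<Animal>, an implementation
// that recognizes (and caches) this subtype relationship could skip the
// element-by-element type cast entirely.
animals: Listing<Animal> = dogs
```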

The number of root node calls made in a limited set of circumstances (*) shouldn't be the only metric an implementation is judged by. I see the following problems with the current implementation, some of which are easier to fix than others:

(*) direct assignment without amendment, type cast cannot be elided, many computed members, computations are expensive

bioball commented 4 days ago

> Your first example is a downcast caused by partial typing. Intuitively, it feels OK to me that a downcast, which shouldn't be the norm, incurs some overhead.

I think it might be fine if it was a little bit of overhead, but, in this case, it's twice the overhead (in both space and time). That's quite a lot, and can make it hard for users to reason through the performance of a program. This means that changing or adding a type annotation, which seems quite innocent, might double the cost of something. In large codebases, this can become quite problematic.

Responding to the rest of the concerns:

> • adds a lot of complexity
> • adds a lot of state (5 fields and 2 maps per VmListing instance)

Indeed; it would be good to reduce complexity, but I think we should be sharing cached members whenever possible.

> • makes many VmObjectLike/VmListingOrMapping virtual method calls, which are best avoided in Truffle interpreters

Sorry; can you elaborate on "virtual method call"? Do you mean dynamic dispatch?

> • makes accessing cached values more expensive
> • makes iterating cached values more expensive

That's true, although, in the common case, I think this is pretty marginal. In the following code, all of the cached values are stored directly on the delegated (new Listing { 1; 2; 3 } as Listing<Int>), rather than on the delegate (new Listing { 1; 2; 3 }).

foo = new Listing { 1; 2; 3 } as Listing<Int>

As long as cached values are stored on the delegated, member lookups are just as cheap as they used to be.

> • Can't avoid an intermediate untyped VmListing in the common case that a property of type Listing<X> is amended in place. That's because the current VmListing can't have both a type cast node and new members.

We can probably optimize the case of new Listing<Int> { 1; 2; 3 } even in the current implementation

> • typeNodeFrame seems unnecessary. EDIT: It may be necessary in the current implementation, but isn't in my implementation.

Might be unnecessary in the current implementation too; need to investigate this (thanks for pointing this out!)

> • checkedMembers seems unnecessary. I think cachedValues could be used instead.

It's indeed not necessary; it's there as an optimization, although it might not be saving that much.

> • skips required type casts. Executing type casts in the receiver's and owner's delegate chain isn't sufficient.

Are you referring to https://github.com/apple/pkl/issues/785? If so, this is fixed in https://github.com/apple/pkl/pull/822.

> • uses Objects.requireNonNull as an assertion in interpreter code. This is a small but unnecessary overhead. requireNonNull is primarily intended for validating method arguments in public APIs, and throws NullPointerException.

Good point! We should switch to assert for these.

translatenix commented 4 days ago

> I think it might be fine if it was a little bit of overhead, but, in this case, it's twice the overhead (in both space and time). That's quite a lot, and can make it hard for users to reason through the performance of a program. This means that changing or adding a type annotation, which seems quite innocent, might double the cost of something. In large codebases, this can become quite problematic.

I think you may be focusing too much on this particular optimization instead of the bigger optimization picture. Anyway, I brought back the optimization and corresponding test here: 2ec4124d

Another flaw of this optimization, or at least its current implementation, is that it only works if the member is, coincidentally, first evaluated for the parent/delegate. This can be verified by changing the order of listing1 and listing2 in listing7.pkl.
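A hypothetical reduction of the ordering issue (listing7.pkl isn't shown here, so the names below are illustrative):

```pkl
listing1: Listing<Int> = new { trace(1 + 2) }
listing2: Listing<Int> = listing1

// If listing1's members are forced first, listing2 can reuse the cached
// values. If evaluation happens to start with listing2 instead, the
// parent/delegate has nothing cached yet, so the sharing never kicks in.
```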

> That's true, although, in the common case, I think this is pretty marginal. In the following code, all of the cached values are stored directly on the delegated (new Listing { 1; 2; 3 } as Listing<Int>), rather than on the delegate (new Listing { 1; 2; 3 }).

I think checkedMembers prevents values from being cached in a single place. Hence cached values can no longer be iterated by iterating an EconomicMap, which is much faster than doing a map lookup for each value.

> Sorry; can you elaborate on "virtual method call"? Do you mean dynamic dispatch?

Yes. Several VmObject methods are no longer final.

> We can probably optimize the case of new Listing<Int> { 1; 2; 3 } even in the current implementation

Yes, but not the amendment case.
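A sketch of that amendment case (the class and property names are illustrative), relying on the constraint stated earlier that a VmListing can't have both a type cast node and new members:

```pkl
class Config {
  ports: Listing<Int>
}

config: Config = new {
  // Amending the Listing<Int>-typed property in place currently builds
  // an untyped listing first and then casts it, because a single
  // VmListing can't carry both a type cast node and new members.
  ports {
    8080
  }
}
```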

> It's indeed not necessary; it's there as an optimization, although it might not be saving that much.

What makes checkedMembers a (small) improvement over cachedValues?

bioball commented 1 day ago

> I think you may be focusing too much on this particular optimization instead of the bigger optimization picture. Anyway, I brought back the optimization and corresponding test here: https://github.com/apple/pkl/commit/2ec4124dd8fcefd2d98c1a5a5384d795bcf31496

> Another flaw of this optimization, or at least its current implementation, is that it only works if the member is, coincidentally, first evaluated for the parent/delegate. This can be verified by changing the order of listing1 and listing2 in listing7.pkl.

We're certainly coming at this from two sides. You are right that this optimization is only relevant for certain kinds of Pkl code, but when code does hit that case, skipping the optimization would be a significant perf penalty. Doing extra evaluation is expensive and should be avoided where possible. But thanks for adding this optimization back! I will take a look.

> What makes checkedMembers a (small) improvement over cachedValues?

Going through the existing code again, I realized that I introduced checkedMembers at first because the delegatee was not receiving its own cached members. It's actually a result of iterating on the implementation and not re-visiting old assumptions.