CozySynthesizer / cozy

The collection synthesizer
https://cozy.uwplse.org
Apache License 2.0
209 stars 18 forks source link

EHeapElems is buggy #60

Closed izgzhen closed 5 years ago

izgzhen commented 5 years ago
cozy maxbag.ds --java MaxBag.java --timeout=2

Then compile and run as https://github.com/CozySynthesizer/cozy/pull/59 indicates

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3181)
    at java.util.ArrayList.grow(ArrayList.java:265)
    at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239)
    at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231)
    at java.util.ArrayList.add(ArrayList.java:462)
    at MaxBag.add(MaxBag.java:183)
    at MaxBagMain.main(MaxBagMain.java:25)

I suspect that this is a code-gen bug.

izgzhen commented 5 years ago

@Calvin-L BTW, can you grant me sufficient permission to add labels to issues?

izgzhen commented 5 years ago

It feels like the problem is that EHeapElems is generated as iterating over the underlying ArrayList in Java. However, we might allocate more space than the actual size of the heap, which caused inconsistency here.

izgzhen commented 5 years ago

I think we might need to associate each ArrayList representing the Heap with an explicit heap_size. Or we can't really track its real size under mutations.

izgzhen commented 5 years ago
Zhen Zhang [9:51 PM]
https://github.com/izgzhen/cozy/commit/677e60b9a3120ff76624f770c864f9ada5ff8ae4
I am trying to fix heap code-gen bug (if it is a bug at all)
the basic idea is the we need to add a size field
since array size can’t represent “heap” size, when realloc is there
i guess
imagine that it would be a large diff so I want to hear from you before continuing

Calvin Loncaric [9:53 PM]
The heap operations that need the size take it as a parameter
This allows cozy to use only one side variable if there are several heaps with the same size, or if it is required by a user specified query

Zhen Zhang [9:54 PM]
however, the parameter itself need to be passed in from somewhere

Calvin Loncaric [9:54 PM]
Yes, but that should be handled by the existing code
Whenever a heap operation that needs a length gets created we also create an expression to compute the length
Cozy will determine how to compute the length, in the end

Zhen Zhang [9:56 PM]
yeah, but I feel like “an expression to compute the length” sometimes depends on `len EHeapElems`
and EHeapElems is code-gen’ed as iterating over underlying array …

Calvin Loncaric [9:57 PM]
Ah, it might be that we need to add such a parameter to EHeapElems too then
Like we have for the others
Adding a hardcoded field like you are suggesting could also work, but it is not symmetric with the other operations and it limits cozy's ability to find optimizations since the field is always present
The more logic you can move from codegen to synthesis, the better

Zhen Zhang [10:00 PM]
Good point, let me take a closer look at `EHeapElems`
https://github.com/CozySynthesizer/cozy/blob/master/cozy/structures/heaps.py#L93-L96
cozy/structures/heaps.py:93-96
                    ```n_elems = EStateVar(ELen(EHeapElems(e).with_type(TBag(elem_type)))).with_type(INT)
                    # yielding EHeapElems would be redundant
                    yield EHeapPeek (EStateVar(e).with_type(e.type), n_elems).with_type(elem_type)
                    yield EHeapPeek2(EStateVar(e).with_type(e.type), n_elems).with_type(elem_type)```
CozySynthesizer/cozyAdded by GitHub
:joy:
I think the approach you suggested might be non-trivial

Zhen Zhang [10:12 PM]
I think this might be a fundamentally different problem

Zhen Zhang [10:17 PM]
if we parameterize HeapElems with n, then how would state maintenance work?
State Maintenance sketch will need to rely on two variables
Do we support “relational state maintenance”?

Zhen Zhang [6:38 PM]
ping

Calvin Loncaric [7:25 PM]
There are many interacting forces here
Lemme go find a computer with a keyboard so I can type fast

Zhen Zhang [7:26 PM]
No problem, take your time

Calvin Loncaric [7:29 PM]
> I think the approach you suggested might be non-trivial
Yes, probably. Cozy does a lot, and the “right” fix often touches many of its components.
The code you posted is responsible for building heap-related expressions during enumeration
it does not build EHeapElems expressions because the only way to make a heap is through an EMake*Heap expression, and EHeapElems(EMakeHeap(bag, …)) == bag
so, we can save the synthesis engine some work by just not producing redundant expressions
(of course, the only thing you would lose by doing so is time---correctness would still be ok)
> if we parameterize HeapElems with n, then how would state maintenance work?
the same as it always has: if `n` and `heap` are state expressions, then EHeapElems(heap, n) requires maintaining both
but they can be maintained separately: as long as `n'` is equivalent to `n`, we can do the substitution without affecting the correctness of the EHeapElems call
there is no need for “relational” state maintenance that considers multiple state variables at once
if `EHeapElems(heap, n)` appears as a state expression, then we just need to maintain it as a single bag, and again we don’t need to worry about the relationship between `heap` and `n`
So, as we enumerate possible replacements for `heap` and `n`, you might wonder how we enforce that `n` is the right length
that happens in `check_wf(...)` in heaps.py
which rejects expressions that have an incorrect value for `n`

Zhen Zhang [7:34 PM]
> that happens in `check_wf(...)` in heaps.py
however current well-formedness depends on `len (EHeapElems)`

Calvin Loncaric [7:35 PM]
yep
so, we might need to build an analysis similar to `heap_func` in the same file
one that walks over a heap-type expression and computes the elements as a bag with no heap-type nodes in the expression tree

Zhen Zhang [7:38 PM]
I don’t understand
`heap_func` is already used right?
what might be the diff between `heap_func` and the promising one you said?

Calvin Loncaric [7:38 PM]
heap_func returns the function used to sort the heap’s elements
this one would return the elements
example calls:
`get_elems(EMakeMinHeap(bag, ...)) = bag`
`get_elems(ECond(cond, EMakeMinHeap(A, ...), EMakeMinHeap(B, ...))) = ECond(cond, A, B)`
in general, `get_elems(h) == EHeapElems(h)`, but it does so by eliminating the heap expressions in `h`

Zhen Zhang [7:41 PM]
what is inside `h`? Do you indicate that `h` is a closed expression without free variable inside?

Calvin Loncaric [7:41 PM]
`h` is (almost) any Cozy expression
the (almost) relies on the fact that ALL heap-type expressions bottom out at an EMake*Heap node
there are no truly free heap-type variables
that would change if we ever allowed heap-type arguments to methods, for instance
but it is fine right now

Zhen Zhang [7:43 PM]
> allowed heap-type arguments to methods
`self` is not considered “argument”?

Calvin Loncaric [7:43 PM]
no
Cozy “packs” expressions so member variables look like regular expressions
recall that `x+1` is REALLY `EStateVar(ELen(foo)) + 1` --- everything that would be in `self` is always expanded out
Cozy knows everything there is to know about things in `self`; other arguments are black boxes
so
```query q(h : Heap)
  return len(h)```
would cause problems for us, but having a heap as a member does not
anyways, maybe all this is the wrong way to go
here’s something different that might make more sense:
represent heaps using trees plus a hash map into the nodes of the tree for efficient removal
the tree representation eliminates this whole issue, since we won’t need the `n` parameter anywhere
and the hash map gives us O(log n) deletion; currently deletion is O(n)

Zhen Zhang [7:50 PM]
Yeah, will that be a `record { Tree, HashMap }`?

Calvin Loncaric [7:50 PM]
yep
it might be more straightforward since that change only affects codegen
it will still be difficult to implement, since Cozy’s codegen isn’t well suited to recursive structures right now
but it gives us a concrete benefit and has less synthesis-related weirdness

Zhen Zhang [7:58 PM]
Emm…frankly I still can’t get my head around that we can synthesize the right `n` argument to `EHeapElems`…
I guess that intuition is that synthesizer will automagically maintain such a seperate n in generated class?

Calvin Loncaric [8:00 PM]
yep
eventually the synthesizer might produce an expression like (edited)
`EHeapElems(EStateVar(EMakeMinHeap(...)), EStateVar(ELen(...)))`
two `EStateVar` nodes => two member variables: a heap-type variable and an int-type variable (edited)

Zhen Zhang [8:01 PM]
>`EHeapElems(EStateVar(EMakeMinHeap(...)), EStateVar(ELen(...)))`
No `ELen` in the second one maybe?

Calvin Loncaric [8:02 PM]
well, the `n` parameter is probably going to be the size of a collection in the specification
there might not necessarily be an `ELen`, but it is very likely

Zhen Zhang [8:07 PM]
but in the generated code there will be no ” the size of a collection” since there is no collection expect for the ArrayList for heap elements

Calvin Loncaric [8:08 PM]
nothing under an `EStateVar` ever gets run
(that’s a lie, but bear with me)
instead, it is stored as a variable and maintained incrementally
so we can actually put kinds of expression nodes under EStateVar that cannot ever be evaluated directly
there are no such nodes right now, but we could do it
(just kidding, EHeapElems is one such node it turns out)
the lie is that expressions under an EStateVar get evaluated at construction time to initialize the data structure

Zhen Zhang [8:28 PM]
Okay, so let me try to recap: the bug is that `EHeapElems(h)` is wrong since `h`’s representation is `TArray`, which might have a larger size. Thus one proposal is to provide `n` to `EHeapElems`, in which `n` will be synthesized as an “virtual size” of heap and get updated incrementally from the len of bag the time of construction. What we need to do is provide the `n` as `ELen(get_elems(h))` in each synthesized expression which uses `EHeapElems` (edited)

Calvin Loncaric [8:29 PM]
yep, that’s right
and also update `check_wf` to ensure that Cozy doesn’t accidentally break the precondition that `n` is truly the number of elements in the heap

Zhen Zhang [8:31 PM]
great
mernst commented 5 years ago

It would be more helpful to write a summary of your conversation, than to paste the conversation. That will be briefer and easier for others to understand.

izgzhen commented 5 years ago

Summary of the bug: EHeapElems(h) is wrong since h’s representation is TArray, which might have a larger size.

Thus one proposal is to provide n to EHeapElems, in which n will be synthesized as an “virtual size” of heap and get updated incrementally from the len of bag the time of construction. What we need to do is provide the n as ELen(get_elems(h)) in each synthesized expression which uses EHeapElems. Here is a previous attempt to fix this problem with this proposal: https://github.com/CozySynthesizer/cozy/pull/61. However, we found that the design is flawed (details in PR).

@Calvin-L proposed four potential fixes following-up PR #61:

  1. Enumerating elements of a heap will not be an operation we support, in which case EHeapElems should be a "specification-only" primitive that does not take a length parameter. This is blocked on #67, which I just opened, and will require a lot of work.
  2. Bite the bullet and just add a length field for every heap at runtime, which costs something but sidesteps the problem and would let us eliminate the length parameters from the *Peek operations as well.
  3. Change heaps to be trees instead of arrays, which would give us O(log n) deletion time at the cost of more memory usage, and would let us eliminate the length parameters from the *Peek operations as well.
  4. Ignore this issue entirely; EHeapElems tends to disappear if you run synthesis long enough.

I felt like first we shouldn't ignore it, since there is no guarantee that it will always disappear. And sometimes, it will be confusing to user since we claimed that "it is an anytime algorithm". --timeout also provides a good way for us to measure how Cozy's optimization converges over time. We shouldn't give up on that good property.

I suspect that option (3) implies no less work than option 1. And in some sense it is just an incremental change over design (2). Design (1) implies not just work but also change in the architecture, so I believe that we should plan it as a long-term goal and see if it is worth pursing.

In summary, I think we should temporarily fix this problem by design (2).

mernst commented 5 years ago

I think of Cozy has having two distinct data structures:

Is proposal 2 to do away with the heaparray representation and always use a heap (represented as a pair) instead?

Calvin-L commented 5 years ago

I think we should temporarily fix this problem by design (2).

That seems good to me.

I think of Cozy has having two distinct data structures

I think of it as only one, but some of its operations are different from the ones that normal heaps have since they need a valid length parameter.

It is also true that Cozy has two different "views" of a heap's representation: one view in terms of bags which Cozy uses to reason about the semantics of different operations during synthesis, and one view in terms of arrays which Cozy uses to generate efficient code. For Cozy to be correct, these two views need to behave identically on identical inputs.