getodk / web-forms

ODK Web Forms enables form filling and submission editing of ODK forms in a web browser. It's coming soon! ✨
https://getodk.org
Apache License 2.0
12 stars 9 forks source link

Form design pattern "Repeating as long as a condition is met" pathological infinite loop #205

Open eyelidlessness opened 3 months ago

eyelidlessness commented 3 months ago

The form design pattern is described in the XLSForm Reference section titled Repeating as long as a condition is met. There is potential in using this form design pattern to create a pathological case: an infinite loop, where the incrementing expression's condition is always satisfied by each newly added repeat instance.

This was discovered in CI failures after merging both #187 and #195, specifically failing on the child vaccination smoketest.

I had originally filed this as a bug, until I looked more closely at the child vaccination form definition. Here is the expression that controls the repeat's jr:count (formatted and parentheses for readability):

if(
    /data/building_type = 'single',
    1,
    if(
        (
            /data/flatcount = 0 or
            indexed-repeat(/data/household/finalflat, /data/household, /data/flatcount) != 'yes'
        ),
        /data/flatcount + 1,
        /data/flatcount
    )
)

I initially considered the possibility that the behavior was caused by reactive over-subscription, where the indexed-repeat call references the repeat itself (/data/household). So I tried replacing that call with its LocationPath equivalent, so it would reference only the repeat instance's descendant. Here is the same condition sub-expression, in that format:

/data/flatcount = 0 or
/data/household[/data/flatcount]/finalflat != 'yes'

With this change, the form still exhibited the same behavior, ruling out that narrow hypothesis. I also considered the possibility of broader reactive over-subscription, implicating our implementations of:

To investigate this, I updated my prototype of #203 (addressing #39) to account for the recently merged jr:count/indexed-repeat support. I found the behavior was still identical (albeit much faster to reach the failure mode!).

It was only after adding some logging there to better understand the infinite loop, that I noticed the repeat count was continuously increasing until failure. This got me to look more closely at the condition itself. Specifically here:

/data/flatcount = 0 or
/data/household[/data/flatcount]/finalflat != 'yes'
                                           ^^^^^^^^

There's our infinite loop! Here's what happens:

  1. On load, if there are zero repeat instances, the jr:count increments
  2. On increment, a new repeat instance is created
  3. Once added, the jr:count expression is rerun, checking finalflat in the repeat range's last repeat instance
  4. That value's default state is blank, goto 3

We discussed this in Slack, where I theorized that the reason this works in Collect/JavaRosa is because computations are rerun based on the user's positional state within a form: the infinite loop exists, but does not advance to its next iteration until the user advances to the next repeat instance to trigger it! @lognaturel agreed with this theory.

As I said when I closed #206, we now feel the most appropriate course of action for the CI failure aspect of this issue is to create an alternate version of the form/smoketest which doesn't exhibit this pathology. In this case we can simply replace != 'yes' with = 'no'.

As I understand it, we also believe the form design pattern—while error prone in this way—is relatively niche. It's likely we can address the pathological case with user support.

And again as mentioned in closing #206, I expect the alternate form will also require updating the smoketest itself to account for further progress into the form. Following that, I will likely want to do a very brief timeboxed spike into potential mitigations for this specific pathology.

I think I have an idea for a relatively simple special case: short circuit multiple jr:count changes occurring in the same event loop tick (or at least, in the same blocking period/sequence of contiguous microtasks). I think this is probably something we'll want not just for this pathological case, but to potentially detect others which may present similarly.

I'll also add for posterity that I believe we could detect this case with static analysis alone. I don't know if it's worth the effort! But it is definitely knowable, programmatically, from the form definition itself.

brontolosone commented 1 month ago

just documenting: @eyelidlessness dug up this which is AFAIK where it's mentioned in the spec that cycles in the computation graph are an input error. That's intuitive — in spreadsheets they are too — but it's nice to have it spelled out.