node-set arguments are converted to relative paths when they should be absolute

MartijnR commented 3 years ago

Problem description

node-set arguments are converted to relative paths if the argument node(s) is/are inside a repeat and referred from the same repeat

Steps to reproduce the problem

type	name	label	calculation	readonly
begin repeat	repeat
integer	item1	item1
integer	count1	count1	count(${item1})	yes
integer	countne1	countne1	count-non-empty(${item1})	yes
integer	sum1	sum1	sum(${item1})	yes
integer	max1	max1`	max(${item1})	yes
integer	min1	min1	min(${item1})	yes
text	concat1	concat1	concat(${item1})	yes
text	join1	join1	join('...', ${item1})	yes
text	randomize1	randomize1	randomize(${item1})	yes
text	area	area	area(${item1})	yes
text	distance	distance	distance(${item1})	yes
end repeat

Expected behavior

All item1 should output as /data/repeat/item1, and not as ../item1.

Other information

This is exactly the same problem as the already fixed https://github.com/XLSForm/pyxform/issues/484 for indexed-repeat(), so this fix should probably be integrated with that fix.

Perhaps it would be helpful to maintain an object of XPath functions that contain node-set parameters and list the parameter indices that are node-sets in arrays, e.g.

{
     "count": [0],
     "count-non-empty": [0],
     "sum":  [0],
     "max": ["*"],
     "min": ["*"],
     "concat": ["*"],
     "join": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
     "randomize": [0],
     "indexed-repeat": [0, 1, 3, 5],
     "area": [0],
     "distance": [0]
}

I believe join() can have an unlimited number of parameters, and only the first is not a node-set. So perhaps that should be handled separately, or a syntax can be invented for 1+. Or just not worry about more than 20, since that's very likely to never be used.

gushil commented 3 years ago

Hi @MartijnR What does * means for max, min, and concat ? Is that means all arguments for those functions should be returned as absolute?

Thanks

MartijnR commented 3 years ago

Sorry, I should have clarified that and added the source. This means that all arguments are node-sets. Feel free to use something else though of course.

See source here: https://getodk.github.io/xforms-spec/#xpath-functions

gushil commented 3 years ago

{
     "count": [0],
     "count-non-empty": [0],
     "sum":  [0],
     "max": ["*"],
     "min": ["*"],
     "concat": ["*"],
     "join": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
     "randomize": [0],
     "indexed-repeat": [1, 2, 4, 6],
     "area": [0],
     "distance": [0]
}

Hi @MartijnR, I have working code now but I want to confirm these. Those are the xpath functions and list of parameter positions that are node-sets in array.

So, for example, if ${item} is found as first (or second, or fourth, or sixth) parameter of indexed-repeat, ${item} will be returned as absolute path. Is that correct?

If that is correct, I think for count, count-non-empty, sum, randomize, area, and distance, the indices should be 1, not 0, right? (because the first parameter of those functions that will be returned as absolute path)

I'm using something like this:

xpath_functions = {
            "count": [1],
            "count-non-empty": [1],
            "sum":  [1],
            "max": ["*"],
            "min": ["*"],
            "concat": ["*"],
            "join": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
            "randomize": [1],
            "indexed-repeat": [1, 2, 4, 6],
            "area": [1],
            "distance": [1]
        }

MartijnR commented 3 years ago

@gushil, I used 0-based indices, but made a mistake with indexed-repeat. I've corrected it above. Best to check my work on the others as well with the documentation of those functions.

gushil commented 3 years ago

Hi @MartijnR I've created PR #528 to solve this issue.

Waiting for some review.

Thanks.

lognaturel commented 3 years ago

Sorry I missed this. I'm afraid I'm not in favor of this change!

First, it breaks existing forms that use aggregate functions on nested repeats. This contexts seems like a very good reason to use aggregate functions inside a repeat and is a common pattern for those who choose to use nested repeats. Consider this test:

   def test_count_of_nested_repeat_from_outer_repeat_produces_relative_ref(self,):
        self.assertPyxformXform(
            name="data",
            md="""
                | survey |              |             |             |                   |
                |        | type         | name        | label       | calculation       |
                |        | begin_repeat | repeat1     |             |                   |
                |        | begin_repeat | repeat2     |             |                   |
                |        | text         | foo         | Foo         |                   |
                |        | end_repeat   | repeat2     |             |                   |
                |        | integer      | count1      | count1      | count(${repeat2}) |
                |        | end repeat   |             |             |                   |
            """,  # noqa pylint: disable=line-too-long
            xml__contains=[
                """<bind calculate="count( ../repeat2 )" nodeset="/data/repeat1/count1" type="int"/>""" 
            ],
        )

Second, specifying an aggregate function over a repeat's value(s) from within that repeat is inefficient and feels semantically strange to me. It's also explicitly not supported by ODK Collect because of the performance issue. I always tell form designers to introduce a calculate outside their repeats to do this. In other words, is there any advantage to

type	name	label	calculation
begin repeat	repeat
integer	item1	item1
integer	count1	count1	count(${item1})
end repeat	repeat

over

type	name	label	calculation
integer	count1	count1	count(${item1})
begin repeat	repeat
integer	item1	item1
end repeat	repeat

lognaturel commented 3 years ago

Couple other thoughts that came to mind...

The indexed-repeat issue feels different because its spec requires absolute refs and then indexes are explicitly specified for each path segment.

The existing pyxform behavior is the same it's always been, right? That is, with the original interpretation of absolute paths within repeats, those absolute paths were contextualized according to the repeat instances they were in. E.g. /data/repeat/item1 from within /data/repeat[3] would be contextualized to /data/repeat[3]/item1. When we switched to relative references, that same interpretation continued. Does that sound right?

MartijnR commented 3 years ago

When we switched to relative references, that same interpretation continued. Does that sound right?

No, not in Enketo. This welcome change finally allowed us to evaluate XPath correctly without injecting positions!

First, it breaks existing forms that use aggregate functions on nested repeats.

Yes, indeed. I forgot about nested repeats (I have a tendency to do that ;)). So, it could continue to use a relative path but if used inside the same repeat instance the argument refers to, it could step outside of that repeat (like in #503)?

Second, specifying an aggregate function over a repeat's value(s) from within that repeat is inefficient and feels semantically strange to me. It's also explicitly not supported by ODK Collect because of the performance issue.

Certainly a weird use, but I have no doubt @pbowen-oc has some fine examples that use this... If there aren't good use cases, this issue becomes unimportant indeed.

lognaturel commented 3 years ago

finally allowed us to evaluate XPath correctly without injecting positions

I do think usually folks want relative expressions from within repeats. I understand that wasn't practical to maintain for Enketo so it's good the spec change was made but what I'm saying is that pyxform maintained the same client behavior across that change. Coming back again to ${item1} referenced from a node within /data/repeat[3], my understanding is that:

Prior to the spec change, pyxform expanded ${item} to /data/repeat/item which was interpreted by clients as /data/repeat[3]/item
After the spec change, pyxform now expands ${item} to ../item which is interpreted by clients as /data/repeat[3]/item

If I understand correctly, you're proposing changing the expansion to /data/repeat/item (or ../../repeat/item) assuming that the user's intent is to reference the nodeset of all items. I'm not convinced that's really generally the intent and I don't like that existing XLSForms would be converted to XForms with different semantics. Does that make sense?

Unless I'm mistaken, in the nested repeat case, there is a difference between the example I showed in the test and what would happen if we changed the expansion to ../../repeat1/repeat2. That is, in my test, we're getting the nodeset of the nested repeat for the current instance of the outer repeat. In my experience, that's what users want (e.g. compute the number of people in the household). I believe ../../repeat1/repeat2 would be the nodeset of all instances of the inner repeat for all instances of the outer repeat (e.g. the number of people across all households). I've never seen users want to do this but if they did, pulling it outside of both repeats would make a lot more sense, I think (they could still reference it from within either repeat). What I do see is people using the nodeset of the nested repeat to compute something (usually just a count) and then using the nodeset of those computed values from the form root to get e.g. the total count.

pbowen-oc commented 3 years ago

@lognaturel @MartijnR - We don't work with nested repeats here, so I can't comment on the functionality of those.

I am not sure how these functions worked for us before when using the older version of Pyxform with absolute references. I was assuming there was a change in the form functionality with the switch to relative references as there had been with indexed-repeat(), but I don't know that for sure.

I agree that it would be more efficient to put one of these nodeset functions outside the repeating group and then reference that value as needed within the repeat. I'm not aware of any use case where having the nodeset function within the repeat would not work as desired from outside the repeat. Assuming that is the case, I don't think I see a strong need to update the behavior as described here.

MartijnR commented 3 years ago

This change was not meant to change behavior of existing (newly transformed) XLSForms. When a user uses a function such as count() or sum() it's very unlikely the user would want to include only one ${item1} node instead of all of them.

However, if there is no use case for this...., then indeed we shouldn't bother at all! Thanks for questioning the need for this added complexity @lognaturel.

lognaturel commented 3 years ago

it's very unlikely the user would want to include only one ${item1} node instead of all of them

I do agree that your interpretation of intent is likely correct in a single-repeat context. But I'm very happy to keep "forcing" people to put these computations outside the repeat! Let's say you do have count(/data/repeat/item1) inside /data/repeat, does Enketo actually recompute every single repeat instance and need to iterate over every repeat instance to do so or does it do something smart to factor out the computation?

XLSForm / pyxform