Prerelease math filter operators don't allow duplicate inputs

inmysocks commented 5 years ago

To see a quick demonstration of the problem you can go to tiddlywiki.com/prerelease and make a new tiddler and paste this into the text field:

<$list filter='1 1 1 1 +[sum[]]'/>

the expected output is 4, but because filters enforce uniqueness the output is 1

I am looking at the list widget and filtering code to see if there is a simple way to add an attribute that would allow duplicate entries, but I am a bit concerned about editing any of it because of how carefully the list and filtering has been designed and how deeply integrated into tiddlywiki it is.

inmysocks commented 5 years ago

I believe that the uniqueness is enforced by the pushTop function which is used in creating the filtering function in filters.js (links to the code below).

My first thought is to add an attribute to the list widget that indicates that duplicate entries are allowed and pass that down though to the pushTop function and if it is true than skip the part enforcing uniqueness.

I don't think that this would have any effects if the flag were not set and that it would be almost identical to the change made to parseStringArray(#2027) that let it return duplicate entries without changing any other behaviour.

I will make a pull request but it may take a few days before I have time to put it together.

Hopefully before then someone will point out that I am wrong and that it would be unnecessary.

https://github.com/Jermolene/TiddlyWiki5/blob/0e6855eba8945e023cc5ef19809cfb6d6025627b/boot/boot.js#L76 )

https://github.com/Jermolene/TiddlyWiki5/blob/0e6855eba8945e023cc5ef19809cfb6d6025627b/core/modules/filters.js#L245

diego898 commented 5 years ago

a probably completely unrelated hack, but [title[1 1 1 1]split[ ]sum[]] bypasses this issue

Jermolene commented 5 years ago

Thanks @inmysocks we need to fix this for filters in all contexts, not just within the list widget.

The core of the problem is in the code that handles merging runs into the accumulator:

https://github.com/Jermolene/TiddlyWiki5/blob/master/core/modules/filters.js#L247-L268

if we could work around the calls to "pushTop" there, then we'd just be left with the calls to "pushTop" that are within specific operators. But they almost all have semantics that naturally exclude duplicates (e.g. tagging, indexes, fields).

One possibility might be a magic prefix at the start of the filter that flips off the duplicate removal.

Jermolene commented 5 years ago

OK just as an experiment, I tried replacing every occurence of $tw.utils.pushTop with Array.prototype.push.apply in $:/core/modules/filters.js, and there's a couple of interesting points:

embarrassingly, the test suite in bin/test.sh still runs successfully
running the wiki seems to work OK so far...

Jermolene commented 5 years ago

I've now tried another experiment where I've reverted filters.js, and instead instrumented $tw.utils.pushTop so that it logs any duplicates that are removed. In normal use of the core, it seems that removing duplicates just doesn't happen.

So, now I've extended the original experiment by:

as before, replacing every occurence of $tw.utils.pushTop with Array.prototype.push.apply in $:/core/modules/filters.js
hacking $tw.utils.parseStringArray to always allow duplicates

The result is that index.html is still built the same (allowing for those mods to the code), and still seems to work OK in the browser.

So, to my intense surprise, I'm beginning to think that this might be a workable solution, despite the fact that it is blatantly not backwards compatible...

inmysocks commented 5 years ago

I think that some sort of filter prefix may be a better option than something that is not backwards compatible. Perhaps something like the +, - and ~ prefixes could work? I don't know much much sense that makes conceptually but from what I can see in the code it should be possible.

If that does work we could just modify $tw.utils.pushTop to have an optional input to allow duplicates and add a new case to the switch statement for filter prefixes in filters.js that calls $tw.utils.pushTop with the input to allow duplicates in the results.

I feel like I am missing something that would prevent this from working but I have no idea what. I will try a quick experiment in the morning.

Jermolene commented 5 years ago

Hi @inmysocks

I think that some sort of filter prefix may be a better option than something that is not backwards compatible. Perhaps something like the +, - and ~ prefixes could work? I don't know much much sense that makes conceptually but from what I can see in the code it should be possible.

Something like the +, - and ~ prefixes was my first thought too. But the trouble is that to fix an example like 1 1 1 1 +[sum[]] we'd need a prefix before each of the digits: :1 :1 :1 :1 +[sum[]] is pretty awful regardless of which character we choose. We could use a variant of the enlist operator that allows duplicates: [enlist-dupe[1 1 1 1]] +[sum[]] but that is prety nasty too.

If that does work we could just modify $tw.utils.pushTop to have an optional input to allow duplicates and add a new case to the switch statement for filter prefixes in filters.js that calls $tw.utils.pushTop with the input to allow duplicates in the results.

I feel like I am missing something that would prevent this from working but I have no idea what. I will try a quick experiment in the morning.

I'm still kind of in shock from the conclusions of my experiments above. But, provisionally, it seems as though the core may in fact never rely on the current duplicate removal behaviour, and that we'll struggle to find real world examples outside the core that do either.

I'm working on some changes to add a global switch for whether duplicates are allowed or not (the flag can't be stored in a tiddler because we need to configure the duplicates setting before we read any tiddlers). I'm thinking that we could put out a prerelease with the change for people to test, with the ability to flip the switch and rever to the old way if they need to.

The next step is to add a mode whereby every filter is evaluated twice: once with duplicates and once without. Then we can dynamically detect any instances where different results are being shown.

There are quite a few steps before we'll be in a position to be confident about this proposal but I'm pretty optimistic so far.

inmysocks commented 5 years ago

I am not sure if we need to go as extreme as you are suggesting. While making my test I ran into the same problem you are describing with the prefixes so instead I tired adding a prefix to the full filter. I think that works much better conceptually because 'allow duplicates' is conceptually nothing like the other prefixes.

So far I have done very minimal testing but I will upload a demo wiki and start a pull request so people can see the code changes.

I agree that there is quite a lot to do before we can confidently say that this works, but if it can work I am pretty confident that we can do it without too many drastic changes.

The demo wiki is here: https://ooktech.com/jed/ExampleWikis/FilterDuplicates/

I am working on the pull request showing the code changes is #3759

xcazin commented 5 years ago

Hi @Jermolene, @inmysocks,

Following up on @diego898's remark above: what about a class of operators that took a serialised tiddler list (a string, that is) as its input, instead of the normal deserialised tiddler list. Take for instance the following filter:

[tag[Reference]count[]] [[4 12]] +[sum[]]

When I test it on http://tiddlywiki.com/prerelease, the input of the sum operator is the tiddler list 12 4 12 so the filter (wrongly) outputs 16. if sum was expecting a string instead, its input would have been the equivalent of "12 4 12". It would look the same for the user, but if sum and operators of its class could always trigger a hidden operation —such as the enlist-dupe above— before their actual operation, the output would be the one expected.

I guess that this would mean giving up greedy deduplication before applying operators, but since the other options also seems relatively costly, I thought I'd give my 2cts.

AnthonyMuscio commented 5 years ago

Just a couple of ideas

Could we have a filter like the subfilter in which de-duplication does not occur?
Could we have an alternate to filter= that did not remove duplicates as this may prove useful in other cases.

[Edited for Code]

Working title "serialfilter"

filter="[serialfilter[1 1 1 2]sum[]]" or
<$set name=varname serialfilter="1 1 1 2 +[sum[]]"

Thus
filter="[serialfilter<variable>sum[]]" filter="[serialfilter[$param$]sum[]]" filter="[serialfilter{!!fieldname}sum[]]" filter="[serialfilter{tiddlername}sum[]]"

In both cases it could throw a flag that stops the removal of duplicates in other filter steps Using the serialfilter operator could operate no-dups in the current "Run" only thus allowing it to lead or follow regular filters.

Regards Tony

inmysocks commented 5 years ago

Anthony,

That may work, but I think that the single operator may be difficult to implement based on how the filtering sequence works, each operator has an accumulator that removes duplicates and with the way filters work there isn't any messaging between operators other than the output of the previous operator.

I think that the filter operator idea where it only affects a single run would be far more confusing than helpful, because it would only allow duplicates in that one operator, anything after it wold remove duplicates again and anything before it would not output duplicates. Unless there is a rather large rework of the filtering operation I think that duplicates are an all or nothing thing for each filter.

Having a different filter type would work, and that is not a bad idea for distinguishing between them. I think that regardless of how that part is implemented it would, on the back end, be an alias of filter with some flag set on the filter similar to the prefix used in the demo I made, once I get that to work in all the different contexts.

An alias should be pretty simple to implement, but we would have to make sure that it is added to everywhere you can put in a filter. If it is a prefix or suffix flag than it should work everywhere filters are used. I like the suggestion by @BurningTreeC on the pull request (#3759 ) of using * or + at the end of the filter as the flag to allow duplicates because that sort of follows the logic from regular expressions and separates it from the +, - and ~ prefixes already used in filters.

AnthonyMuscio commented 5 years ago

How is this going, I hope mathematics will not miss the 20 version?

AnthonyMuscio commented 5 years ago

Some thoughts on this if I may

[Edited] additional examples of sets

I was wondering if in Diegos example of "[title[1 1 1 1]split[ ]sum[]]" the the "[title[1 1 1 1]split[ ]" could be given and alias eg "[#[1 1 1 1]sum[]" ie "#" equals "[title[1 1 1 1]split[ ]]" the # meaning number or strict set, and the contents are not deduped.

As current (I do not know how to divide yet)

<$set name=set value="1 2 3 4 5 6 1 1">
<$set name=count filter="[title<set>split[ ]count[]]">
<$set name=total filter="[title<set>split[ ]sum[]]">
<$set name=mean filter="[<count>?<total>]">
Set: <<set>><br>
Count: <<count>><br>
Total: <<total>><br>
Mean: <<mean>><br>
</$set></$set></$set></$set>

Would read

<$set name=set value="1 2 3 4 5 6 1 1">
<$set name=count filter="[#<set>count[]]">
<$set name=total filter="[#<set>sum[]]">
<$set name=mean filter="[<count>?<total>]">
Set: <<set>><br>
Count: <<count>><br>
Total: <<total>><br>
Mean: <<mean>><br>
</$set></$set></$set></$set>

And an operator that sums the first number(s) and divides them by the count would provide a mean and other functions.

In my examples you can see I like forcing a set of <$set widgets in maths rather than trying to do it all in a single filter however, what if we had

"[#<set>sum[]] [#<set>count[]] +[dividebylast[]]" the question is if the sum and count are the same do we get one number? In which case the first divided by the last will be 10/10, n/n is always one and the average is one anyway. "+[multiplybylast[]]" may also be valid? but product may do this already.

Perhaps we could alter math filter operators to always operate on a set. In someways this overcomes the confusion between titles and numbers and the handling of duplicates. Without operators like "divideby last" users may be forced to break maths into separate values but this will be easier to read for non programers.

Of course a set can have 0, 1 or many members.

[#sum<set>]
[#sum[1 1 2 3 1]]
[#sum{tiddlername}]
[#sum{!!tiddlerfield}]
[#sum<subfilter>]
[#product<set>]
[#count<set>]
[#min<set>]
[#mean<set>] sum/count
[#set<set>] select every member of a set (inc titles) without removing duplicates
etc...

Additional examples
[#sum[<var1> <var2> <var3>]
[#sum[{!!field1} {!!field2} {!!field3}]
[#sum[$parm1$ $parm2$ $parm3$]
[#sum[$parm1$  {!!field2} <var3> <__parm4__>]

Note since # is an illegal field name there is a reduced impact, it also highlights a special kind of filter that does not dedupe when operating, however this behaviour is limited to the operation in question. Such they can be incorporated into standard filters.

We could also extend the <$set and <$vars to provide another parameter number= such that (redundant example used to illustrate)

<$set name=result numbervalue="[#sum<set>] [#count<set>] [dividebylast[]">
<<result>>
</$set>

Such that nothing in the numbervalue "filter" is deduped.

Any way Food for thought

AnthonyMuscio commented 5 years ago

PS,

Another advantage of using <$set rather than long filters is they can be nested, a value that can be computed at the beginning can be available for the rest of the set widget, and others that are the result of prior values computed in the innermost nests. This should encourage efficient and logical number handling.

Jermolene commented 5 years ago

I'm still inclined to think that the best approach is as per my comment above: we should remove the de-duplication functionality from the filter mechanism.

inmysocks commented 5 years ago

@Jermolene I can't come up with any reason that would be a bad decision. If we do remove the uniqueness in filter outputs than I think that we should have either an operator or a flag to only keep unique titles. There are some very efficient methods to do this so I don't think it would be a significant performance problem.

I think that an operator is a better way to go so that you can have it in one filter run and not the next if you want that for some reason.

Could you make a pull request? I want to see if I can break it.

Jermolene commented 5 years ago

Hi @inmysocks my plan is to make a prerelease later this week that by default does not discard duplicates. I'm trying to figure out a simple mechanism for users to switch it back and forth while we test and evaluate (the challenge is that we need this setting before we start to read tiddlers so we can't represent the setting within a tiddler...)

Jermolene commented 5 years ago

I made a prototype version of TW5 that doesn't remove duplicates from filters/lists, and it works pretty well so far -- see #3790 or try it out at http://allow-filter-duplicates.tiddlyspot.com/#dupes

AnthonyMuscio commented 5 years ago

Jeremy,

I Just did a little test, and as expected

<$list filter="[prefix[M]] [tag[Features]] +[sort[]]">

</$list>

Produces duplicates, I would have though a lot of existing filters were written on the assumption that duplicates would be removed, in effect "space separated title/filters" join two or more sets. Some cases would also use this to test for null cases after excluding one or more titles;

filter="[prefix[M]] [tag[Features]] -[[Modals]] +[sort[]]"

In the above case we always end up with Modals Listed once, when the previous behaviour would not list the Modal title at all.

Then we see filter="[prefix[M]] [tag[Features]] -[[Modals]] -[[Modals]] +[sort[]]" Removes all 2 occurrences of Modals, but the filter now needs to be coded to remove N "Modals", a number we may not know.

Not only does this appear to be not backwardly compatible but by the nature of such code it would be virtually impossible to search the code to establish where a reliance on duplicate removal has being made. This would make a lot of shared code invalid. The current method is in someway part of TiddlyWiki's 4th Generation programming language "style" which allows lists of unique tiddlers to be managed automatically as sets. To me it seems retrograde. It is a fact titles are unique, so the tools to manage this should reflect this.

I suppose you have considered these but I am curious how you think we can accommodate this?

I return to my previous suggestion, can we do something to indicate inside a given run that duplicates will not be run. An example may be;

filter="(1 2 2 3 3 4 2)sum[]" or "(1 2 2 {!!num2} 3 4 2 +[sum[]] )"

Could this not be accomplished using an equivalent of "[1/2/2/3/3/4/2]split[/]]" with a delimiter chosen that is not in the content.

I see value in being able to ignore or apply duplicates in different parts of the same filter.

Regards Tony

AnthonyMuscio commented 5 years ago

Minor edit done

Jermolene commented 5 years ago

Hi @AnthonyMuscio sorry I missed your comment. You are correct in identifying the issues. I'm currently in favour of adding a new mechanism for explicitly removing duplicates when that behaviour is needed. But it's important to note that in some situations where we use filters we can remove the duplicates "outside" the filter. For example, we use the filter in $:/DefaultTiddlers to generate the initial story river. It would be reasonable there to automatically remove duplicates as we apply the filter results.

00SS commented 5 years ago

These new operators that need duplicates are no longer working with unique titles.

Thus I would have thought that instead of a new mechanism for explicitly removing duplicates, any new mechanism would be for explicitly allowing duplicates.

AnthonyMuscio commented 5 years ago

Just to keep this essential issue fresh I have being testing the mathematics operators (which are not listed in the Filter Operators yet).

Using $set filter= and $vars I created the following test code

This suggests a simple method by which to accommodate values we do not want de-duplicated.

Feedback desired.

<$vars
input="a variable" 
n="-33.55"
x="33.55"
y="-6"
nset="1 2 3 1"
>
<$set name=result1 filter="[[testTitle]] +[uppercase[]]">
<$set name=result2 filter="[[testTitle]] +[lowercase[]]">
<$set name=result3 filter="[[testTitle]lowercase[]]">
<$set name=result4 filter="[<input>uppercase[]]">
<$set name=result5 filter="[<n>abs[]]">
<$set name=result6 filter="[<n>abs[]round[]]">
<$set name=result7 filter="[<n>abs[]trunc[]]">
<$set name=result8 filter="[<x>sign[]]">
<$set name=result9 filter="[<y>add[1]]">
<$set name=result10 filter="[<nset>split[ ]add[2]]">
<$set name=result11 filter="[<nset>add[2]]">
<$set name=result12 filter="[[1 2 3 1]split[ ]add[2]]">
<$set name=result13 filter="[get[num]add[2]]">
<$set name=result14 filter="[get[num]split[ ]add[2]]">
 result1: <<result1>><br>
 result2: <<result2>><br>
 result3: <<result3>><br>
 result4: <<result4>><br>
 result5: <<result5>><br>
 result6: <<result6>><br>
 result7: <<result7>><br>
 result8: <<result8>><br>
 result9: <<result9>><br>
 result10: <<result10>> With "split[ ]" works<br> 
 result11: <<result11>> Without "split[ ]" fails<br>
 result12: <<result12>> <br>
 result13: <<result13>> With "split[ ]" works<br> <br>
 result14: <<result14>> Without "split[ ]" works<br> <br>
</$set></$set></$set></$set></$set></$set></$set></$set></$set></$set>
</$set></$set></$set></$set>
</$vars>

Notice how with nset="1 2 3 1" or there being three fields when num - 4 8 4 respectively The following filters return the desired result

[['1 2 3 1']split[ ]add[2]]
[<nset>split[ ]add[2]]
[[1 2 3 1]split[ ]add[2]]
[get[num]add[2]]
[get[num]split[ ]add[2]]

Notably this fails

[<nset>add[2]]

One could say "maths calculations require an input of space separated values", and that the normal behaviour for titles is to remove duplicates.

So I ask you. Could it be said that "split[ ]" ensures the list prior to it is not de-duplicated and is passed to the next operator?

The examples above are valid "work arounds" WITH NO CODING CHANGES which requires the use of split[] in some cases, and if used in other cases still works.

This use of split may not be straight forward so I propose an alternative notation.

What if we provided an alias to split[ ] such as the Pipe symbol "|" that represents passing the output along as a space separated list without de-duplication? Of course this means something differently elsewhere so perhaps we could use "#" to represent space separated list, and its use in handling it as a list of numbers. When in fact all it is is split[ ] (with a space}

Thus the previous would read

[['1 2 3 1']|add[2]]
[<nset>|add[2]]
[[1 2 3 1]|add[2]]
[get[num]|add[2]] Or
[['1 2 3 1']#add[2]]
[<nset>|#add[2]]
[[1 2 3 1]#add[2]]
[get[num]#add[2]]

I know I have not explored all permutations but it seems to me we may already be 90% of the way.

Regards Tony

AnthonyMuscio commented 5 years ago

Ps,

I note using the filter [get[num]add[2]] will generate unexpected results if the num field contains more than one value, using [get[num]split[]add[2]] seems to result in the use of the first number in the num field only.

AnthonyMuscio commented 5 years ago

If all maths operations are available only as filters it would be good if the $vars widget could be extended to also permit a filter to set the value, even if emptyValue and other options are not made available in the $vars widget.

This would allow the constants and filter based calculations to be defined in a single vars widget a bit like a initialisation step in a calculation without needing the corresponding number of closing </$set> statements.

Regards Tony

AnthonyMuscio commented 5 years ago

Bump - I imagine resolution of this issue has a major impact on the 5.1.20 release and the introduction of the mathematics operators. Has a strategy being determined yet? I know I am somewhat under-qualified however I believe my own suggestions here have some merit in keeping such a resolution fairly simple namely an alias for split[].

Regards Tony

AnthonyMuscio commented 5 years ago

Another small discovery

I tested this,

;Numbers
<$list filter="[get[num]]">

</$list>

Sum: <$list filter="[get[num]sum[]]">

</$list>

It is fine with duplicate values in the "num" field.

These filters can also be placed in {{{ [get[num]sum[]] }}}or "set filter="

Perhaps for maths we should have

{{{ [get[num]sum[]] ~[[0]] }}} for the null case

SO this also worked

<$set name=count value={{{ [get[num]count[]] }}}> Average: {{{ [get[num]sum[]] ~[[0]] +[divide] }}} </$set>

Regards Tony

AnthonyMuscio commented 5 years ago

I suppose what I am trying to say here is with some good documentation perhaps we need not worry too much about duplicate titles and maths operators.

eg +special: {{{ [get[num]sum[]] ~[[0]] +[divide<count>] +[divide[5]] }}} works

Is anyone listening ?

AnthonyMuscio commented 5 years ago

Maths usually involves precedence and parenthesis. This seems to me clearly incompatible with the way filter runs work, and the title de-duplication process.

I am coming to believe that using a pair of data/maths operations, where the data is not de-duplicated is sufficient for using maths operators. If we can facilitate the use of multiple data and maths operator pairs in filters we will have all we need.

Regards Tony

Jermolene commented 5 years ago

Thanks @AnthonyMuscio. I'm going ahead with a couple of the changes discussed above:

Adding a new = prefix for filter runs that merges the run without de-duplication - 37bb75f0c
Adding a new suffix dedupe or raw for the [enlist[]] operator - 18fe112da

Jermolene commented 5 years ago

Edited message above to add commit details

twMat commented 5 years ago

Adding a new = prefix for filter runs that merges the run without de-duplication

@Jermolene , how should the following arbitrarylist (with potentially unknown content) give the output one! one! one!

<$set name=arbitrarylist value="one one one">
{{{ [enlist<arbitrarylist>addsuffix[!]] }}}
</$set>

Jermolene commented 5 years ago

Hi @twMat you can use the new raw suffice for the enlist operator:

<$set name=arbitrarylist value="one one one">
{{{ [enlist:raw<arbitrarylist>addsuffix[!]] }}}
</$set>

inmysocks commented 3 years ago

I believe this has been addressed

TiddlyWiki / TiddlyWiki5

Prerelease math filter operators don't allow duplicate inputs #3757