Multiset semantics for Coins and DecCoins

JimLarson commented 2 years ago

Summary

Cleans up potentially dangerous inconsistencies in Coins and DecCoins by establishing a multiset model for semantics and a normalization policy.

Problem Definition

Coins (and DecCoins) are sometimes vague and surprising in their semantics.

IsEqual() returns false when comparing Coins of different lengths, but panics when it encounters Coins of the same length but having a difference in denoms. This appears to be intentional, as there are unit tests for the behavior. There doesn't seem to be a sensible use case for these semantics.
The call a.IsAnyGT(b) sounds like it ought to be equivalent to !b.IsAllLTE(a), but it's something quite different, and deliberately so. For instance, the doc comment says {2A, 3B}.IsAnyGT{5C} = false, despite the fact that the 2 of denom A is greater than the zero amount of A in {5C}. Similarly for IsAnyGTE().
There is inconsistency in canonicalization of Coins values.
- Negative quantities are forbidden by Validate() but returned by SafeSub(). Add() promises not to return any negative quantities but has no checks to enforce this.
- Some methods explicitly handle non-valid values, others panic or return wrong results. For the empty Coins value, NewCoins() uses the empty slice, but Add() and Sub() use nil. Most uses are happy with either, but some tests require one specifically.
There's much hand-wringing about the existence of zero-quantity Coin values but the non-existence of zero quanties in the representation of Coins values. There's question of the validity of an empty Coins value. See #7046 and its references.

This all seems to stem from confusion of the implementation of Coins with the semantics it provides, the lack of clear definition of those semantics, and the absence of a policy of normalization.

Proposal

Multiset model

Each Coins value should be abstractly thought of as a collection of coins of all possible denominations. This makes Coins a multiset, which suggests some operations and their proper behavior.

A multiset has an obvious partial order based on containment: A <= B if and only if for all denoms D, A.AmountOf(D) <= B.AmountOf(D). Note that this is a partial order, so it's possible for neither A <= B nor B <= A.

The partial order gives rise to well-known mathematical structure. See #10995 which adds Min() and Max() operations, which are indispensible in various vesting account calculations.

If the set of denoms was stable, the SDK could have defined Coins as a fixed-length vector of Int, with each denom getting a well-known offset. However, since we expect a large and dynamic set of denoms, with most Coins values having only a few of these nonzero, then the current "sparse" representation makes more sense. But the semantics should be the same for both.

This interpretation settles the hand-wringing over the possibility of zero-quantity denoms in Coin but not in Coins. It confuses the sparse implementation of Coins with the data it provides: you see the zero if you ask with AmountOf().

Normalization policy

The Coins representation as an array of (denom, amount) pairs makes sense, and allows direct representation in proto-derived structs. But it also allows representation of some values that do not match up with the abstract model of Coins:

invalid denoms
duplicate denoms
unsorted entries
zero-quantity entries
negative-quantity entries

When the representation is created with the NewCoins() constructor, all of these are checked for, but protos generate the representation directly, as do some tests and an unknown amount of user code.

Nothing can be done about invalid denoms or negative quantities. Sort() will handle order, and a trip through NewCoins() will handle both order and zero quantities. Nothing fixes duplicates at the moment, but we could potentially consolidate them by adding the quantities together. It's not clear whether reforming such content would fix bugs or merely obscure them.

Methods vary in their support for invalid representations. IsZero() seems to exist (vs Empty()) only for zero quantities. IsEqual() sorts its inputs just to be sure, while AmountOf() silently relies on its inputs being ordered. Application code can't tell what to expect without reading each implementation. This is a mess.

The "highbrow" approach to regularizing this would be to create a ValidCoins type with all of the operations, leaving Coins as the type of the proto field, having only a method Coins.Validate() (ValidCoins, err) to canonicalize the coins.

A more middle-brow approach would be to simply let most methods assume valid input, but require all to produce valid output. Application logic should explcitly normalize values coming from "outside", e.g. transactions, but can consider the store to only hold valid values.

DecCoin

DecCoins is mostly a clone of Coins with Dec instead of Int for quantities, but similarly requires positive quantities for valid values. Thanks to Golang's impoverished type system, we need to duplicate the implementation.

All of the above concerns and suggestions seem to apply.

Specific Changes

Let me sum this all up in the following specific proposals.

Fix IsEqual() to not panic. Though technically an incompatible change, it's hard to imagine anyone relying on the current behavior.
Rewrite IsAnyGT{E}() in terms of !IsAllLT{E}() and change to multiset semantics. Again, an incompatible change, but I think it's likely that the usages actually expect the multiset behavior.
Deprecate or remove DenomsSubsetOf() since the very question doesn't make sense in multiset semantics.
SafeSub should return empty coins if hasNeg is true, to prevent negative-quantity Coins from leaking out. A quick review of the code suggests that the returned coins value is never used if the hasNeg field is set. (Except for one case in a test, can probably be rewritten.)
All methods should handle nil and empty slice the same way. E.g. always test with len(coins) == 0 instead of coins == nil. Code (including tests) should not try to distinguish between the two.
Nevertheless, pick nil as the canonical representation of the empty Coins value since it's the "zero value" dictated by the Go language.
Application code and tests should not create coins as arrays but should instead use NewCoins.
Expect validation / canoncicalization at the boundary (via NewCoins(dirtyCoinArray...) or an equivalent alias) and don't do checks or conversions in other methods. Possibly have a transitional period where all methods check for invalid input and panic.
Equivalent changes in DecCoins.

For Admin Use

[ ] Not duplicate issue
[ ] Appropriate labels applied
[ ] Appropriate contributors tagged
[ ] Contributor assigned/self-assigned

aaronc commented 2 years ago

I don't think multiset is the right abstraction. It is really a map-like structure constructed with an array. You wouldn't model 1000atoms as 1000 atom instances in a set which is what multiset semantics would imply.

An appropriate data structure for any non trivial operations would be an ordered tree IMHO.

With generics we can probably reuse the same logic for Int and Dec instances.

JimLarson commented 2 years ago

Multiset semantics don't commit you to a particular representation. A multiset is sometimes written as a set with repeating elements, but it's also written as a non-repeating set where each element has a nonzero numeric multiplicity attached - which is pretty much what current Coins are, if we can be clear about normalization.

I'm contrasting "multiset semantics" with other interpretations of an array of Coin values. In the discussion of #11200, several commenters believe that there is or ought to be a behavioral distinction between {{"denom1", 7}, {"denom2", 0}} and {{"denom1", 7}}. I think it's possible to build a consistent semantics around this notion - but I can't figure out a use case for it. The current Coins implementation seems to be 95% aligned with multiset semantics, and this issue is just about settling the remaining 5%.

If we can consistently normalize to sorted, non-duplicated denoms, arrays seem simpler and more performant than trees - both have logarithmic lookup, etc. But as long as we settle the semantics, I truly don't much care about the representation.

I'm curious about whether Go generics will be any easier than simply duplicating the code. In the early days of Go, I talked to Rob Pike about the lack of generics. He had spoken with the Java folks who told him about the immense pain of retrofitting generics onto a deployed language. Rob's takeaway was that generics should therefore be put off as long as possible. I would have thought that the lesson was to build them in from the beginning..

peterbourgon commented 2 years ago

Generics are orthogonal to this point.

Each Coins value should be abstractly thought of as a collection of coins of all possible denominations.

My understanding is that set membership is defined by element identity, not property. For example, a set of integers doesn't contain all possible integers, it contains only those specific integers which are in the set. Right? So it doesn't make sense to me that a Coins set would contain all possible denominations with a default "value" of zero. That's not a set! That's something else. You can of course define Coins, or any type, to have whatever semantics you want, and if this is the most useful model, then so be it. But AFAICT the thing you're describing here is not a set, or a multi-set, so it's confusing to use terminology from that domain.

JimLarson commented 2 years ago

@peterbourgon please have a closer look a the Wikipedia entry for multiset, which describes that a multiset over a universe U of possible values can be equivalently thought of as:

a set-like structure that allows repeats of elements;
a multiplicity function mapping U to nonnegative integers;
a subset A of U and a mapping of A to positive integers.

There's a whole range of possible representations of these concepts in a programming language: arrays, hash maps, etc, and Coins is certainly one of these possible representations, optimized for having only a few types of denoms in the collection.

For your set-of-integers example, a set is completely defined by its indicator function f, which in this case would mapping from the entire set of integers to the set {0, 1}, where f(n) = 1 means n is in the set and f(n) = 0 means it's not in the set. So the indicator function, which represents the set, is indeed defined on all possible integers, and it returns the default value of zero for any integer that's not in your set. The multiplicity function above is a generalization of an indicator function, where the range is {0, 1, 2, 3, ...}. Coins directly exposes a multiplicity function - it's called AmountOf().

Now you certainly could define a collection of coins that makes a distinction between a zero-quantity entry and a missing entry - but that doesn't seem to be what Coins implements, and I can't think of a reasonable use case for such a type.

I really am sorry to drag set theory into this - it's just that the behavior of Coins like collections have been studied for centuries, have standard names, have well-known properties, etc.

peterbourgon commented 2 years ago

Given two sets S1={x:123} and S2={y:456 z:789} it seems that Min(S1, S2, x/y/z) would by the current definition each be 0. Is that right? That's nonintuitive to me, but maybe I'm showing my ignorance in saying so. In any case don't worry about me 👍

JimLarson commented 2 years ago

I'm definitely not going to overlook your discomfort with this, because you're demonstrating that my explanations need to be better!

Let me try to anchor this in a use case rather than set theory.

Suppose we have a rule at the bank that when you ask for a withdrawal of a given amount, and you don't have sufficient funds in your account, we give you as much money from your account as we can rather than refusing the request. Let's say you have four different accounts, each holding a single currency:

Account | Denom   | Balance     | Withdrawal | Withdrawal | Comment
Num     |         |             | Request    | Result     |
--------|---------|-------------|------------|------------|---------------------------
1       | dollars | 100 dollars | 20 dollars | 20 dollars | limited by size of request
2       | euros   | 35 euros    | 50 euros   | 35 euros   | limited by account balance
3       | pounds  | 0 pounds    | 99 pounds  | 0 pounds   | no balance
4       | francs  | 47 francs   | 0 francs   | 0 francs   | no withdrawal

In each case, the withdrawal result is the minimum of the balance and the request. Now what would happen if instead we had all the same money in a single account, and requested all the withdrawals in a single request. Should we get the same aggregate result? You bet we should!

Account balance: 100 dollars, 35 euros, 47 francs
Withdrawal request: 20 dollars, 50 euros, 99 pounds
Withdrawal result: 20 dollars, 35 euros

I contend that:

this is the right result (the same as aggregating the per-currency scenarios);
this is a sensible operation to perform on mixed currency values;
the right name for the operation is also the minimum;
it would be completely nonsensical for the withdrawal result to have 99 pounds (money for nothing!) or 47 francs (I didn't order that!) in it;
we don't need or want to list "0 pounds" in the account balance, as we don't list any of the hundreds of other currencies you don't have in your account.

Does that make sense?

peterbourgon commented 2 years ago

I understand and agree with most of your assertions about this example. But I don't think Min is the name for this operation. And I think it may boil down to this:

Each Coins value should be abstractly thought of as a collection of coins of all possible denominations.

I don't think I agree with this, because an element of a set with value of zero isn't the same as that element not being in the set. You can ask a Coins how many USD it has, and it can respond zero when it doesn't have a USD denom represented within itself, but that's not the same thing. edit: Also, the set of all possible denominations is unbounded and unknowable (in general)!

Min on a Coins is, to me, an operation which produces a union set of all represented denominations among all input sets, where each denomination's value is the lowest among all input values for that denomination.

{A:1    } Min {       } = {A:1}
{A:1 B:2} Min {B:3 C:3} = {A:1 B:2 C:3}
{A:1    } Min {B:2 C:3} = {A:1 B:2 C:3}

JimLarson commented 2 years ago

Since Coins has no way to say {A:0} other than by saying {} (see NewCoins(), Validate()), it sounds like you're asking for Coins values to have no way to say "I have zero of something". This would mean:

a feegrant periodic allowance can never run out of coins to spend before the period reset;
bank transactions must always input and output a nonzero amount, and the supply can never run out;
bank accounts can never have a zero balance;
validators must be created with a history of having delivered rewards;
the reward fee pool is never empty;
vesting accounts must always have some vested and some unvested coins;
vesting accounts must always be staking some vested and some unvested coins;

...and so on. So the idea that a missing denom in Coins doesn't mean zero is totally at odds with how the SDK works.

I can imagine a different type that allowed {A:0} and gave it a different meaning than {}. However:

that type isn't Coins (though they'd both serialize to repeated Coin);
I can't think of a use case for this type, and after asking a few times, I haven't heard one offered.

@aaronc The above discussion, along with @robert-zaremba 's comments in #11200, are why I think this issue is independent of the Pulsar work. Pulsar, as I understand it, will let you build many different kinds of data types, but does not decide for you what semantics your data type should have. This can be settled now in the existing Coins type, then ported to any new world of serialized types.

peterbourgon commented 2 years ago

I'm just arguing for a different name for the operation. Something like e.g. Least or LeastDenom or LeastDenomAmong would be more in line with my expectations. And, ideally, don't make it a method! A free function taking ... Coin would be even more clear about its semantics. Again, these are all just my suggestions, and I don't really matter that much :) Take them or leave them.

edit: related

ValarDragon commented 2 years ago

I'm very in favor of multiset semantics for all the comparison operators. I can never use the IsAnyGTE function black box, because I always want the guarantee that every denom is present in the argument. I legitimately think that for comparison functions, the only sensible thing to do is assume not having an element there means it has a value of 0.

For Min, and Max, I think they are better served by intersect and union as the names of the functions, purely because of the confusion listed in this issue.

I don't think there should be a function named Min or Max due to the ambiguity present atm.

robert-zaremba commented 2 years ago

I don't think there should be a function named Min or Max due to the ambiguity present atm.

Agree. But I don't know better and short name.

are better served by intersect and union as the names

This operations are not intersection nor union.

ValarDragon commented 2 years ago

They are intersection and the union respectively as is commonly defined on multisets? Whats the behavior you'd expect out of intersect / union, thats not being done in @JimLarson 's PR?

JimLarson commented 2 years ago

Guys, the naming was discussed at length in the PR, and ultimately it got multiple approvals from core developers and was committed. I and others have since been writing code using the chosen name and semantics. This progress doesn't prevent us from revisiting the issue if necessary, but it does raise the bar for further discussion. If you have an objection, I request that it be grounded in an actual use case, and with a proposal for something better.

As I mentioned above, I think the naming objections will subside if we can get some design clarity on Coins and a consistent implementation. Given this, I'd hope that anyone seeing Min() would expect it to mean the per-denom minimum, normalized into a valid Coins value.

I've got a work-in-progress overhaul of coins.go that is validating this approach. Rather than having each function of two Coins do its own double-iteration with its own quirks and mistakes, I wrote a single double-iterator which efficiently matches up the denoms and also checks for sorting, duplicates, and non-positives as it goes:

// zip takes two valid Coins values and calls function f with their amounts for each denom,
// except those denoms for which both a and b have a zero amount. The function f will be
// called in increasing denom order. Will stop the iteration if f returns true.
// Will panic if either a or b has bad order or quantity.
func zip(a, b Coins, f func(denom string, i, j Int) bool) { ... }

Every two-Coins method is more clearly written using zip(), for instance:

func (coins Coins) Add(coinsB ...Coin) Coins {
        sum := Coins{}
        zip(coins, coinsB, func(denom string, i, j Int) bool {
                sum = append(sum, NewCoin(denom, i.Add(j)))
                return false
        })
        return sum
}

For ordering, I wrote utilities lt() and lte() ("less than", "less than or equal") using zip(), then defined everything in terms of them:

func (a Coins) IsAllGT(b Coins) bool  { return b.lt(a) }
func (a Coins) IsAllLT(b Coins) bool  { return a.lt(b) }
func (a Coins) IsAllGTE(b Coins) bool { return b.lte(a) }
func (a Coins) IsAllLTE(b Coins) bool { return a.lte(b) }

func (a Coins) IsAnyGT(b Coins) bool  { return !a.lte(b) }
func (a Coins) IsAnyLT(b Coins) bool  { return !b.lte(a) }
func (a Coins) IsAnyGTE(b Coins) bool { return !a.lt(b) }
func (a Coins) IsAnyLTE(b Coins) bool { return !b.lt(a) }

This changes the current behavior for the IsAny..() methods, but I'm pretty sure that every current use of the IsAny...() is buggy. At the very least, it looks like a footgun.

I've got some other urgent priorities right now, but I'll try to get back to this soon. In the meantime, I'll probably submit a tiny change to keep IsEqual() from panicking on valid inputs. I've heard no use case for that, and every senior developer I've asked has done a facepalm when I describe the current behavior.

robert-zaremba commented 2 years ago

Whats the behavior you'd expect out of intersect / union, thats not being done in @JimLarson 's PR?

The Min and Max have a very good doc string. Min indeed is an intersection, but Max is not a Union.

{2A, 3B}.Max({1B, 4C}) == {2A, 3B, 4C}
{2A, 3B}.Union({1B, 4C}) == {2A, 4B, 4C}

JimLarson commented 2 years ago

What you described as Max is actually the definition of union for multisets. What you describe as Union is called the sum. Check out the multiset article.

I agree that the definition of union is surprising, which is another reason I prefer Min/Max names instead. In the zip-based reimplementation, these are very clearly the per-denom min/max, just as the Coins-level Add() above is very clearly the per-denom sum.

cosmos / cosmos-sdk