bodil / im-rs

Assorted immutable collection datatypes for Rust
http://immutable.rs/
Mozilla Public License 2.0

The description of „amortized“ in docs is misleading #83

Open vorner opened 5 years ago

vorner commented 5 years ago

Hello

From the docs in version 12.3:

> O(1)* means 'amortised O(1),' which means that an operation usually runs in constant time but will occasionally be more expensive: for instance, Vector::push_back, if called in sequence, will be O(1) most of the time but every 64th time it will be O(log n), as it fills up its tail chunk and needs to insert it into the tree. Please note that the O(1) with the asterisk attached is not a common notation; it's just a convention I've used in these docs to save myself from having to type 'amortised' everywhere.

This is AFAIK wrong. Amortized means that while one operation may cost more than the advertised complexity, n operations in a sequence must be bounded by n × the advertised complexity. For example, the classic growable array (C++'s std::vector, Rust's Vec) has amortized O(1) push ‒ while a single push can take O(n) to reallocate and grow to twice the size, a reallocation at size n happens only once per ~n pushes, so n pushes cost about 2n operations in total, ~2 per push ‒ the trick there is that the bigger the reallocations are, the rarer they happen.
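A minimal sketch of that bookkeeping ‒ a toy model counting element copies under a doubling strategy, not the real allocator:

```rust
// Toy model: count the element copies a doubling growable array
// performs over n pushes, illustrating the amortised-O(1) argument above.
fn copies_for_pushes(n: usize) -> usize {
    let mut cap = 1;
    let mut len = 0;
    let mut copies = 0;
    for _ in 0..n {
        if len == cap {
            copies += len; // reallocation: every existing element moves
            cap *= 2;
        }
        len += 1;
        copies += 1; // writing the new element itself
    }
    copies
}

fn main() {
    // The per-push average stays around 2 no matter how large n gets.
    for &n in &[64, 1_024, 1_000_000] {
        let c = copies_for_pushes(n);
        println!("n = {:>9}: {} copies, {:.2} per push", n, c, c as f64 / n as f64);
    }
}
```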

On the other hand, paying O(log n) every constant number of operations still amortizes to O(log n), not O(1).
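Spelled out for n push_back calls under the docs' description (a rough tally, treating the hidden constants as fixed):

```latex
T(n) \approx \tfrac{63}{64}\, n \cdot O(1) + \tfrac{n}{64} \cdot O(\log n) = O(n \log n)
\quad\Longrightarrow\quad
\tfrac{T(n)}{n} = O(\log n)\ \text{amortised, not } O(1)
```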

ddouglas87 commented 5 years ago

I believe Scala's Vector type is an RRB-tree. Looking at their docs, this is how they describe it:

> The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.

So it's effectively constant time, not amortized constant time.

source: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html

vorner commented 5 years ago

I don't know what algorithm lives inside the Scala implementation, whether the docs are correct, or what exactly 'effectively' means there.

Nevertheless, if it is true, as in the example, that push_back is O(1) most of the time but every 64th operation is O(log n), then in every practical sense I can see, push_back is O(log n) ‒ worst case, average case, amortized… if you use it a lot in an algorithm, the (log n)/64 term will just dominate.
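A toy cost model makes that visible ‒ the unit costs below are assumptions lifted from the docs' description (1 unit per ordinary push, log2(len) units every 64th push), not measurements of im's Vector:

```rust
// Average per-push cost if every 64th of n pushes costs an extra
// log2(len) units on top of the 1-unit base cost.
fn average_cost(n: usize) -> f64 {
    let mut total = 0.0;
    for len in 1..=n {
        total += 1.0;
        if len % 64 == 0 {
            total += (len as f64).log2();
        }
    }
    total / n as f64
}

fn main() {
    // The average keeps growing with n ‒ roughly 1 + log2(n)/64 ‒
    // which is O(log n), not O(1), however gentle the slope.
    for &n in &[1_000usize, 1_000_000, 100_000_000] {
        println!("n = {:>11}: {:.3} units per push", n, average_cost(n));
    }
}
```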

ddouglas87 commented 5 years ago

I don't know how accurate this is but: https://stackoverflow.com/questions/20612729/how-does-scalas-vector-work

It appears, on the surface, that im-rs has a branching factor (width) of 64 and Scala's Vector type has a width of 32, making them nearly the same data structure despite the difference in name.

If my understanding is correct, im-rs is not amortized constant time, it's 'effectively constant' time, which does have its own classification. A hash table has an effective insert time of O(1), for example, not an amortized one. A C++ vector (dynamic array) has an amortized insert time of O(1).

edit: std::deque is a similar data type to an RRB-tree ‒ it's more like an RB-tree (radix balanced, not relaxed, I believe). (For more information: https://en.cppreference.com/w/cpp/container/deque) On 64-bit hardware std::deque defaults to a width of 64, like im-rs.

Looking up std::deque's big-O, most sites list it as O(1), not mentioning 'effectively' ‒ just outright O(1). But they also say O(1) for unordered_map (hash table). ¯\\_(ツ)_/¯

vorner commented 5 years ago

I believe we are each talking about a different thing.

The linked Stack Overflow answer argues that it is a tree with branching factor 32 (64 in our case), and that you can fit only so many elements into RAM anyway (let's say 2^45 ‒ 48 bits is the best CPUs can do in terms of virtual addresses, they discard the rest, and you need at least 8 bytes per pointer; let's not get into what happens with ZSTs), so you get a maximum depth of the tree (which would be 8 here). Such an argument is kinda slippery, because by the same token you could say that bubblesort is O(1): there are only so many elements you can fit into an ordinary vector to sort, so there's an upper limit on how long it runs. On the other hand, that limit of 8 is reasonably low, so from a practical point of view claiming O(1) is probably fine, which is what mostly everyone does, even if every computational-complexity professor must grind their teeth at it.
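Spelling out the arithmetic behind that depth bound, under the same assumptions (48-bit virtual addresses, 8-byte pointers, branching factor 64):

```latex
\frac{2^{48}\ \text{bytes of address space}}{2^{3}\ \text{bytes per pointer}} = 2^{45}\ \text{elements},
\qquad
\left\lceil \log_{64} 2^{45} \right\rceil = \left\lceil \tfrac{45}{6} \right\rceil = 8\ \text{levels}
```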

However, that is not what I was pointing out. I was pointing out that these two claims from the docs can't both be true at the same time: that push_back is amortised O(1), and that every 64th call takes O(log n).

These two claims just plainly, directly contradict each other. If you want to make the above argument about the limited depth of the tree, then even the every-64th push takes O(1) ‒ but then you have to build the argument from the bottom up.

That's what I'm trying to point out: having these two claims next to each other is confusing.