crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Array#shift is slow #5126

Closed benoist closed 3 years ago

benoist commented 7 years ago

I was using Array#shift a lot, but I found out it's pretty slow compared to using an index lookup.

This is the current implementation

  def shift
    if @size == 0
      yield
    else
      value = @buffer[0]
      @size -= 1
      @buffer.move_from(@buffer + 1, @size)
      (@buffer + @size).clear
      value
    end
  end

If I change this method to

  def shift
    if @size == 0
      yield
    else
      value   = @buffer[0]
      @size   -= 1
      @buffer += 1
      value
    end
  end

It is a lot faster. Shifting Array(String).new(100_000, "a") shows the following speed improvement:

                 user     system      total        real
shift        1.600000   0.000000   1.600000 (  1.602299)
each         0.000000   0.000000   0.000000 (  0.000487)
fast shift   0.000000   0.000000   0.000000 (  0.000530)

Question is, can it really be implemented like this, or is there an important reason the buffer move and clear are required for this operation? I've made the change in the current master branch and all tests pass.
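
The asymmetry reported here is the classic contiguous-array-versus-deque tradeoff: popping from the front of a contiguous buffer moves every remaining element, while a deque just advances a head pointer. A Python illustration of the same effect (illustrative only, not the Crystal code):

```python
from collections import deque

# Front-removal from a contiguous list shifts every remaining
# element left, so n pops from the front cost O(n^2) in total.
lst = list(range(10))
first = lst.pop(0)          # O(n): the 9 remaining items are moved

# A deque keeps a head pointer, so removing from the front is O(1).
dq = deque(range(10))
front = dq.popleft()        # O(1)

assert first == 0 and front == 0
assert list(dq) == lst      # both containers now hold 1..9
```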

jhass commented 7 years ago

Well, doesn't your implementation leak memory? That is, the array's allocation will never shrink again?

However, I agree we should only move after some threshold, that is, move if there's X free space in front of the buffer, and also check that when pushing.

In any case, check whether Deque doesn't fit your usage pattern better.

asterite commented 7 years ago

Duplicate of #573

asterite commented 7 years ago

We should probably document that Array#shift is O(n)

benoist commented 7 years ago

Well, Deque has the fast shift implementation, but I need other functions from Array first before shifting.

asterite commented 7 years ago

@benoist As to why your implementation doesn't work: when you later need to grow the array, you have lost the reference to the original pointer to invoke realloc on.

asterite commented 7 years ago

Alternatively we could store an offset inside the array, but maybe that's a big penalty for just this use case.

benoist commented 7 years ago

Converting to Deque before shifting has some overhead, of course, but it's already a 160x speedup:

                       user     system      total        real
shift              1.600000   0.000000   1.600000 (  1.606989)
each               0.000000   0.000000   0.000000 (  0.000606)
fast shift         0.000000   0.000000   0.000000 (  0.000649)
convert to deque   0.010000   0.000000   0.010000 (  0.001799)

asterite commented 7 years ago

In Ruby they do increase the base pointer... I don't know how they can later reallocate. So it's worth investigating how they do it; it might be worth doing the same in our case.

benoist commented 7 years ago

If I'm reading the Ruby implementation correctly, it seems like they are holding off the memmove until inserts, and they double the capacity upon the insert. That way you pay the move penalty only once, when you actually need to reallocate. If the array is shifted until empty, it never needs the reallocation. We just need to make sure it doesn't leak memory, as @jhass pointed out.
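
The strategy described above can be sketched as an array that keeps its allocation base plus an offset, bumps the offset on shift, and only compacts (moves the live elements back to the start) when a push runs out of room. This is a hypothetical Python model of the idea, not the actual Ruby or Crystal code:

```python
class OffsetArray:
    """Toy model: shift bumps an offset; push compacts lazily."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity   # the original allocation is retained,
        self.offset = 0                # so regrowing ('realloc') stays possible
        self.size = 0

    def push(self, value):
        if self.offset + self.size == len(self.buf):
            if self.offset > 0:
                # Compact: pay the memmove once, only when push needs room.
                self.buf[:self.size] = self.buf[self.offset:self.offset + self.size]
                self.offset = 0
            else:
                self.buf.extend([None] * len(self.buf))  # double capacity
        self.buf[self.offset + self.size] = value
        self.size += 1

    def shift(self):
        value = self.buf[self.offset]
        self.buf[self.offset] = None   # drop the reference (no leak)
        self.offset += 1
        self.size -= 1
        return value

a = OffsetArray()
for i in range(5):
    a.push(i)
assert a.shift() == 0 and a.shift() == 1
a.push(5)
assert [a.shift() for _ in range(4)] == [2, 3, 4, 5]
```

If the array is shifted until empty, the compaction branch never runs, matching the behavior described for Ruby above.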

asterite commented 7 years ago

But how do they retain the original pointer and the current pointer?

benoist commented 7 years ago

If the array is not shared when shift is called, it will call the ary_make_shared function. I think this makes a copy of the original pointer and freezes it. This pointer is used in ary_modify to determine the shift.

https://github.com/ruby/ruby/blob/trunk/array.c#L590

https://github.com/ruby/ruby/blob/trunk/array.c#L322

RX14 commented 7 years ago

Well, what functions do you need that Deque doesn't have? I would have thought that all Array functions could work on Deque once implemented. If Deque is the right datastructure to use here, then you should use it.

Another option is reverse! then pop to use just Array.

benoist commented 7 years ago

I'm currently using uniq and sort from Array, which might be a lot slower in Deque, as those functions are not just pushes and shifts, for which Deque is optimized. That's just an assumption though; I might be wrong here.

RX14 commented 7 years ago

Have you tried it? Benchmarked it? Don't work on assumptions.

You can always convert the array into a deque rather easily.

Or use reverse! and pop to get O(1) element removal from the ends in Array. It just matters which end you remove from.
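
The reverse-then-pop pattern works because removal from the back of a contiguous array is O(1): one O(n) reversal up front converts all subsequent front-removals into back-removals. A quick Python illustration of the idea (the same pattern applies to Crystal's Array):

```python
queue = ["a", "b", "c", "d"]
queue.reverse()                   # one O(n) pass up front

drained = []
while queue:
    drained.append(queue.pop())   # O(1) removal from the back

assert drained == ["a", "b", "c", "d"]
```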

oprypin commented 7 years ago

There's a bit of a problem: Deque.sort! does not exist, because sorting implementation is hardcoded in Array.

RX14 commented 7 years ago

@oprypin which is exactly why I asked @benoist what functions he needed from array that were not on deque...

benoist commented 7 years ago

@RX14 Yes, I know I can convert it to a Deque or use reverse and pop, but that's not really the issue. If we can make Array#shift a lot faster, isn't that worth investigating? It's not very intuitive to need all these workarounds because Array#shift is slow. The reason I'm making assumptions about the other parts now is that putting the effort into testing and verifying still leaves this problem unsolved.

If the general conclusion is that Array#shift is as fast as it's going to get for now, then I have no problem if this issue gets closed. :-)

oprypin commented 7 years ago

It's not very intuitive to do all these workarounds because Array#shift is slow.

Yeah, exactly. The non-workaround is Deque.

benoist commented 7 years ago

I would still consider that a workaround, but if thats just me, I can live with that :-)

RX14 commented 7 years ago

@benoist Using the correct datastructure for the job with the correct algorithmic complexity is a workaround?

benoist commented 7 years ago

If Deque is the only data structure that should do shifts, then shift should be removed from Array. But I don't think that should be the case. If the overall operations on the same data are faster within one data structure, then it makes no sense to change. I think Array#shift can be made faster.

  start = Time.now
  a = Array.new(100_000, "a")
  a.size.times do
    a.shift
  end
  puts Time.now - start

Crystal: 1.6069080
Ruby: 0.006131

lbguilherme commented 7 years ago

Ruby implements an optionally shared buffer for Array. Some Array instances own the memory they use, and will themselves reallocate and free it when needed. Others are shared: they keep a reference to an external reference-counted buffer and maintain an offset into it. A shared Array will never touch the buffer unless it is the only array referencing it. This also gives copy-on-write semantics: when a shared Array tries to modify some data, it first allocates its own buffer and copies everything into it. This is all done without any external impact, so the user of an Array cannot tell the difference, except through timing.
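
The mechanism described above can be modeled with a shared buffer plus an offset: shift only advances the offset, while the first write to a shared buffer forces a private copy. This is a toy Python sketch with hypothetical names, not the CRuby code:

```python
class COWArray:
    """Toy copy-on-write array: cheap shift, private copy on first write."""

    def __init__(self, items):
        self.buf = list(items)   # possibly shared with other instances
        self.offset = 0
        self.shared = False

    def share(self):
        clone = COWArray([])
        clone.buf = self.buf              # same backing buffer
        clone.offset = self.offset
        self.shared = clone.shared = True
        return clone

    def shift(self):
        value = self.buf[self.offset]
        self.offset += 1                  # O(1): no memmove, no copy
        return value

    def __setitem__(self, i, value):
        if self.shared:
            # First write to a shared buffer: make a private copy. This
            # is the hidden O(n) step that makes a[0] = "b" expensive.
            self.buf = self.buf[self.offset:]
            self.offset = 0
            self.shared = False
        self.buf[self.offset + i] = value

a = COWArray(["a", "a", "a"])
b = a.share()
assert a.shift() == "a"
a[0] = "b"                  # triggers the private copy
assert b.shift() == "a"     # b still sees the original data
```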

Here is some proof of it, taking @benoist's sample:

start = Time.now
a = Array.new(100_000, "a")
a.size.times do
  a[0] = "b"  # Modifying the array will force the array to own its own memory.
  a.shift
end
puts Time.now - start

Running on my computer: (I did not run crystal with release optimizations!)

Ruby without changing array:     0.007283187
Ruby changing array:             3.364405281
Crystal without changing array:  2.3464860
Crystal changing array:          2.3259410

This kind of optimization is nice because it makes some things faster, and you only pay certain costs when you actually need to. But it also brings inexplicable slowdowns in functions that shouldn't ever be slow. Who could tell that a[0] = "b" would take time proportional to the length of the array?


Refs: rb_ary_shift calls ary_make_shared if the Array is not yet shared. Everything that may modify the array calls rb_ary_modify, which makes shared Arrays unshared (by allocating memory and copying/moving). It all happens here: https://github.com/ruby/ruby/blob/trunk/array.c

benoist commented 7 years ago

@lbguilherme thank you for this explanation! Just to be sure, what would you suggest doing with the current Array#shift implementation: leave it as is, or change it into something similar to Ruby's?

lbguilherme commented 7 years ago

I'm not sure in which direction Crystal wants to go here. We could:

  1. Leave it as it is now. shift pops from the front of a container, and it is slow. Document it, so that it is the user's burden to adapt to another structure (like copying the data into a Deque first) or pay the cost if they know the array is small. The nice thing about this is that each operation has a clear cost that can be explained in documentation.

  2. Apply extensive optimizations in all sorts of places. This would have an impact on how data is stored and on pretty much all functions. The end result would be that everything is faster for the average user, but it makes the code more complicated to maintain and behavior less predictable. Why did writing to a single byte of memory cause my program to slow down 462 times? That is hard to explain.

Still on point 2, there are simpler optimizations, like keeping a buffer and an offset, and only reallocating when they differ too much. This would have much less impact than shared buffers, but still.

Just to be clear, there are possible optimizations for many other functions as well, not just shift. Taking a slice could benefit too, for example.

I don't dislike either solution. Maybe a speed-focused container could come as a shard, so not everybody would have to fear unpredictable performance. Or maybe it should be in the standard Array itself, so that everybody can benefit from the performance. I particularly like optimizing for the average user, even with surprising behavior.

I would like to hear from @asterite on this.

asterite commented 7 years ago

From a user's perspective, I always liked how Array, Hash and String are super generally optimized data structures in Ruby. They can share data with other instances, they are mutable and can be made immutable, they have generally fast operations, etc. Of course that comes at the cost of implementing all of that. Maybe in Ruby it makes more sense because it's a dynamic language and implementing other data structures is inefficient unless done in C. In most (all?) compiled languages you have different data structures; in Java, for example, you have ArrayList, LinkedList, ArrayDeque, etc. That's nice, but it's more cumbersome for the user, because she has to pick a data structure.

So... I don't know. If you need to shift a lot, maybe just use another data structure? If sort! is missing in Deque, maybe we should implement it? Maybe you can just use reverse! and then pop afterwards? It's hard to know without knowing what problem you are trying to solve, @benoist

Maybe adding an offset to Array is acceptable, I don't know. Maybe String could also share memory with parent strings to form views. I think all of that falls in the field of optimization, and right now that's not important, because it can always be done later without changing the external API.

RX14 commented 7 years ago

This discussion should happen later once optimizations like this come to the top of our agenda, instead of simply features and stability. At that time we'll have a lot more data on array performance in practice.

benoist commented 7 years ago

@asterite I've written a simple column-storage database with encoding, and to convert the columns back to rows I was shifting the values to form the rows. I'm using iterators because the column storage is not always aligned to values that belong to the same row, due to compression.

I'm keeping an offset now and iterate through the array instead of shifting. This allows me to use the sort function without an extra conversion to Deque.

ghost commented 7 years ago

What I find interesting is that I can read more about how ruby implements arrays here on crystal, than I can find on the ruby issue tracker or what not. Sorry for the distraction. ;-)

chris-huxtable commented 7 years ago

Regarding the discussion of tradeoffs: using a circular buffer like in Deque only incurs a slowdown when the Array has unshifted or shifted elements. The added overhead is a memcpy of a subset less than half the size of the whole, which is much faster than the realloc that would have to be done every time an element is unshifted or shifted in the current implementation.
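
A circular buffer maps logical index i to physical slot (head + i) % capacity, so removing from the front just advances a head index. A minimal Python sketch of the indexing (an assumed layout for illustration, not Crystal's actual Deque internals; growth handling is omitted):

```python
class Ring:
    """Minimal circular buffer: O(1) push at the tail, O(1) shift at the head."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0
        self.size = 0

    def push(self, value):                  # append at the tail
        self.buf[(self.head + self.size) % len(self.buf)] = value
        self.size += 1

    def shift(self):                        # remove from the head
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        return value

r = Ring(4)
for x in [1, 2, 3, 4]:
    r.push(x)
assert r.shift() == 1 and r.shift() == 2
r.push(5)        # wraps around into the freed physical slot 0
r.push(6)
assert [r.shift() for _ in range(4)] == [3, 4, 5, 6]
```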

For anyone interested, this is the implementation used by NSArray in Cocoa: http://ciechanowski.me/blog/2014/03/05/exposing-nsmutablearray/

asterite commented 7 years ago

There's also the fact that passing an Array to C is an O(1) operation right now. If we change Array to use a circular buffer, that's no longer true. I guess when passing it to C we'd have to rearrange its elements so that the first item is at the first position, and then we can pass a pointer. But once that's done there's no need to do it again, unless there's a subsequent shift. And since most array instances won't be passed to C, maybe that's a good compromise, because the shift penalty is only paid when you pass it to C.

Then the other penalty is having each Array instance be 4 bytes bigger. Maybe that's not a big issue. We could probably try to remove Deque and let Array be implemented like that. It will simplify the choice a user has to make when choosing a data structure, and Array will be efficient as a stack and as a queue (like in Ruby).

lbguilherme commented 7 years ago

Just a note: Array being a few bytes larger is probably not an issue. It could even be an optimization to make Array even larger (say 12 or 16 bytes more) to hold short arrays inline, as short arrays are common. This would need a benchmark, of course.
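
The inline-storage idea is the classic small-buffer optimization: keep a few element slots inside the array header itself and only heap-allocate once the array outgrows them. A hypothetical Python model of the mechanism (the inline/heap split is simulated; real implementations put the slots in the struct):

```python
class SmallArray:
    """Toy small-buffer optimization: up to INLINE elements live in a
    fixed inline store; longer arrays spill to a heap allocation."""

    INLINE = 4

    def __init__(self):
        self.inline = [None] * self.INLINE   # stands in for in-struct slots
        self.heap = None                     # stands in for a heap buffer
        self.size = 0

    def push(self, value):
        if self.heap is None and self.size == self.INLINE:
            self.heap = self.inline[:]       # spill: one-time copy
        if self.heap is None:
            self.inline[self.size] = value
        else:
            self.heap.append(value)
        self.size += 1

    def __getitem__(self, i):
        store = self.heap if self.heap is not None else self.inline
        return store[i]

a = SmallArray()
for i in range(6):
    a.push(i)
assert a.heap is not None          # spilled after the 4 inline slots filled
assert [a[i] for i in range(6)] == [0, 1, 2, 3, 4, 5]
```

Short arrays never touch the allocator at all, which is the benefit being weighed here against the larger header.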

funny-falcon commented 7 years ago

There's a bit of a problem: Deque.sort! does not exist, because sorting implementation is hardcoded in Array.

Couldn't you sort in reverse order? Then you could use pop. Or you could find the index of the minimum, swap it with the last element, and then pop. Or you could build a binary heap and pop the minimum/maximum as you go.

Do you really need sort! + shift, or it is just "quick and dirty way to remove smallest element"?
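
If the real goal is "repeatedly take the smallest element", the binary-heap suggestion above removes each minimum in O(log n) without ever sorting the whole array. Python's stdlib heapq shows the pattern (purely an illustration; Crystal's stdlib has no direct equivalent as far as I know):

```python
import heapq

values = [5, 1, 4, 2, 3]
heapq.heapify(values)               # O(n) heap construction

smallest_first = []
while values:
    smallest_first.append(heapq.heappop(values))  # O(log n) per removal

assert smallest_first == [1, 2, 3, 4, 5]
```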

benoist commented 7 years ago

Do you really need sort! + shift, or it is just "quick and dirty way to remove smallest element"?

Well, for my use case it was the most readable version of what it needs to do. I don't really care about actually removing the elements from the array; it just saves having to keep an index and check it against the size. In Ruby the speed was the same; in Crystal it was different. That is not a problem, just a bit surprising. As shifting in its current state is too slow for my use case, I've changed it to just looping through the elements in sorted order. But if Array gets implemented as a Deque and shifting is fast again, I might change it back for readability.

funny-falcon commented 7 years ago

Looks like sorting in reverse order and pop will also do the job for you.

benoist commented 7 years ago

Yes that would also work :-)

funny-falcon commented 7 years ago

@asterite there is no need for a circular buffer. Just move the array's contents to the beginning of the allocation when the end of the allocation is reached. That is actually how Ruby's Array works; the "shared array" is just an implementation trick. Crystal could use a pointer to the beginning of the allocation, the size of the allocation, a pointer to the first element, and the size. In other words, only a pointer to the first element needs to be added (or a pointer to the beginning of the allocation).

asterite commented 7 years ago

@funny-falcon Yes, that's a possibility. But if you do shift + push in a loop doesn't it always grow the array infinitely? The current pointer would be increased but never decreased.

funny-falcon commented 7 years ago

If Array is to be used as a Deque, then there should be amortization room, so the move doesn't happen too often.

I made a fix for Ruby's Array for exactly this scenario (i.e. Ruby's Array as a Deque).

funny-falcon commented 7 years ago

@asterite

But if you do shift + push in a loop doesn't it always grow the array infinitely?

As I've said, when a push reaches the capacity bound and there is room freed by shifts, the elements are moved to the beginning of the allocation.

funny-falcon commented 7 years ago

If I'm not persona non grata here, I can make a PR for this.

asterite commented 7 years ago

@funny-falcon Sure! No one is non-grata here.

I still have my doubts about this, though: adding four bytes to all arrays for just one method that's maybe not used all that often. And, for example, in the OP's use case there was really no need for a shift.

oprypin commented 7 years ago

Haven't we already mentioned that it's too early for this kind of optimization? Besides, this is nothing more than a workaround.

oprypin commented 7 years ago

Bleh nevermind, not having to use Deque would be tempting. Uh nevermind on the nevermind. See my next comment

RX14 commented 7 years ago

Just a quick question: what downsides are there to a deque compared to an array with a start offset from the allocation start? Apart from the obvious to_unsafe one, which I don't think is a big deal. I wouldn't have thought that array accesses would be slowed down much by the tiny bit of extra maths.

funny-falcon commented 7 years ago

@asterite, the single benefit of implementing this is getting rid of Deque and being closer to Ruby.

All other issues could be solved with programmer discipline.

@RX14 the simplest way is to not change the usage of @buffer, just move it on shift (and, possibly, on unshift). We only need to store a pointer to the allocation, or the number of elements @buffer has been shifted from the allocation (i.e. an offset). In the latter case, the allocation start is calculated as @buffer - @offset.

funny-falcon commented 7 years ago

I'd prefer to store a pointer to the allocation.

oprypin commented 7 years ago

First, about the upsides of Deque and the downsides of this suggested approach (which, frankly, has NOT been clearly described so far). This kind of workaround cannot achieve Deque's asymptotic complexity. Deque guarantees O(1) complexity for a balanced number of push+shift or unshift+pop. This kind of shifted array would instead occasionally (with a constant factor!) incur an O(n) operation. So the improvement is by a constant factor (or, as some would say, no improvement) over the basic Array. Sure, it improves the case of many consecutive shift calls (which Deque also covers very well), but why all the complexity just for this case? And now I'm sure that it definitely cannot replace Deque.

The downside of a Deque is, hmm, I don't know, but it probably destroys various kinds of caches so is unforgivably slower at typical operations you do with it.

RX14 commented 7 years ago

I really don't think that Deque will be any slower than Array in typical operations. The only downsides I can think of is that the prefetcher will get confused when wrapping, and that the branch prediction might fail too. Hopefully LLVM can compile it to a conditional move to avoid having to stall the pipeline.

In actual fact, I think that 95% of arrays will never wrap. Adding elements to the end is by far the most common array mutation op, and so I think most arrays won't ever wrap around.

oprypin commented 7 years ago

In fact, instead of this suggestion, why not actually mix the implementations of Array and Deque? Have a shortcut scenario for when it's aligned, and re-align it to zero when required (to_unsafe) and before expensive operations (sort!).
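
This hybrid can be sketched as deque-like storage that tracks a head index; operations that need a contiguous, zero-based buffer (to_unsafe, sort!) realign first, and already-aligned arrays skip that step. A hypothetical Python sketch of the mechanism, not a proposed Crystal API:

```python
class HybridArray:
    """Deque-like storage with an explicit realign step for operations
    that need a contiguous, zero-based buffer."""

    def __init__(self, items, capacity=8):
        self.buf = [None] * capacity
        self.head = 0
        self.size = len(items)
        for i, x in enumerate(items):
            self.buf[i] = x

    def shift(self):
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.size -= 1
        return value

    def realign(self):
        # Rotate elements so the head sits at physical index 0; afterwards
        # plain-array code (sort!, to_unsafe) can run unchanged.
        if self.head != 0:
            n = len(self.buf)
            self.buf = [self.buf[(self.head + i) % n] for i in range(n)]
            self.head = 0

    def sort(self):
        self.realign()             # no-op fast path when already aligned
        self.buf[:self.size] = sorted(self.buf[:self.size])

a = HybridArray([3, 1, 2])
assert a.shift() == 3              # head moves; array is now "unaligned"
a.sort()                           # realigns once, then sorts in place
assert a.buf[:a.size] == [1, 2]
```

The realignment cost is paid only once per run of shifts, which matches the compromise discussed above for to_unsafe.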

RX14 commented 7 years ago

@oprypin what do you mean by a "shortcut scenario"? That the check offset + i < size will always be true and it'll just act like a plain array?