crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.46k stars 1.62k forks source link

[Proposal] Array stable sort #6057

Closed esse closed 2 years ago

esse commented 6 years ago

It may be useful to have the option to stable sort arrays.

I'm thinking about adding it as a flag in Array#sort, defaulting to false, signature looks like this:

def sort(stable = false) : Array(T)

Therefore if needed, it can provide stable sorting by calling: [1,...,-1].sort(stable: true)

Another option is of course to have a separate method (like in c++ or golang): [1,...,-1].stable_sort

Tbh I find the version with parameter a little bit better - if doesn't add a new method to array public interface and doesn't break anything (as it defaults to false).

In spirit if this comment: https://github.com/crystal-lang/crystal/issues/2350#issuecomment-261742027

I would be very happy to implement either of this versions.

RX14 commented 6 years ago

I thought our current algorthm was fast and stable?

asterite commented 6 years ago

Doesn't seem so:

class Foo
  include Comparable(Foo)

  @@id = 0

  @id : Int32

  def initialize(@x : Int32)
    @id = @@id
    @@id += 1
  end

  def <=>(other : Foo)
    @x <=> other.@x
  end
end

foos = Array.new(100) { Foo.new(rand(1..2)) }

pp foos

foos.sort!

pp foos

All @ids with the same @x value should be in increasing order, but they are not.

esse commented 6 years ago

@RX14 - the whole point is, that it isn't:

a = (1..17).to_a
puts a == a.sort { 0 } # => false

(example taken from https://github.com/crystal-lang/crystal/issues/2350#issuecomment-261704788, tested on 0.24.2)

RX14 commented 6 years ago

I stand corrected! What stable sorting algorithms give the best generic results? We have introsort for our standard sort, do hybrid algorithms win in stable sorting too?

RX14 commented 6 years ago

Looks like timsort is the one to go for.

esse commented 6 years ago

Ok, so I will start working on timsort here.

Can you add an in-progress label? Which version would be better - one with extra argument provided to sort or one with the new method?

jhass commented 6 years ago

Is there any downsides to just having one sort that's stable, given the performance is comparable? How does timsort compare to introsort?

esse commented 6 years ago

Timsort got worst space complexity equals to n, while introsort got log n (due to recursion).

Also, there is a reason why C++, or golang have separate algorithms for stable or normal sort. When stability isn't required it's better to have better space complexity.

asterite commented 6 years ago

For the calling style, let's go with a stable: true required named argument.

RX14 commented 6 years ago

@asterite you mean always require an explicit stable: true or stable: false? I don't think that's neccesary here...

asterite commented 6 years ago

I mean, with a default value of false

esse commented 6 years ago

A default value of false would be great - it wouldn't break existing api.

akoskovacs commented 6 years ago

I think the best would be to create an enum item for all possible sort algorithms. So later it could be expanded as needs be. Always adding a new bool parameter seems ugly.

straight-shoota commented 6 years ago

I don't see a reason why you would need support for multiple sorting algorithms. One that is stable and one that is more performant should be plenty for the stdlib.

rdp commented 6 years ago

It has sometimes surprised me that the current sort method seems to "rearrange" the order of equal elements. It would be really nice if stable were the default (or you could go the java route, stable for objects, unstable default for primitive arrays, since it doesn't matter if they get swapped). Java gives their rationale for it here https://stackoverflow.com/a/15154287/32453 "Stability is a big deal when sorting arbitrary objects. For example, suppose you have objects representing email messages, and you sort them first by date, then by sender. You expect them to be sorted by date within each sender, but that will only be true if the sort is stable." but having an unstable option would be nice too. The latest implementations for the JDK seem to be "Timsort" for stable and "dual pivot quicksort" for unstable: https://stackoverflow.com/questions/32334319/why-does-collections-sort-use-mergesort-but-arrays-sort-does-not it would be interesting to have different types available and be able to call them out by name, like "heapsort" etc. in case they work differently on different input data (though you might only need timsort and dual pivot quicksort typically), and also maybe nice to retain an option to use quicksort on normal objects arrays when speed is still preferred...so either way :)

rdp commented 5 years ago

The only question in my head still is should stable be the default?

It makes sense to me if you want to chain .sort_by(&.x).sort_by(&.y) though I guess it could be written as .sort_by(:stable => true, &.x}.sort_by(:stable => true, &.y} but it feels to me as if stable should be the default, otherwise what does sort_by even mean when chained? So I kind of like the java way of stable by default.

(somebody once proposed possibly having #sort_by as stable and #sort as unstable, by default, that might be a bit confusing but is an option?: https://www.ruby-forum.com/t/sort-by-is-not-stable/198761). Maybe I'll run a poll on the crystal forum if there's no objection to doing so, in a few days about this question...

RX14 commented 4 years ago

@rdp .sort_by { |foo| {foo.y, foo.x} }, seems clearer and faster to me.

rdp commented 4 years ago

Mmm that doesn't seem immediately clear to me...plus you can't as easily chain (or chain later based on conditions, it loses incoming initial ordering, if any.)... :|

rdp commented 4 years ago

If we want to go the extra mile then for unstable sorting of int's (that takes slightly more space) there's radix sort and for strings "burst sort." May as well go for the very fastest, right? Hmm those would be non in place sorts...now things are getting confusing option wise...though I guess we could add an :in_place option later...

konovod commented 4 years ago

@rdp there is unstable inplace and stable but non-inplace versions of radix sort. I've tried to port them to crystal some time ago, and they are pretty fast (though my implementation seems to have some bugs).

rdp commented 4 years ago

OK I am trying to figure out what the best defaults would be:

My current hypothesis is to have unstable be the default only if it is an Array of primitive type (ints, floats, Strings) that are sorted using the standard <=> comparator.

This way if developers new up an array of ints, and call sort on it, they get the fast algorithm by default (which is how java does it, primitives use an unstable sort). The theory being that "primitives" are "inter changeable" so it doesn't affect the output if, for instance, various 3's happen to get all mixed out of their initial order, it has the same final outcome as if the sort had been stable.

For "normal objects" (using the default comparator <=>) it should default to "stable sort" so that previous ordering isn't lost (like how Python and Java do, for that reason).

It seems, if blocks are used, even for Arrays of primitives, that stable sorting is preferred.

ex: if I want to put "all 3's" at the front of an array [lame example, but still reasonable]:

(1..17).to_a.sort_by{ |i| i == 3 ? -1 : 1 }

I'd expect the output to retain incoming order for equal elements, ex: [3,1,2,4,5,6, ...] , so it seems to me the default when using a block should be stable sort, as well.

In terms of actually introducing the modified behavior, in my mind if the defaults are appropriate, and overridable, maybe just do it with a release note warning breaking change, specifyunstable: trueif you need the old behavior? Is that enough?

I suppose we could first do a release with unstable: true everywhere and then warn everyone to explicitly call it out before the next release (if they depend on unstable sorts?), then change the defaults to the above (stable) with some subsequent release.

But even though sorts today are using unstable, people will still get the same "sorting order" (but not exact order) they were specifying with new stable sort defaults, so I don't think there is much, if any, dependence on unstable sorts [?].

But if you prefer to (in essence) deprecate it first (unstable: true everywhere for 1 release, then new default) that would be OK.

If no feedback I'll go ahead and move forward with a PR that includes the defaults above [and also uses merge sort for now for the stable sort--somebody could implement the more complicated Timsort or Quadsort to optimize it at some point, I'm mostly focused on getting the API and behavior right, at this point].

Thanks!

straight-shoota commented 4 years ago

Sounds good!

I don't think we need an expensive deprecation cycle. The sorting behaviour has so far not been specified, so the results just arbitrarily reflect implementation details. Maybe someone depends on the specifics, but I doubt there's a huge impact if we change it. Maybe some spec values need to be reordered.

z64 commented 3 years ago

Bump - would love to have a stable sort as described by @rdp. We use tons of sorting in our app, and forgetting these properties of cr's sort has bitten us sometimes.

asterite commented 3 years ago

Please send a PR

straight-shoota commented 3 years ago

With #10163 on the table I'd like to discuss default behaviour. It is my understanding that unstable sort is generally more efficient (although it depends on particular data, so there may be no significant difference on some search input like already sorted data). Stable sort is more what you would naturally expect. So in my opinion, stable sort should be default. If you want more optimization and don't care about stable sorting, then you can opt in to use a more efficient search algorithm.

Making stable the default is technically not a breaking change in the API because the behaviour for equally comparing elements is undefined as of now. It just depends on the implementation.

In practice, surely it may break some code. I'd mostly expect that to be specs, though. And code that depends on the undefined order of equally comparing elements is already broken.

asterite commented 3 years ago

I agree, stable sort should be the default.

rdp commented 3 years ago

I once experimented with stable sort as the default "except for Numeric and String" (since they're typically "indistinguishable" stable order doesn't normally matter). But then again how many people are going to be sorting an Array of int's except for micro benchmarking purposes? And sorting those could be optimized by a later commit using something even more exotic like radix or burst sort. I'd definitely be in favor of stable as default (less surprises, plus it uses Timsort now, which is theoretically better for already sorted arrays etc.). Without stable as default, chained sort_by's don't make sens at all... which can be very surprising...either that or stable "except for Numeric and String" but that can be added later ... :)

asterite commented 3 years ago

Oh, optimizing the cases of built ins like numbers and strings is a good idea.

makenowjust commented 3 years ago

I think this optimization becomes a very exception. We can only apply this optimization into sort or sort! without block against an Array or Slice of numeric values or String values. For example, if we accepted this optimization, the Array#sort! implementation will become the following:

  def sort!(stable : Bool = true, &block)
    # It is the same as now.
  end

  def sort!(stable : Bool | Nil = nil)
    if stable.nil?
      stable = T <= Number || T <= String
    end

    if stable
      Slice.merge_sort!(self)
    else
      Slice.intro_sort!(to_unsafe, size)
    end
    self
  end

But this API design seems unnatural to me. This is backseat driving instead of cleaver optimization.

asterite commented 3 years ago

I think stable should be set to false inside the method if the type is a string or int (or char, etc.). But this is just a small optimization.

straight-shoota commented 3 years ago

Yeah, there's no need for Nil because you can just ignore stable when it makes no difference.

makenowjust commented 3 years ago

We can consider some corner cases in this optimization. For example, array element types are mixed like [1.0f64, 1u64], 1.0f64 == 1u64 but we can distinguish them, we should use the stable sort or not?

I am not sure this optimization works for users. @rdp said, "But then again how many people are going to be sorting an Array of int's except for micro benchmarking purposes?" I have the same question, and I think users only get confusing by this for now.

asterite commented 3 years ago

Yes, let's not worry about optimizations right now. These can come later.

straight-shoota commented 3 years ago

We talked about this feature in the last core team meeting. While we are eager to have stable sort and the proposed implementation in #10163 looks really great, there have been some concerns about the API and we'd like to refine that.

There is no need to select stable vs. unstable sort at runtime. So this should not be determined by a method parameter. Instead, there should be separate methods. Depending on which we want to promote as default, these would be #sort and either #unstable_sort or #stable_sort. Golang for example has Sort and Stable functions, so unstable is the quasi default. The question is whether we want a more performant default or if the expectation that sorting is stable is so strong that we wouldn't want to break it. Alternatively, the question has been raised again, whether we should actually have unstable sort in the standard library, if stable sort is good enough for all use cases and only somewhat less performant. If raw performance is an issue and stability not relevant, an unstable sort implementation could be provided as a shard. Python for example has only a stable sort implementation.

So, the options we're talking about are:

  1. #sort and #stable_sort (unstable default)
  2. #sort and #unstable_sort (stable default)
  3. #sort (stable only)
straight-shoota commented 3 years ago

I'm personally not very decided at the moment. A decision needs to be made, but all options have good arguments.

However, a considerations on perspective: There is a strong contrast between not having unstable sort at all and making unstable the default over stable. The latter emphasizes on performance, while the former doesn't even provide a performance option. Yet, we value both alternatives as acceptable.

asterite commented 3 years ago

I'm personally in favor of making sort be stable, and providing an unstable_sort as an alternative, in the standard library. We can document that sort is stable, and that if you don't care about that property and you need a bit more performance, you can use unstable_sort.

makenowjust commented 3 years ago

According to Rust's unstable sort RFC:

Q: Why is slice::sort stable?
A: Because stability is a good default. A programmer might call a sort function without checking in the documentation whether it is stable or unstable. It is very intuitive to assume stability, so having slice::sort perform unstable sorting might cause unpleasant surprises. See this story for an example.

In a point of this view, I would make sort become stable. I have no strong opinion which unstable_sort or splitting shards are better.

asterite commented 3 years ago

We could probably make sort become stable by default, keep the unstable algorithm somewhere (well, it's in git), and only add unstable_sort if it's requested. My guess is that nobody will request it. And, as said, it could be provided by a shard.

straight-shoota commented 3 years ago

We should have such comprehensive RFCs as well :)

rdp commented 3 years ago

Seems like one of crystal's objectives is "able to be supah fast"...so maybe it'd be good to have unstable available for the handful of times people want it (sorting very large collections? It might be reasonably common, and nice to have so newcomers can realize "wait, there is a faster way to sort? there are different ways to sort at alll?")

HertzDevil commented 3 years ago

If we want unstable sort as an optimization for types that have a total order (e.g. integers), that implementation must stay somewhere in the standard library. There are also places where the sort key is known in advance to be unique, such as in the compiler itself:

https://github.com/crystal-lang/crystal/blob/849d1d1ce6dac18b580767f99681e5062a6ac2e8/src/compiler/crystal/program.cr#L352-L363

Here opaque_id is 0 for Crystal::NilType, and object_id otherwise. The compiler, of course, needs to run as fast as possible. Thus I think unstable sort should continue to be exposed in the standard library.

Because stable sort can opt in to the unstable variant for the types described above, but not vice-versa, it would be slightly misleading if stable_sort doesn't always do that. So I vote for making sort stable and adding separate unstable_sort overloads. The longer method name also signals that it shouldn't be used as often as the shorter name (similar to unsafe_fetch).

asterite commented 3 years ago

We could also intrrnally use unstable sort for integer types and other such types, by doing a compile time check.

straight-shoota commented 3 years ago

I have inadvertently merged #10163 because it was in the merge queue for 1.2.

Thus the implementation is already in master, but with an API that we'd like to improve.

There seems to be board support for making #sort stable and adding #unstable_sort for unstable sort.

Per suggestion in https://github.com/crystal-lang/crystal/issues/6057#issuecomment-873395520, we can implicitly improve performance of #sort for inherently stable data types. That should be treated a separate improvement.

I'd like to focus on getting the API right first.

It might actually be a good idea to introduce a Sortable(T) module for this. The implementation of most #sort* methods is very similar if not identical between Slice and Array, and they are all based on four base implementations. Using a module removes repetition and also allows to easily add sorting for other compatible data types. From stdlib, StaticArray would qualify for that.

HertzDevil commented 3 years ago

We should add @[Experimental] to all the sorting methods if we cannot come to a decision before 1.2 is released.

Any container that exposes a writable Slice of all its elements is sortable, but ideally containers like Deque should also be sortable, because of all else fails one can still perform insertion sort on them; it is only that the current Slice-based algorithm doesn't work on those types. Thus the sortable types should precisely be the Indexable types that are also writable. I would not expose such a Sortable module to the public API if it relies on this writable Slice capability.

straight-shoota commented 3 years ago

The six non-modifying sort methods simply delegate to their modifying counterpart on a duped instance. That alone warrants some generalization via shared implementation vs. implementing all these methods for any type that has (non-modifying) sort methods. I'm pretty sure that should as well work for Deque and other types that can't represent their elements as a slice.

The self-modifying variants do not have a generic implementation. That would make them candidates for abstract defs. Shared documentation would also be pretty much the same for any implementing type (except for a customized usage example).

HertzDevil commented 3 years ago

Here is an obviously unoptimized patch that makes Deque sortable after #11057: (any mutable Indexable admits similar default implementations, see #5142)

class Deque(T)
  include Sortable(T)

  def sort! : self
    replace! to_a.sort!
  end

  def sort!(& : T, T -> U) : self forall U
    replace! to_a.sort! { |x, y| yield x, y }
  end

  def sort_by!(& : T -> _) : self
    replace! to_a.sort_by! { |x| yield x }
  end

  def unstable_sort! : self
    replace! to_a.unstable_sort!
  end

  def unstable_sort!(& : T, T -> U) : self forall U
    replace! to_a.unstable_sort! { |x, y| yield x, y }
  end

  def unstable_sort_by!(& : T -> _) : self
    replace! to_a.unstable_sort_by! { |x| yield x }
  end

  private def replace!(other)
    each_index do |i|
      self[i] = other.to_unsafe[i]
    end
    self
  end
end

Now suppose that we want to use unstable sorting automatically for Int32. This property is independent of the actual includer; any Sortable should be able to opt in to unstable sorting. We cannot do this within Sortable, because some of the methods are abstract. The result would be:

class Deque(T)
  def sort! : self
    {% if T == Int32 %}
      unstable_sort!
    {% else %}
      # original implementation
    {% end %}
  end

  def sort_by!(& : T -> U) : self forall U
    {% if U == Int32 %}
      unstable_sort_by! { |elem| yield elem }
    {% else %}
      # original implementation
    {% end %}
  end

  # we cannot do this to sort!(&), because there is no sort key
end

And this boilerplate has to be repeated in every includer of Sortable (or without #11057, the types that implement sorting methods). This could probably be improved later.

straight-shoota commented 3 years ago

Good point about the optimization code. I suppose we could move the abstract defs to internal implementation methods (such private abstract def sort_implementation!) and add implementations for the public methods which can then contain the optimization code.

An alternative might be to override the public methods in a macro included + macro finished block with delegation to previous_def. That's way more hacky, though.

straight-shoota commented 2 years ago

I think this can be closed now. Stable sort has shipped in 1.2.