exercism / ruby

Exercism exercises in Ruby.
https://exercism.org/tracks/ruby
MIT License
551 stars 517 forks source link

Binary search is not well understood #495

Open remcopeereboom opened 7 years ago

remcopeereboom commented 7 years ago

The whole point of binary search is to do array searches in logarithmic time, instead of in linear time. Having recently taken a look at the implementations, I have not found a single one that actually does this. In fact, all the solutions I've looked at actually check to see that the array given is sorted by sorting copy of the array. This means that the solution is at least O(n) because it needs to create a copy and depending on the sorting algorithm (I think the default is merge sort), might well be O(n log n).

I think the README should make this point a lot clearer, but exercism might not be the best place to teach algorithmic complexity (or it might, I don't know). Perhaps we could add a rikki nit to check if the BinarySearch constructor sorts the input array and a call is made to the constructor in the code. Another rikki point might be to scan the code for array slices as these also create copy arrays, violating the O(log n) goal.

While it is possible to actually implement the problem as it stands in an O(log n) manner, by not using recursion, it is rather limiting. Not to mention the fact that non-recursive binary search is actually a very tricky algorithm to code up (it's easy to get off-by-one errors). It is also possible with recursion, but people tend to have difficulty in coming up with the idea of passing the start and end indices around as parameters. That would be a nice thing to teach, but is not really in the spirit of Ruby (programmer happiness)

We should also remove the test that asks clients to raise an ArgumentError if the array given is not in sorted order. It seems to mislead a lot of people and doesn't even give us any security. There is nothing stopping clients from mutating the array passed in, but exercists don't seem to be aware of that. It's a false sense of security and so is arguably teaching them a bad habit.

remcopeereboom commented 7 years ago

I wanted to add that the problems above feel like they are specifically ruby problems. Functional languages for example, can easily pass around array slices as they typically don't need to create copies. Many non-functional languages actually care about performance more than ruby and are a lot "closer to the metal" so thinking in terms of algorithmic complexity is more natural there.

bmulvihill commented 7 years ago

These are good points. I agree that exercism might not be the right place to teach algorithm analysis, and ruby does not lend itself well to implementing some of these "lower level" algorithms (and wouldn't even be realistic in the real world), but understanding the idea behind the exercise (even in ruby) is not a bad thing. I do wonder though how we could enforce a specific implementation (besides using nits from rikki), which I don't really think is possible (or ideal).

bmulvihill commented 7 years ago

I have been thinking about this some more. One thought would be to distance ourselves from the "binary search" aspect of the exercise, and through code reviews/breadcrumbs lead the developer to the conclusion that there are more optimal ways to implement the particular solution.

Really we can only test what a method is doing, not its implementation so maybe the exercise should really be something along the lines of "Find the index of a particular item in a sorted array"

petertseng commented 7 years ago

We should also remove the test that asks clients to raise an ArgumentError if the array given is not in sorted order

Seems good; that is in keeping with https://github.com/exercism/x-common/issues/234

I do wonder though how we could enforce a specific implementation (besides using nits from rikki), which I don't really think is possible (or ideal).

I'm not sure if this is something that would be desirable to replicate, but I'll share it in case it is: Lua creates a special TracedArray that counts the number of times you access its elements.

I have not done this exercise in Lua so I do not know whether this is prone to causing false positives or negatives, etc.

Sieve is another example of an exercise where we can only look at the result, not the implementation, and there are various submissions that will use a non-sieving implementation. Indeed, it can be argued that Exercism exercises in general should not really focus on particular implementations.

remcopeereboom commented 7 years ago

Sieve is another example of an exercise where we can only look at the result, not the implementation, and there are various submissions that will use a non-sieving implementation.

I was actually just about to mention that.


Lua creates a special TracedArray that counts the number of times you access its elements.

That seems like a great idea.

remcopeereboom commented 7 years ago

Something like the following seems like a good idea to me. Using a static array guarantees that we don't get weird array access and by ensuring that it is sorted, we can move the "ensure sorted" assertion out of the BinarySearch algorithm.

# A static sorted array.
# Maintains the invariant that all items are stored in sorted order.
# Constructing a new {TracedArray} takes time proportional to the number of
# items in the array. Returning the size and accessing an element of the array
# takes constant time.
class TracedArray
  # @!attribute [r] reads
  #   @return [Integer] The number of read accesses.
  attr_reader :reads

  # Initializes a new TracedArray
  # @param collection [Enumerable]
  def initialize(collection)
    @array  = collection.to_a.sort
    @reads  = 0
  end

  # Returns the size of the array.
  # @return [Integer]
  def size
    @array.size
  end

  # Return the ith element.
  # @param i [Integer] The index (Negative indices are NOT supported).
  # @return [Object]
  # @raise [RangeError] if i lies outside the valid dimensions of the array.
  def [](i)
    fail RangeError, i unless (0...@array.size).cover? i
    @reads += 1
    @array[i]
  end
end
remcopeereboom commented 7 years ago

Some suggestions for changing the tests:

  1. Remove test_it_raises_error_for_unsorted_list as this is a precondition that we cannot enforce when doing a search without breaking the performance requirements and is really giving false confidence when used in the constructor.
  2. Remove test_it_raises_error_for_data_not_in_list. A better solution is to return a negative number. Raising an exception is both expensive (you'd need to capture the exception, which adds overhead to the client) and not really appropriate. Not finding an item is not an exceptional circumstance when searching.
  3. Remove test_it_finds_position_of_middle_item. Why do we care about returning the middle item? If we did, there should be a method #middle on the array, not on the binary search algorithm.
  4. Add a test with random data. We could use some fuzz testing :)
bmulvihill commented 7 years ago

@remcopeereboom This looks interesting, but I think forcing a user to implement a O(log n) search algorithm might be above the scope of exercism in its current form, and could very well turn off some users who don't have a CS background. I think letting the user implement a solution that works first, and through code review be brought to better implementations is a better pattern.

I wonder though if we could categorize exercises somehow by their intended goal using the topics array in the config. (i.e. algorithm design and analysis/general purpose/language specific/object oriented design etc.)

This way a user who is interested in sharpening their CS skills could pull those types of exercises vs someone who is just trying to get a feel for the language and improve their broader knowledge.

I am curious what do @Insti and @kotp think?

kotp commented 7 years ago

There is a (few?) changes and conversation being made regarding ratings for difficulty and perhaps categorizing based on things being taught per exercise.

I am not sure that this will force the 0(log n) solution, but may encourage it. If it forces it, then the discussion goes away, and that is not a good thing. @bmulvihill are you finding that it does force this aspect?

bmulvihill commented 7 years ago

@kotp I was looking at how Lua was using the traced array where they are ensuring that an element is found in O(log n)

https://github.com/exercism/xlua/blob/master/exercises/binary-search/binary-search_spec.lua#L32-L37

kotp commented 7 years ago

Thanks @bmulvihill wanted to clarify the reference.

remcopeereboom commented 7 years ago

@bmulvihill

I think forcing a user to implement a O(log n) search algorithm might be above the scope of exercism in its current form

Yes, that was my thinking as well. More particularly, I don't think that it is a good platform for implementing specific algorithms. There is also a prime number sieve problem which asks clients to implement the Sieve of Eratosthenes, but very few people do so. Instead, they implement different algorithms that all find the first x primes. That was exercism is much better at, discussing a particular solution to a general (simple) problem. Some solutions might optimize performance, others might not. But the BinarySearch problem is specifically about performance. In fact, I'd ask how you would even say that something is a BinarySearch if it did not have O(log n) performance).

There is simply no reason to do binary search if not for performance. After all, a simple linear search can be used on collections that are not sorted and data structures that are not arrays.

I think letting the user implement a solution that works first, and through code review be brought to better implementations is a better pattern.

I don't think that works for binary search as there really is only one good way to code one up; it is very hard to spot all the array copies (even for reviewers); and the topic of algorithmic performance is not easily (nor adequately) explained in a nit. But perhaps I am being overly negative. I can see this working for a problem that ask to simply find the element in a sorted array, but at the point we can start asking whether the problem is not too simple.

bmulvihill commented 7 years ago

There is simply no reason to do binary search if not for performance. After all, a simple linear search can be used on collections that are not sorted and data structures that are not arrays.

@remcopeereboom I agree with this statement. I think categorizing exercises under certain topics might solve this problem (but that is a larger issue). If I am a CS student, or I just looking to brush up on my algorithms before an interview this is the exact type of problem I would want to solve, and I would want my solution to hold up to the O(log n) requirement. If I am just trying to practice a language I might not be as interested in this exercise or want to be burdened with trying to meet that performance requirement.

I think the exercise difference-of-squares is a great example of how a user can be lead through review and bread crumbs from an O(n) solution to an O(1) by discovering the mathematical formulas behind the algorithm (and this can be done without getting too much into the details of big O). I don't know if that directly applies here because then as you mentioned the problem might be too simple.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.