Add Array#sort#bsearch example to bsearch-vs-find

mrThe commented 9 years ago

With Array#sort it looks not so fast, but still faster than find.

Arcovion commented 9 years ago

Entirely depends on the speed of the sort:

DATA = [*0..1_000_000].shuffle

# Calculating -------------------------------------
#                 find     1.000  i/100ms
#         sort.bsearch     1.000  i/100ms
#              bsearch    72.382k i/100ms
# -------------------------------------------------
#                 find     16.799  (± 0.0%) i/s -     84.000 
#         sort.bsearch      6.734  (± 0.0%) i/s -     34.000 
#              bsearch      1.033M (± 1.8%) i/s -      5.212M

# Comparison:
#              bsearch:  1033428.3 i/s
#                 find:       16.8 i/s - 61518.15x slower
#         sort.bsearch:        6.7 i/s - 153468.41x slower

Binary searching only works on sorted data anyway, as mentioned in the readme.

mrThe commented 9 years ago

@Arcovion

require 'benchmark/ips'

DATA = [*0..1_000_000]
DATA_RANDOM = [*0..1_000_000].shuffle

Benchmark.ips do |x|
  x.report('data')        { DATA.sort }
  x.report('data_random') { DATA_RANDOM.sort }
  x.compare!
end

Calculating -------------------------------------
                data     8.000  i/100ms
         data_random     1.000  i/100ms
-------------------------------------------------
                data     91.713  (±10.9%) i/s -    456.000
         data_random      4.554  (± 0.0%) i/s -     23.000

Comparison:
                data:       91.7 i/s
         data_random:        4.6 i/s - 20.14x slower

So, if

require 'benchmark/ips'

DATA = [*0..1_000_000]
DATA_RANDOM = [*0..1_000_000].shuffle

def slowest(data)
  data.find { |number| number > 77_777_777 }
end

def slow(data)
  data.sort.bsearch { |number| number > 77_777_777 }
end

def fastest(data)
  data.bsearch { |number| number > 77_777_777 }
end

Benchmark.ips do |x|
  x.report('find')         { slowest(DATA) }
  x.report('sort.bsearch') { slow(DATA) }
  x.report('bsearch')      { fastest(DATA) }
  x.compare!
end

Benchmark.ips do |x|
  x.report('find')         { slowest(DATA_RANDOM) }
  x.report('sort.find')    { DATA_RANDOM.sort.find { |number| number > 77_777_777 } }
  x.report('sort.bsearch') { slow(DATA_RANDOM) }
  x.compare!
end

Then

✗ ruby -v code/array/bsearch-vs-find.rb
ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]
Calculating -------------------------------------
                find     1.000  i/100ms
        sort.bsearch     7.000  i/100ms
             bsearch    37.411k i/100ms
-------------------------------------------------
                find     11.322  (±17.7%) i/s -     56.000
        sort.bsearch     90.577  (±14.4%) i/s -    448.000
             bsearch    712.016k (± 7.1%) i/s -      3.554M

Comparison:
             bsearch:   712015.6 i/s
        sort.bsearch:       90.6 i/s - 7860.84x slower
                find:       11.3 i/s - 62886.82x slower

Calculating -------------------------------------
                find     1.000  i/100ms
           sort.find     1.000  i/100ms
        sort.bsearch     1.000  i/100ms
-------------------------------------------------
                find     11.216  (±17.8%) i/s -     55.000
           sort.find      3.358  (± 0.0%) i/s -     17.000
        sort.bsearch      4.543  (± 0.0%) i/s -     23.000

Comparison:
                find:       11.2 i/s
        sort.bsearch:        4.5 i/s - 2.47x slower
           sort.find:        3.4 i/s - 3.34x slower

tl;dr: find is ~2x faster on unsorted array.

And, looks like my pull-request is useless in this case. Or, maybe anyone have an idea how can i improve it?

Arcovion commented 9 years ago

77_777_777 isn't between 0..1_000_000 so the entire array is iterated over. The other two are slower due to sorting.

require 'benchmark/ips'

DATA         = Array(0..1_000_000)
REVERSE_DATA = Array(0..1_000_000).reverse

def bsearch
  DATA.bsearch{ |number| number > 777_777 }
end

def find
  DATA.find{ |number| number > 777_777 }
end

def reverse_find
  REVERSE_DATA.find{ |number| number > 777_777 }
end

def index
  DATA[DATA.index{ |number| number > 777_777 }]
end

def reverse_index
  REVERSE_DATA[REVERSE_DATA.index{ |number| number > 777_777 }]
end

def rindex
  DATA[DATA.rindex{ |number| number > 777_777 }]
end

p bsearch, find, reverse_find, index, reverse_index, rindex

Benchmark.ips do |bm|
  bm.report('bsearch'){ bsearch }
  bm.report('find'){ find }
  bm.report('reverse_find'){ reverse_find }
  bm.report('index'){ index }
  bm.report('reverse_index'){ reverse_index }
  bm.report('rindex'){ rindex }
  bm.compare!
end

This gives some meaningful results:

777778
777778
1000000
777778
1000000
1000000
Calculating -------------------------------------
             bsearch    68.426k i/100ms
                find     2.000  i/100ms
        reverse_find    51.144k i/100ms
               index     3.000  i/100ms
       reverse_index   159.802k i/100ms
              rindex   159.604k i/100ms
-------------------------------------------------
             bsearch    938.018k (± 1.8%) i/s -      4.721M
                find     21.614  (± 0.0%) i/s -    110.000 
        reverse_find    697.638k (± 1.6%) i/s -      3.529M
               index     31.105  (± 0.0%) i/s -    156.000 
       reverse_index      5.299M (± 1.1%) i/s -     26.527M
              rindex      5.381M (± 1.4%) i/s -     26.973M

Comparison:
              rindex:  5380568.4 i/s
       reverse_index:  5299118.5 i/s - 1.02x slower
             bsearch:   938017.9 i/s - 5.74x slower
        reverse_find:   697638.3 i/s - 7.71x slower
               index:       31.1 i/s - 172982.81x slower
                find:       21.6 i/s - 248943.46x slower

The first number in REVERSE_DATA is 1_000_000, greater than 777_777, so reverse_find should immediately return that. bsearch somehow manages to be faster, however rindex wins doing the same thing as reverse_find. The really slow ones are iterating 777777 times until they find a match.

Basically, if the data is sorted then binary searching is always fastest. Otherwise, it just matters where the element you want is positioned in the array. If you're indexing the data a lot, it's worth sorting once in advance of many binary searches.

TL;DR: No use comparing bsearch and find as it depends whether the data is sorted. find vs index is worth comparing though - I might make a PR for that myself.

winston commented 8 years ago

I would agree that comparing find and bsearch is not very useful actually, given that bsearch only works for sorted array.

Maybe we should also remove that from the README? @JuanitoFatas?

Agree that find and index would be more interesting.

fastruby / fast-ruby

Add Array#sort#bsearch example to bsearch-vs-find #52