dry-rb / dry-schema

Coercion and validation for data structures
https://dry-rb.org/gems/dry-schema
MIT License
421 stars 110 forks source link

Make KeyValidator faster by sorting keys and using binary search algorithm (Array#bsearch) #453

Closed radarek closed 1 year ago

radarek commented 1 year ago

I noticed that validating schema is very slow if it contains thousands of keys. Current algorithm is O(N^2), which means that doubling number of keys makes it ~four times slower. I was able to optimize it by sorting keys once and then using Array#bsearch method which is O(N*logN).

Benchmark

Here is benchmark I run before and after the change:

require 'dry/schema'
require 'benchmark'

def sized_benchmark(size, reporter)
  hash = size.times.to_h { [:"key_#{_1}", true] }
  nested = { foo: { bar: { baz: hash } } }

  schema = Dry::Schema.define do
    config.validate_keys = true

    required(:foo).hash do
      required(:bar).hash do
        required(:baz).hash do
          hash.each do |key, _|
            required(key)
          end
        end
      end
    end
  end

  reporter.report("schema call with size = #{size}") do
    schema.call(nested)
  end
end

Benchmark.bmbm do |x|
  sized_benchmark(1, x)
  sized_benchmark(10, x)
  sized_benchmark(100, x)
  sized_benchmark(1000, x)
  sized_benchmark(10000, x)
end

Before:

$ ruby -Ilib/ bm.rb
Rehearsal -----------------------------------------------------------------
schema call with size = 1       0.001126   0.000397   0.001523 (  0.002303)
schema call with size = 10      0.000085   0.000001   0.000086 (  0.000086)
schema call with size = 100     0.001486   0.000013   0.001499 (  0.001502)
schema call with size = 1000    0.145060   0.002052   0.147112 (  0.147127)
schema call with size = 10000  12.251927   0.109480  12.361407 ( 12.384599)
------------------------------------------------------- total: 12.511627sec

                                    user     system      total        real
schema call with size = 1       0.000101   0.000000   0.000101 (  0.000099)
schema call with size = 10      0.000117   0.000001   0.000118 (  0.000117)
schema call with size = 100     0.001314   0.000019   0.001333 (  0.001332)
schema call with size = 1000    0.122066   0.000790   0.122856 (  0.122957)
schema call with size = 10000  12.230255   0.094633  12.324888 ( 12.347312)

After:

$ ruby -Ilib/ bm.rb
Rehearsal -----------------------------------------------------------------
schema call with size = 1       0.001233   0.000475   0.001708 (  0.002473)
schema call with size = 10      0.000080   0.000000   0.000080 (  0.000081)
schema call with size = 100     0.000413   0.000001   0.000414 (  0.000414)
schema call with size = 1000    0.003748   0.000042   0.003790 (  0.003794)
schema call with size = 10000   0.052483   0.000983   0.053466 (  0.054539)
-------------------------------------------------------- total: 0.059458sec

                                    user     system      total        real
schema call with size = 1       0.000099   0.000000   0.000099 (  0.000097)
schema call with size = 10      0.000116   0.000001   0.000117 (  0.000115)
schema call with size = 100     0.000350   0.000000   0.000350 (  0.000350)
schema call with size = 1000    0.002802   0.000096   0.002898 (  0.002894)
schema call with size = 10000   0.032131   0.000347   0.032478 (  0.032612)
solnic commented 1 year ago

Thank you! This is great ❤️ 🚀

radarek commented 1 year ago

@solnic Can you please revert it? This implementation is not correct. I need to rethink some edge cases.

Here is an example of failing specs with this PR but working on main:

it "copies key map from the parent and includes new keys from child" do
  schema = Dry::Schema.Params do
    config.validate_keys = true

    required(:a).filled(:string)
    required(:fooA).filled(:string)
    required(:foo).array(:hash) do
      required(:bar).filled(:string)
    end
  end

  expect(schema.(a: "string", fooA: "string", foo: "string").errors.to_h)
    .to eql({foo: ["must be an array"]})
end

In this PR it produces :foo => ["is not allowed"].

solnic commented 1 year ago

@radarek no worries. Could we have another PR that fixes it instead of reverting?

radarek commented 1 year ago

@solnic If it is not a problem that currently main is broken then I will create new PR. I need some time to make more tests.