aiwilliams / dataset

An exceptional tool for creating test data sets using ActiveRecord.
http://aiwilliams.github.com/dataset
MIT License

test speed? Has it improved with dataset? #3

Open jeroenvandijk opened 12 years ago

jeroenvandijk commented 12 years ago

In the README I find the following text:

The goal is to see a marked improvement in overall test run speed, basing this on the assumption that it is faster to have the OS copy a file or mySQL dump and load. Of course, we may find this to be a false assumption, but there were plenty of bugs in the former 'Scenarios' - addressing that afforded the opportunity to test the assumption.

Did you find an answer to that assumption? I believe that the dataset approach could increase my test suite's speed greatly, but I would love to hear about your experience.

Thanks a lot,

Jeroen

NB. I still have to integrate dataset into my application's test suite, so it would be nice to get an answer on the above before I go to all the effort.

aiwilliams commented 12 years ago

Hello! I would say that things are faster, but there has never been an opportunity to see a large project go from not using dataset - and having some performance metrics - to using dataset and seeing those metrics improve. Also, I have not used dataset on a new project in a good while - only because I have not had a new project that could make use of it, having been on one project for so long and now working a lot with iOS and heavy JavaScript applications. That said, I seem to recall an issue where dataset might have a bug where dumps are not used when they should be in certain situations: those where nested example/test contexts add data to a set created in an outer test scope (implemented as a class hierarchy in RSpec 1 and Test::Unit).

So I would suggest a small experiment: make sure things operate with your versions of the gems, paying attention to the dump files that are created in the tmp directory.

jeroenvandijk commented 12 years ago

Hi Adam,

Thanks a lot for the feedback. Happy to hear that you feel it's faster.

I'm currently working on some test cases where I need lots of data for each scenario. At the moment I'm creating all the ActiveRecord objects before each test and saving them to the database, which is incredibly slow, obviously. The alternatives are playing with transactions, fixture data and dataset. I believe the dataset approach would be the fastest, if I understand the idea correctly.

My understanding is that you load a database dump if it exists; otherwise you create the dataset the normal way and then save a database dump for follow-up runs. The database dump only needs to be updated when the dataset changes. So the win is that you don't have to create many Ruby objects and you don't have to do many inserts for these scenarios.

I'll play with it and see if I can contribute back to get it up to date with the latest gems.

aiwilliams commented 12 years ago

If you load a dataset like this, where the name is resolved to a dataset subclass:

describe "my stuff" do
  dataset :very_expensive_inserts_with_validations_and_everything
end

Then your understanding is correct - that dataset will run, all the Ruby code will insert stuff, and then an SQL dump will be made. Any other code that wants the :very_expensive_inserts_with_validations_and_everything data will get the dump loaded, instead of having all the Ruby code executed.
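
To make that concrete, it would look something like this (the model and expectations are placeholders, not from a real suite):

# The first group that asks for the dataset pays the full Ruby/insert cost
# once, and a sql dump is written as a side effect.
describe "billing" do
  dataset :very_expensive_inserts_with_validations_and_everything

  it "sees the seeded records" do
    Account.count.should > 0 # hypothetical model
  end
end

# Any later group asking for the same dataset gets the dump loaded instead
# of re-running the Ruby code.
describe "reporting" do
  dataset :very_expensive_inserts_with_validations_and_everything

  it "sees the same records, loaded from the dump" do
    Account.count.should > 0
  end
end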

Another thing dataset supports is this:

describe "my stuff" do
  dataset :expensive do
    MyActiveRecord.new.save
  end
end

Here is where my memory begins to fail me :) I believe this will load the dump of :expensive and then run the block for each test. It MAY also create a dump for this block, but if it does, I now think that is not ideal, since you may create instance variables in the blocks - I wanted them to act like before(:each) blocks.
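
In use, I imagine something like this (again, just a sketch):

describe "my stuff" do
  dataset :expensive do
    # Behaves like a before(:each): instance variables set here
    # should be visible in the examples below.
    @record = MyActiveRecord.create!
  end

  it "can use what the dataset block set up" do
    @record.should_not be_nil
  end
end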

You've rekindled my interest. I think this thing could be awesome if I had time to work on it :)

jeroenvandijk commented 12 years ago

I just did a quick test for myself:

  file = Rails.root.join("tmp/app_fixtures.sql")
  if File.exist?(file)
    # load the previously dumped data straight into the database
    time = Benchmark.realtime { `psql -U jvandijk -p '' -e my_database < #{file}` }
    puts "Took #{time} to load the sql"
  else
    time = Benchmark.realtime {
      # Insert all the data using ruby code
      # .....
    }
    puts "Took #{time} to do it with code"

    # dump the data for following runs
    `pg_dump -a my_database > #{file}`
  end

Output:

  Took 13.089034795761108 to do it with code
  Took 0.036692142486572266 to load the sql

So that is really promising! Note, however, that I did something different from dataset (which dumps the whole database structure as well: pg_dump -c ...): instead of recreating the whole database I only save the data that was inserted. That makes a relatively big difference as well:

Took 0.4327859878540039 to load the sql

And since it doesn't drop and recreate the whole database it is more flexible as well, because you could do data inserts before loading the dataset if you felt like it.
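
For clarity, these are the two variants I compared (the file names here are just for illustration):

# Data-only dump: only the inserted rows, restored into the existing schema.
# This is the ~0.03s load above.
`pg_dump -a my_database > tmp/data_only.sql`

# Dump including the schema (the dataset-style approach), with clean/drop
# statements so the restore rebuilds everything. This is the ~0.43s load above.
`pg_dump -c my_database > tmp/full.sql`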

Do you see downsides to the above approach?

Jeroen

jeroenvandijk commented 12 years ago

Sorry, the above wasn't a reply to your reply :) - I hadn't refreshed.

I do think that your initiative has a lot of value, and it is a waste that it doesn't seem to be used much (I remembered seeing it a while ago, which is how I found it again). When I first saw it I didn't see the huge value, because I didn't grasp that the main advantage was speed; I thought it was just about reusing datasets, which you can do reasonably well with tools like machinist too. Now that I have a really slow test suite, I figured I needed something like what you had already built :)

Happy that I've rekindled your interest :) I hope to contribute soon :)

Jeroen

aiwilliams commented 12 years ago

Wow, that is quite a significant difference in performance. I suddenly realize how silly I was to not start and end with performance metrics - who would build a performance-focused tool without them! Live and learn :)

Re: Database schema dump/load, I don't think it would be a problem to assume that the schema remains constant during a test run. In fact, I can't think at the moment why anyone would want to reload the schema, and I may even consider it a bug that dataset does so.

Re: Value of dataset, I do believe that the only value is in the performance space. Dataset should probably be changed to support existing factory APIs, thereby allowing the marketing and implementation to be focused on its core value add.

Thanks for the interaction and encouragement!

jeroenvandijk commented 12 years ago

Cool! I'm happy to encourage you :)

I was also thinking that it might be smart to move the Test::Unit, RSpec and Cucumber adapters into separate libraries, e.g. dataset-testunit, dataset-rspec and dataset-cucumber. That way a lot of complexity in the code and tests can be removed, and the core has less chance of getting out of date.

I have some more ideas but I'll keep them for later :) This is already a lot of work I guess. If you are ok with the above I'll start working on that soon to get the core up to date.

aiwilliams commented 12 years ago

That sounds like a great idea to me. I really hope this proves valuable for you, and that you have fun :)

jeroenvandijk commented 12 years ago

I have started to speed up my own test suite locally by introducing some hacks. Here is my latest hack, which could work really well for an ActiveRecord adapter (it works perfectly for my slow scenario):

sql_file = Rails.root.join("tmp/data")
if File.exist?(sql_file)
  puts Benchmark.realtime {
    ActiveRecord::Base.connection.execute(File.read(sql_file))
  }
else
  captured_sql = []
  subscription = ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
    if payload[:sql] =~ /^INSERT/
      captured_sql << payload[:sql]
    end
  end

  # Create many objects here

  transformed_sql = captured_sql.join(';')
  ActiveSupport::Notifications.unsubscribe(subscription)
  File.open(sql_file, 'w') { |f| f.write(transformed_sql) }
end

The beauty of the above code is that you don't have to dump the complete database, and it is easy to find out what to insert because you just put a logger around the code. I guess it could be optimized even more by translating the above into a COPY statement (for PostgreSQL), but I'm not sure how big the improvement would be. This solution is as fast as the fastest pg_dump alternative (around 0.03s)!

Cool! I'm going to implement this hack in my application, see how far I can push it, and then port it back to a more dataset-like library.
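
Roughly, the reusable version I have in mind would look something like this (the method name, file path and usage are made up - just a sketch of the idea):

def load_or_record_dataset(dataset_name)
  sql_file = Rails.root.join("tmp", "#{dataset_name}.sql")

  if File.exist?(sql_file)
    # Fast path: replay the INSERT statements captured on a previous run.
    ActiveRecord::Base.connection.execute(File.read(sql_file))
  else
    captured_sql = []
    subscription = ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
      captured_sql << payload[:sql] if payload[:sql] =~ /^INSERT/
    end

    yield # run the expensive Ruby inserts once

    ActiveSupport::Notifications.unsubscribe(subscription)
    File.open(sql_file, 'w') { |f| f.write(captured_sql.join(';')) }
  end
end

# Usage: the block only runs when there is no recorded SQL yet.
load_or_record_dataset(:heavy_scenario) do
  # create many ActiveRecord objects here
end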

aiwilliams commented 12 years ago

I love it! Can hardly wait to hear how far it will carry you.

jeroenvandijk commented 12 years ago

My progress so far: https://gist.github.com/1097294

It already works quite nicely. I have added a small TODO list to it for things that I think need to be done. Do you think I could reuse your test suite?

aiwilliams commented 12 years ago

Excellent, and succinct. I'd have to review the tests, but I would guess that at least they could provide some indication of things that should be considered. I am taking a vacation next week. Perhaps I can spend some time digging into this project again.