estately / rets

A pure-ruby library for fetching data from RETS servers
https://github.com/estately/rets

:metadata_cache option for easier metadata caching #134

Closed: wconrad closed this 9 years ago

wconrad commented 9 years ago

NOTE: There's been a force-push; the README section shown here has changed. Please scroll down to see the latest version

This is an attempt to make metadata caching easier and more efficient. From the README:


Metadata, which is loaded when a client is first started, can be slow to fetch. To avoid the cost of fetching metadata every time the client is started, metadata can be cached.

To cache metadata, pass the :metadata_cache option to the client when you start it. The library comes with a predefined metadata cache that persists the metadata to a file. It is created with the path to which the cached metadata should be written:

metadata_cache = Rets::Metadata::FileCache.new("/tmp/metadata")

When you create the RETS client, pass it the metadata cache:

client = Rets::Client.new(
  login_url: @login_url,
  # ... etc
  metadata_cache: metadata_cache
)

If you want to persist to something other than a file, create your own metadata cache object and pass it in. It should have the same interface as the built-in FileCache:

class MyMetadataCache

  # Save the metadata.
  def save(metadata)
  end

  # Load the metadata.  Return nil if it could not be loaded.
  def load
  end

end

See the source for Rets::Metadata::FileCache for more.
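
For example, a cache that simply keeps the metadata in memory (illustrative only; this class is not part of the library) could be as small as:

class InMemoryMetadataCache

  # Save the metadata.
  def save(metadata)
    @metadata = metadata
  end

  # Load the metadata.  Return nil if it could not be loaded.
  def load
    @metadata
  end

end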


I've written no tests for this (although it seems to work fine). I wanted to find out whether you agree with this approach before trying to write tests.

hfaulds commented 9 years ago

I added a couple comments, otherwise I like it :+1:

One other thing to note. In our app we've historically saved metadata only after a successful use of the client (i.e. if a search run with the client failed, we wouldn't save the metadata). I don't know if there is a particular reason or whether it just happened that way, but it seems worth mentioning just in case it turns out there was a justification. That said, I think this approach makes more logical sense and should reduce metadata fetching compared to ours.

wconrad commented 9 years ago

I'm glad you like this. I'll start writing tests, then. I hope I can figure them out--I was a little intimidated by the metadata tests. Lots of stubbing going on there that I'll need to understand, I think.

That said I think this makes more logical sense and should reduce metadata fetching compared to our approach.

Believe it or not, my actual performance goal was to eliminate unneeded marshaling. With this PR, marshaling is only done after a successful fetch, rather than with every run of the program. Although marshaling is much faster than fetching (say, 1.5 seconds instead of 15), it's slower than not marshaling at all.

wconrad commented 9 years ago

I'll get rid of the merge conflicts, among other things. I guess a force-push will be called for.

summera commented 9 years ago

this is cool @wconrad :).

In your guys' experience, how often does the Metadata for an individual MLS change (become out of date) and what usually changes within the metadata?

wconrad commented 9 years ago

Exciting things coming, including tests. I've not used minitest before, and I need a little help. I want to run minitest against just one test file, but none of the approaches I've found seem to work for this project (usually, all tests are run no matter what I try). The officially documented way of running tests piecemeal runs nothing at all.

For now, I'll just temporarily delete the other test files from my working directory so they won't be run. It's ham-fisted, but it'll work.

How do you run a single minitest file on this project?

wconrad commented 9 years ago

With this force-push, I've separated the concerns of serialization from persistence. The :metadata_cache option controls only persistence, and the :metadata_serializer option controls only serialization. There are built-in serializers for Marshal, JSON and YAML, and a built-in cache for disk files.

From the README:


Metadata, which is loaded when a client is first started, can be slow to fetch. To avoid the cost of fetching metadata every time the client is started, metadata can be cached.

To cache metadata, pass the :metadata_cache option to the client when you start it. The library comes with a predefined metadata cache that persists the metadata to a file. It is created with the path to which the cached metadata should be written:

metadata_cache = Rets::Metadata::FileCache.new("/tmp/metadata")

When you create the RETS client, pass it the metadata cache:

client = Rets::Client.new(
  ...
  metadata_cache: metadata_cache
)

If you want to persist to something other than a file, create your own metadata cache object and pass it in. It should have the same interface as the built-in Metadata::FileCache class:

class MyMetadataCache

  # Save the metadata.  Should yield an IO-like object to a block;
  # that block will serialize the metadata to that object.
  def save(&block)
  end

  # Load the metadata.  Should yield an IO-like object to a block;
  # that block will deserialize the metadata from that object and
  # return the metadata.  Returns the metadata, or nil if it could
  # not be loaded.
  def load(&block)
  end

end
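
For instance, a cache that keeps the serialized metadata in memory rather than on disk (illustrative only; not part of the library) can be built on StringIO:

require "stringio"

class InMemoryMetadataCache

  # Save the metadata.  Yields a writable IO to the block and keeps
  # whatever the serializer wrote to it.
  def save(&block)
    io = StringIO.new
    block.call(io)
    @serialized = io.string
  end

  # Load the metadata.  Yields a readable IO over the stored string
  # and returns the block's result, or nil if nothing is cached yet.
  def load(&block)
    return nil if @serialized.nil?
    block.call(StringIO.new(@serialized))
  end

end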

By default, the metadata is serialized using Marshal. You may select JSON or YAML instead, or define your own serialization mechanism, using the :metadata_serializer option when you create the Rets::Client:

client = Rets::Client.new(
  ...
  metadata_serializer: Rets::Metadata::JsonSerializer
)

The built-in serializers cover Marshal (the default), JSON, and YAML.

To define your own serializer, create an object with this interface:

class MySerializer

  # Serialize to a file.  The library reserves the right to change
  # the type or contents of o, so don't depend on it being
  # anything in particular.
  def save(file, o)
  end

  # Deserialize from a file.  If the metadata cannot be
  # deserialized, return nil.
  def load(file)
  end

end
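
As an illustration (this is not the library's built-in implementation), a Marshal-based serializer with that interface could look like:

class MyMarshalSerializer

  # Serialize to a file.
  def save(file, o)
    Marshal.dump(o, file)
  end

  # Deserialize from a file.  Return nil if the metadata cannot be
  # deserialized.
  def load(file)
    Marshal.load(file)
  rescue TypeError
    nil
  end

end
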
wconrad commented 9 years ago

Benchmarks for the different serialization methods. This is with about 500K of metadata:

           SAVE      LOAD
MARSHAL    0.001    0.001
JSON       0.008    0.012
YAML       0.017    0.019

Marshal beats JSON and YAML by an order of magnitude, but they're all so fast that the difference doesn't matter for this purpose.

hfaulds commented 9 years ago

I'm really happy with this.

I don't know how @dougcole wants to merge this, whether he wants us to try it out in production first or whether he's happy for it to be merged now.

tcrayford commented 9 years ago

Likewise, I really like this patch. Would love to get @dougcole's opinion. I definitely think it should mark a new major rets release, and it's a big improvement!

wconrad commented 9 years ago

@hfaulds and @tcrayford - Thanks!

@summera I don't know how often metadata changes. I'm a babe-in-the-woods when it comes to RETS.

If this library is using semantic versioning (is it?), then this PR would cause a minor level version increment: It's a non-breaking (backward compatible) API change.

summera commented 9 years ago

@wconrad haha no worries! Fairly new to me as well.

@hfaulds @dougcole @tcrayford would you guys happen to know how often the Metadata for an individual MLS changes (become out of date) and what usually changes within the metadata?

phiggins commented 9 years ago

Hi @wconrad, thanks for your contributions! I had a few comments:

  1. The interaction between the serializers and caches seems odd to me. It seems like the serializer should just be in charge of transforming the data, not doing IO.
  2. Also, if IO were taken out of the serializers and the methods were changed from save/load to dump/load, you could just use the constants Marshal, YAML and JSON instead of having to define a class (see the sketch below).
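
Roughly (just a sketch, not working code against the current branch):

client = Rets::Client.new(
  ...
  metadata_serializer: JSON  # or Marshal, or YAML
)
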
wconrad commented 9 years ago

@phiggins You're welcome, and thank you for reviewing the PR and for your suggestions.

The interaction between the serializers and caches seems odd to me. It seems like the serializer should just be in charge of transforming the data, not doing IO.

Having the serializers read from and write to IO objects (which aren't necessarily files) means that the interface doesn't force serialized metadata to live in memory as one big string, either on the way out or on the way back in. That's good for speed and memory use. The IO-based interface is also not a problem for caches that need to work with strings instead of files, since they can use StringIO (you didn't mention this, but it was on my mind when I settled on the IO-based interface, so it bears mentioning).

Also, if IO were taken out of the serializers and the methods were changed from save/load to dump/load, you could just use the constants Marshal, YAML and JSON instead of having to define a class.

This is a good point. It would greatly simplify the three serializers used here. At first I thought it would make it harder to write a user-defined serializer, but that's just as easy, because the code wouldn't really care whether the serializer it was given was a class or an instance of a class. I like this idea and would do it, but I don't know how to deal with the error handling: Each serialization method should return nil when the metadata could not be deserialized, and that looks a little different for each serializer:

  def load(file)
    JSON.load(file)
  rescue JSON::ParserError
    nil
  end

vs.

  def load(file)
    Marshal.load(file)
  rescue TypeError
    nil
  end

vs.

  def load(file)
    YAML.load(file)
  rescue Psych::SyntaxError
    nil
  end

I suppose the class idea would work if the code just rescued StandardError, but I would prefer to catch specific exceptions even if it means a little more code: rescuing StandardError gives bugs a great place to hide. Or, the library could accept another option naming the exception or exceptions that should be caught, but that seems akin to the "parallel arrays" anti-pattern: two things that belong together but are artificially separated.

Do you have any ideas?

dougcole commented 9 years ago

@summera metadata changes normally only occur when an mls wants to update existing fields (adding new values to lookup types, changing the number of CHARs allowed in specific fields, etc), add new fields or remove fields. It's completely dependent on the mls, but I think roughly once every few months is pretty common.

I have seen mlses that update more often than that, although most of those seem to be technical bugs on their end rather than valuable updates. We also have one mls we work with where a bug in the rets library causes the metadata to be invalidated on every fetch. I'll open an issue for that, as I just realized it doesn't have one.

dougcole commented 9 years ago

This is looking great! I'll wait for @wconrad and @phiggins to finish discussing the code, but I'd vote for running this branch in production for a bit before merging. The test coverage in here is pretty good, but there are enough crazy edge cases in production rets servers that I don't completely trust our test suite to catch everything.

wconrad commented 9 years ago

@dougcole Thanks for reviewing it. I'm glad you like it. I agree that some use in production would be good, and not just for edge cases: the test coverage is good except that there is no test covering the changes to the Client class; the tests I added are only for the new classes.

This has got to be one of the most vibrant communities I've run into on github. Reviewing and improving code with y'all is quite a bit of fun.

summera commented 9 years ago

@dougcole Thanks for the info!

summera commented 9 years ago

@dougcole When consuming rets data, are you guys dynamically looking up rets fields, lookup type values, etc. based on your cached metadata?

phiggins commented 9 years ago

@wconrad:

Having the serializers work by reading/writing from IO objects (which aren't necessarily files) means that the interface doesn't force serialized metadata to live in memory as one big string, either on the way out or on the way back in. That's good for speed and memory use.

This is definitely true, but is this a problem you're running into or just thinking ahead? The serializers doing IO is not a blocker for me; I thought it was worth mentioning in the interest of more straightforward code.

Each serialization method should return nil when the metadata could not be deserialized

Should they? It seems like swallowing exceptions around serializing/deserializing could hide problems with your caching mechanism.

wconrad commented 9 years ago

@phiggins:

... is this a problem you're running into or just thinking ahead?

I was thinking ahead (a polite euphemism for "optimizing prematurely"), so let's have some benchmarks. This is for Metrolist's metadata, which is about 500K. I benchmarked each of the three serialization methods two ways: serializing directly to a file (aka "io") and serializing to a string first.

     load Marshal io      4.800k (± 4.0%) i/s -     95.872k
 load Marshal string      6.559k (± 4.9%) i/s -    131.253k
        load Json io    185.867  (± 2.2%) i/s -      3.726k
    load Json string    187.073  (± 2.1%) i/s -      3.744k
        load Yaml io    123.938  (± 2.4%) i/s -      2.484k
    load Yaml string     96.608  (± 2.1%) i/s -      1.935k
     save Marshal io    240.324  (±32.5%) i/s -      3.280k
 save Marshal string    240.081  (±34.6%) i/s -      3.140k
        save Json io    121.222  (±19.8%) i/s -      2.296k
    save Json string    145.500  (±33.0%) i/s -      2.508k
        save Yaml io     82.326  (±17.0%) i/s -      1.561k
    save Yaml string    109.302  (±14.6%) i/s -      2.120k

Serializing via a string is sometimes faster and sometimes slower than writing straight to the IO, but never by enough to matter: all of the serialization methods are fast enough for this purpose. Given that, I think the simpler code (string) should win.
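
In case anyone wants to reproduce something similar, here is a rough sketch of how the comparison can be set up with the benchmark-ips gem (not the actual script I ran; the metadata below is a stand-in):

require "benchmark/ips"
require "json"

metadata = { "Foo" => "x" * 500_000 }  # stand-in for ~500K of real metadata

Benchmark.ips do |x|
  x.report("save Json io") do
    File.open("/tmp/metadata", "w") { |f| JSON.dump(metadata, f) }
  end
  x.report("save Json string") do
    File.write("/tmp/metadata", JSON.dump(metadata))
  end
end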

It seems like swallowing exceptions around serializing/deserializing could hide problems with your caching mechanism.

The primary case for swallowing those exceptions is when the programmer switches from one serialization method to another, which causes a deserialization failure on the next load. I found it convenient for such a switch to be a non-event.

I think something could be logged on a deserialization failure. There is some existing logging of metadata cache events via the ClientProgressReporter class. This PR makes the existing logging not-quite-right. It ought to be reworked, and that could include logging deserialization failures.

wconrad commented 9 years ago

@phiggins, I've mostly converted the code to use String-based serializers, but hit a snag in the client tests. With IO-based serializers, the serializers aren't called by NullCache (the default cache); with String-based serializers, they are always called. This matters because the client tests use mock metadata, which can't be serialized. At that point, I lost heart. I have to admit I'm not highly motivated to make this change, since I don't consider the IO-based serializers to be all that complex.

phiggins commented 9 years ago

@wconrad It seems like everyone's in favor of the code as it stands, and if it's an improvement we should :shipit:.

dougcole commented 9 years ago

fwiw I agree with @phiggins: increasing the ease of switching serializers seems like a small thing compared to hiding potential bugs.

What about catching the exception, deleting the cached value and then re-raising the exception? That would let the system work quickly after changing serialization methods without too much pain, but wouldn't hide existing problems too much.
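
Something like this, say (just a sketch; it assumes the serializers let deserialization errors propagate, and cache.delete is a hypothetical method the built-in FileCache doesn't have):

def load_cached_metadata(cache, serializer)
  cache.load { |file| serializer.load(file) }
rescue StandardError
  cache.delete  # hypothetical: drop the stale cached value
  raise
end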

dougcole commented 9 years ago

I also agree that this is already a big improvement and if you're starting to burn out on code review @wconrad just say so, I think it's safe to merge as is and improve over time.

wconrad commented 9 years ago

@dougcole I've got a little steam left :).

Having the program die the first time after changing serializers, and then work the next time, would confuse me if I didn't know better. As a first-time user of the library, I'd file an issue against that, and then someone would have to tell me I should have noticed the note in the README that it was expected behavior.

Although I think the possibility of bugs hiding in this particular rescue is low, I do agree with you and @phiggins that the possibility is there. I also want to make serialization failures visible, but I want to do it differently: Instead of killing the program, I want to log the failure. To that end, what do you think of this idea that I brought up earlier?

I think something could be logged on a deserialization failure. There is some existing logging of metadata cache events via the ClientProgressReporter class. This PR makes the existing logging not-quite-right. It ought to be reworked, and that could include logging deserialization failures.

hfaulds commented 9 years ago

I've finally gotten around to creating a branch of our code using this. We should be able to test it in production over the next few days.

hfaulds commented 9 years ago

This has been running fine in production without any problems as far as I can see.

wconrad commented 9 years ago

Outstanding, thanks!