Closed: wconrad closed this pull request 9 years ago.
I added a couple comments, otherwise I like it :+1:
One other thing to note: in our app we've historically saved metadata after a successful use of the client (i.e. if we ran a search with the client that failed, then we wouldn't save the metadata). I don't know if there is a particular reason or whether it just happened that way, but it seems worth mentioning just in case it turns out that there was a justification. That said I think this makes more logical sense and should reduce metadata fetching compared to our approach.
I'm glad you like this. I'll start writing tests, then. I hope I can figure them out; I was a little intimidated by the metadata tests. There's a lot of stubbing going on there that I'll need to understand, I think.
That said I think this makes more logical sense and should reduce metadata fetching compared to our approach.
Believe it or not, my actual performance goal was to eliminate unneeded marshaling. With this PR, marshaling is only done after a successful fetch, rather than with every run of the program. Although marshaling is much faster than fetching (say, 1.5 seconds instead of 15), it's slower than not marshaling at all.
I'll get rid of the merge conflicts, among other things. I guess a force-push will be called for.
this is cool @wconrad :).
In your guys' experience, how often does the Metadata for an individual MLS change (become out of date) and what usually changes within the metadata?
Exciting things coming, including tests. I've not used minitest before, and I need a little help. I want to run minitest against just one test file, but none of these answers seem to work for this project (usually, all tests are run no matter what I try). The officially documented way of running a test piecemeal runs nothing at all.
For now, I'll just temporarily delete the other files from my working directory so they won't be run. It's ham-fisted, but it'll work.
How do you run a single minitest file on this project?
With this force-push, I've separated the concerns of serialization and persistence. The :metadata_cache option controls only persistence, and the :metadata_serializer option controls only serialization. There are built-in serializers for Marshal, JSON, and YAML, and a built-in cache for disk files.
From the README:
Metadata, which is loaded when a client is first started, can be slow to fetch. To avoid the cost of fetching metadata every time the client is started, metadata can be cached.
To cache metadata, pass the :metadata_cache option to the client when you start it. The library comes with a predefined metadata cache that persists the metadata to a file. It is created with the path to which the cached metadata should be written:
metadata_cache = Rets::Metadata::FileCache.new("/tmp/metadata")
When you create the RETS client, pass it the metadata cache:
client = Rets::Client.new(
...
metadata_cache: metadata_cache
)
If you want to persist to something other than a file, create your own metadata cache object and pass it in. It should have the same interface as the built-in Metadata::FileCache class:
class MyMetadataCache
# Save the metadata. Should yield an IO-like object to a block;
# that block will serialize the metadata to that object.
def save(&block)
end
# Load the metadata. Should yield an IO-like object to a block;
# that block will deserialize the metadata from that object and
# return the metadata. Returns the metadata, or nil if it could
# not be loaded.
def load(&block)
end
end
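As a concrete illustration, here's a hypothetical in-memory cache matching that interface. The class name is mine, not the library's; it just shows that the IO-based interface works fine without a real file, by way of StringIO:

```ruby
require "stringio"

# Hypothetical in-memory cache with the same save/load interface as
# the built-in Metadata::FileCache.  Useful when you'd rather not
# touch the disk.
class MemoryMetadataCache
  def initialize
    @serialized = nil
  end

  # Yield an IO-like object for the serializer to write to, then keep
  # the serialized bytes in memory.
  def save(&block)
    io = StringIO.new
    block.call(io)
    @serialized = io.string
  end

  # Yield an IO-like object for the serializer to read from.  Returns
  # whatever the block returns: the metadata, or nil if it could not
  # be loaded.
  def load(&block)
    return nil if @serialized.nil?
    block.call(StringIO.new(@serialized))
  end
end
```

Passing it to the client would then look the same as with FileCache: `metadata_cache: MemoryMetadataCache.new`.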
By default, the metadata is serialized using Marshal. You may select JSON or YAML instead, or define your own serialization mechanism, using the :metadata_serializer option when you create the Rets::Client:
client = Rets::Client.new(
...
metadata_serializer: Rets::Metadata::JsonSerializer
)
The built-in serializers cover Marshal, JSON, and YAML.
To define your own serializer, create an object with this interface:
class MySerializer
# Serialize to a file. The library reserves the right to change
# the type or contents of o, so don't depend on it being
# anything in particular.
def save(file, o)
end
# Deserialize from a file. If the metadata cannot be
# deserialized, return nil.
def load(file)
end
end
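For example, a hypothetical JSON-based serializer with that interface might look like this. The class name is mine, and it assumes `o` is JSON-friendly, which the library explicitly doesn't promise:

```ruby
require "json"

# Hypothetical user-defined serializer with the save/load interface
# described above.
class MyJsonSerializer
  # Serialize o to the given file (an IO-like object).
  def save(file, o)
    file.write(JSON.generate(o))
  end

  # Deserialize from the file; return nil if the contents aren't
  # valid JSON.
  def load(file)
    JSON.parse(file.read)
  rescue JSON::ParserError
    nil
  end
end
```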
Benchmarks for the different serialization methods, with about 500 KB of metadata:
         SAVE   LOAD
MARSHAL  0.001  0.001
JSON     0.008  0.012
YAML     0.017  0.019
Marshal beats JSON and YAML by an order of magnitude, but they're all so fast that the difference doesn't matter for this purpose.
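For anyone who wants to reproduce numbers like these, here is a rough stdlib-only sketch. The sample data is invented, not real RETS metadata, so the absolute times will differ:

```ruby
require "benchmark"
require "json"
require "yaml"

# Stand-in for metadata; real RETS metadata is a larger structure.
data = { "fields" => (1..1000).map { |i| { "name" => "field#{i}", "type" => "CHAR" } } }

engines = {
  "MARSHAL" => [->(d) { Marshal.dump(d) }, ->(s) { Marshal.load(s) }],
  "JSON"    => [->(d) { JSON.dump(d) },    ->(s) { JSON.load(s) }],
  "YAML"    => [->(d) { YAML.dump(d) },    ->(s) { YAML.load(s) }],
}

results = {}
engines.each do |name, (dumper, loader)|
  blob = dumper.call(data)
  results[name] = {
    save: Benchmark.realtime { dumper.call(data) },
    load: Benchmark.realtime { loader.call(blob) },
  }
end

results.each do |name, t|
  printf("%-8s save %.4fs  load %.4fs\n", name, t[:save], t[:load])
end
```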
I'm really happy with this.
I don't know how @dougcole wants to merge this, whether he wants us to try it out in production first or whether he's happy for it to be merged now.
Likewise, I really like this patch. Would love to get @dougcole's opinion. I definitely think it should mark a new major rets release, and it's a big improvement!
@hfaulds and @tcrayford - Thanks!
@summera I don't know how often metadata changes. I'm a babe-in-the-woods when it comes to RETS.
If this library is using semantic versioning (is it?), then this PR would cause a minor level version increment: It's a non-breaking (backward compatible) API change.
@wconrad haha no worries! Fairly new to me as well.
@hfaulds @dougcole @tcrayford would you guys happen to know how often the Metadata for an individual MLS changes (become out of date) and what usually changes within the metadata?
Hi @wconrad, thanks for your contributions! I had a few comments:
@phiggins You're welcome, and thank you for reviewing the PR and for your suggestions.
The interaction between the serializers and caches seems odd to me. It seems like the serializer should just be in charge of transforming the data, not doing IO.
Having the serializers work by reading/writing from IO objects (which aren't necessarily files) means the interface doesn't force serialized metadata to live in memory as one big string, either on the way out or on the way back in. That's good for speed and memory use. The IO-based interface is also not a problem for caches that need to work with strings instead of files, since they can use StringIO. (You didn't mention this, but it was on my mind when I settled on the IO-based interface, so it bears mentioning.)
Also, if IO was taken out of the serializers and the methods were changed from save/load to dump/load, you could just use the constants Marshal, YAML, and JSON instead of having to define a class.
This is a good point. It would greatly simplify the three serializers used here. At first I thought it would make it harder to write a user-defined serializer, but it's just as easy, because the code wouldn't really care whether the serializer it was given was a class or an instance of one. I like this idea and would do it, but I don't know how to deal with the error handling: Each serialization method should return nil when the metadata could not be deserialized, and that looks a little different for each serializer:
def load(file)
JSON.load(file)
rescue JSON::ParserError
nil
end
vs.
def load(file)
Marshal.load(file)
rescue TypeError
nil
end
vs.
def load(file)
YAML.load(file)
rescue Psych::SyntaxError
nil
end
I suppose the class idea would work if the code just rescued StandardError, but I would prefer to catch specific exceptions even if it means a little more code: rescuing StandardError gives bugs a great place to hide. Or, the library could accept another option naming the exception or exceptions that should be caught, but that seems akin to the "parallel arrays" anti-pattern: two things that belong together but are artificially separated.
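One way I can imagine keeping each engine's exceptions next to it, instead of in a parallel option, is a small adapter that bundles a dump/load module with the exceptions that mean "couldn't deserialize". This is just a sketch; none of these names exist in the library:

```ruby
require "json"
require "yaml"

# Hypothetical adapter pairing a Marshal-like engine with the
# exception classes that mean "this data can't be deserialized".
class GuardedSerializer
  def initialize(engine, *load_errors)
    @engine = engine
    @load_errors = load_errors
  end

  # Serialize o to the file using the engine's dump.
  def save(file, o)
    file.write(@engine.dump(o))
  end

  # Deserialize from the file; return nil on the engine's known
  # deserialization failures, letting anything else propagate.
  def load(file)
    @engine.load(file.read)
  rescue *@load_errors
    nil
  end
end

MARSHAL_SERIALIZER = GuardedSerializer.new(Marshal, TypeError)
JSON_SERIALIZER    = GuardedSerializer.new(JSON, JSON::ParserError)
YAML_SERIALIZER    = GuardedSerializer.new(YAML, Psych::SyntaxError)
```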
Do you have any ideas?
@summera Metadata changes normally only occur when an MLS wants to update existing fields (adding new values to lookup types, changing the number of CHARs allowed in specific fields, etc.), add new fields, or remove fields. It's completely dependent on the MLS, but I think roughly once every few months is pretty common.
I have seen MLSes that update more often than that, although most of those seem to be technical bugs on their end rather than valuable updates. We also have one MLS we work with where a bug in the rets library causes the metadata to be invalidated on every fetch. I'll open an issue for that, as I just realized it doesn't have one.
This is looking great! I'll wait for @wconrad and @phiggins to finish discussing the code, but I'd vote for running this branch in production for a bit before merging. The test coverage in here is pretty good, but there are enough crazy edge cases in production rets servers that I don't completely trust our test suite to catch everything.
@dougcole Thanks for reviewing it. I'm glad you like it. I agree that some use in production would be good, and not just for edge cases: the test coverage is good except that no test covers the changes to the Client class. The tests I added are only for the new classes.
This has got to be one of the most vibrant communities I've run into on github. Reviewing and improving code with y'all is quite a bit of fun.
@dougcole Thanks for the info!
@dougcole When consuming rets data, are you guys dynamically looking up rets fields, lookup type values, etc. based on your cached metadata?
@wconrad:
Having the serializers work by reading/writing from IO objects (which aren't necessarily files) means that the interface doesn't force serialized metadata to live in memory as one big string, either on the way out or on the way back in. That's good for speed and memory use.
This is definitely true, but is this a problem you're running into or just thinking ahead? The serializers doing IO is not a blocker for me; I thought it was worth mentioning in the interest of more straightforward code.
Each serialization method should return nil when the metadata could not be deserialized
Should they? It seems like swallowing exceptions around serializing/deserializing could hide problems with your caching mechanism.
@phiggins:
... is this a problem you're running into or just thinking ahead?
I was thinking ahead (a polite euphemism for "optimizing prematurely"), so let's have some benchmarks. This is for Metrolist's metadata, which is about 500 KB. I tested each of the three serialization methods, and for each method I tested both the "serialize directly to a file" (aka "io") approach and the "serialize to a string" approach.
load Marshal io 4.800k (± 4.0%) i/s - 95.872k
load Marshal string 6.559k (± 4.9%) i/s - 131.253k
load Json io 185.867 (± 2.2%) i/s - 3.726k
load Json string 187.073 (± 2.1%) i/s - 3.744k
load Yaml io 123.938 (± 2.4%) i/s - 2.484k
load Yaml string 96.608 (± 2.1%) i/s - 1.935k
save Marshal io 240.324 (±32.5%) i/s - 3.280k
save Marshal string 240.081 (±34.6%) i/s - 3.140k
save Json io 121.222 (±19.8%) i/s - 2.296k
save Json string 145.500 (±33.0%) i/s - 2.508k
save Yaml io 82.326 (±17.0%) i/s - 1.561k
save Yaml string 109.302 (±14.6%) i/s - 2.120k
Serializing using a string is usually slower, sometimes faster; sometimes by a lot, sometimes not by much. However, all of the serialization methods are so fast that it really doesn't matter. Given that, I think the simpler code (string) should win.
It seems like swallowing exceptions around serializing/deserializing could hide problems with your caching mechanism.
The primary case for swallowing those exceptions is when the programmer switches from one serialization method to another, which causes a deserialization failure. I found it convenient for such a switch to be a non-event.
I think something could be logged on a deserialization failure. There is some existing logging of metadata cache events via the ClientProgressReporter class. This PR makes the existing logging not-quite-right. It ought to be reworked, and that could include logging deserialization failures.
@phiggins, I've mostly converted the code to use String-based serializers, but hit a snag in the client tests. With IO-based serializers, the serializers aren't called by NullCache (the default cache); with String-based serializers, the serializers are always called. This matters because the cache tests use mock metadata, which can't be serialized. At that point, I lost heart. I have to admit that I'm not highly motivated to make this change, since I don't consider the IO-based serializers to be all that complex.
@wconrad It seems like everyone's in favor of the code as it stands, and if it's an improvement we should :shipit:.
fwiw I agree with @phiggins: increasing the ease of switching serializers seems like a small thing compared to hiding potential bugs.
What about catching the exception, deleting the cached value and then re-raising the exception? That would let the system work quickly after changing serialization methods without too much pain, but wouldn't hide existing problems too much.
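A rough sketch of that idea, assuming serializers were changed to raise on bad data rather than return nil (the method and names here are hypothetical, not the library's):

```ruby
# Hypothetical: on a deserialization failure, delete the cached file
# so the next run refetches cleanly, then re-raise so the failure
# stays visible instead of being silently swallowed.
def load_cached_metadata(cache_path, serializer)
  File.open(cache_path, "rb") { |f| serializer.load(f) }
rescue StandardError
  File.delete(cache_path) if File.exist?(cache_path)
  raise
end
```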
I also agree that this is already a big improvement and if you're starting to burn out on code review @wconrad just say so, I think it's safe to merge as is and improve over time.
@dougcole I've got a little steam left :).
Having the program die the first time after changing serializers, and then work the next time, would confuse me if I didn't know better. As a first-time user of the library, I'd file an issue against that, and then someone would have to tell me I should have noticed the note in the README that it was expected behavior.
Although I think the possibility of bugs hiding in this particular rescue is low, I do agree with you and @phiggins that the possibility is there. I also want to make serialization failures visible, but I want to do it differently: Instead of killing the program, I want to log the failure. To that end, what do you think of this idea that I brought up earlier?
I think something could be logged on a deserialization failure. There is some existing logging of metadata cache events via the ClientProgressReporter class. This PR makes the existing logging not-quite-right. It ought to be reworked, and that could include logging deserialization failures.
I've finally gotten around to creating a branch of our code that uses this. We should be able to test it in production over the next few days.
This has been running fine in production without any problems as far as I can see.
Outstanding, thanks!
NOTE: There's been a force-push; the README section shown here has changed. Please scroll down to see the latest version
This is an attempt to make metadata caching easier and more efficient. From the README:
Metadata, which is loaded when a client is first started, can be slow to fetch. To avoid the cost of fetching metadata every time the client is started, metadata can be cached.
To cache metadata, pass the :metadata_cache option to the client when you start it. The library comes with a predefined metadata cache that persists the metadata to a file. It is created with the path to which the cached metadata should be written:
When you create the RETS client, pass it the metadata cache:
If you want to persist to something other than a file, create your own metadata cache object and pass it in. It should have the same interface as the built-in FileCache:
See the source for Rets::Metadata::FileCache for more.
I've written no tests for this (although it seems to work fine). I wanted to find out whether you agree with this approach before trying to write tests.