A lot of the MaxMind implementations contain caching to avoid repeated lookups (which happen often when keying on IP). Is there such a caching implementation in Geolix? I've taken a look and I can't really see anything (forgive me if I'm missing something), so it might be a good idea to go for something like that. I'm not really sure of the implications of hitting a cache because I don't have an accurate benchmark of Geolix at the moment; a cache hit would take a couple of microseconds at most.
Also, as a bit of self-promotion, you could potentially use https://github.com/zackehh/cachex.
tl;dr first: No caching at the moment, and not really possible right now, but that is changing with the planned/ongoing "adapterization".
Right now there is nothing in here that caches anything. All queries are done live using the database contents stored in an Agent. A lookup should take only single-digit milliseconds that way. It depends on the environment, of course, but that is what I have found so far. And just hooking your own cache instance in front is probably not what you are looking for, so I will silently ignore that possibility ;)
The questions to ask when introducing caching as a first-class feature are not which library to use but "where to call the cache" and "what should be cached". The last one is easy: the result of a lookup.
But where should that be checked?
That might depend on the cache keys used. The full IP is not a good choice because of the way the database works. If you look up "10.10.10.10" you might get the result stored for "10.0.0.0", because that is the point beyond which no further segments have different details. So we need a special key that is different from the queried IP.
That saves a bunch of space in the cache!
And then there are queries from completely different IP ranges that yield the same result; extending the example above, "127.0.0.0" might return the same data. So to optimize the cache further we shouldn't use the IP at all but some sort of "result identification". The MMDB format already provides that with "the byte where the result is stored".
That offset is easy to use, and building a separate IP lookup tree while hoping to outperform the already existing one would be just nuts, especially since the existing tree still needs to be traversed for uncached entries.
And where to integrate that? At the moment only the lookup logic itself has all the information needed. There is no way to fetch the mentioned "data start byte" from the outside or to just look up something at that arbitrary location.
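To make that idea concrete, here is a toy sketch of an offset-keyed cache. This is not Geolix internals; `find_offset/1` and `decode_at/1` are made-up placeholders standing in for the real tree walk and data decoding:

```elixir
defmodule OffsetCacheSketch do
  @moduledoc """
  Toy illustration only: many IPs end their tree walk at the same data
  offset, so the offset makes a compact cache key shared between them.
  """

  def lookup(ip, cache) do
    # Walking the existing tree is cheap; decoding the data section is not.
    offset = find_offset(ip)

    case Map.fetch(cache, offset) do
      {:ok, result} ->
        # Cache hit: every IP sharing this offset skips the decoding.
        {result, cache}

      :error ->
        result = decode_at(offset)
        {result, Map.put(cache, offset, result)}
    end
  end

  # Placeholders, not Geolix functions.
  defp find_offset(_ip), do: 0
  defp decode_at(_offset), do: %{}
end
```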
Now we have the "where" and "what". But I still have to say: not really possible at the moment.
Why is that? I am currently refactoring everything to be adapter-based. The big plan is to support different database formats while providing generic usage and management. Once that is done, caching would be as simple as wrapping the adapter with any cache you like (a rough sketch follows below).
But until then there is quite some work to do, like separate supervisor hierarchies for each individual database ("city" and "country" being individual, no matter the adapter) without breaking the reload logic. Only "read a database into storage" and "look up this IP" should be done by the adapter itself.
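A cache-wrapping adapter could then look roughly like this. The `lookup/2` shape and the `Geolix.Adapter.MMDB2` module name are assumptions for the sketch, not a final API:

```elixir
defmodule MyApp.CachedAdapter do
  # Sketch only: assumes an adapter contract with a lookup/2 function.
  # At this boundary only the IP is visible, so it serves as the key.

  @inner Geolix.Adapter.MMDB2   # assumed inner adapter module

  def start_link do
    # A plain map in an Agent as the simplest possible cache.
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  def lookup(ip, opts) do
    case Agent.get(__MODULE__, &Map.get(&1, ip)) do
      nil ->
        # Miss (or an uncached nil result): delegate and store.
        result = @inner.lookup(ip, opts)
        Agent.update(__MODULE__, &Map.put(&1, ip, result))
        result

      cached ->
        cached
    end
  end
end
```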
By the way, I am definitely interested in some benchmarks if you get your hands on any numbers ;)
So, single-digit milliseconds is usually good enough; I agree. In my use case I'm aiming at sub-millisecond lookups, and Geolix is the place where I felt I could optimize further. I'm willing to put a cache on my end, but I wanted to see if you have any plans for the lib itself (mainly out of curiosity).
I hadn't considered that different IPs can have the same result (obviously they can), so that does make it a bit of a pain. Is it the lookup logic itself that takes those milliseconds? Or is it the reading back of the format?
I think the biggest part of the time goes into the actual data reading. After finding the point in the data section to read from, everything gets split and read and split and read and... With the city database providing some 50 MB of data, that means splitting and pattern matching dozens of times.
I haven't really benchmarked the individual parts, so perhaps the agents storing the metadata/tree/lookup information are "not optimal". It depends on the concurrency the agents have to work with and the actual message passing going on when moving the data from the agent to the pool worker.
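As an illustration of that splitting, here is a toy decode step in the spirit of the MMDB control byte (3 bits encode the type, 5 bits the payload length); this is the general pattern, not the Geolix decoder:

```elixir
# Fabricated "data section": control byte 0b010_00101 = type 2
# (UTF-8 string) with a 5-byte payload, followed by unrelated bytes.
data = <<0b01000101, "hello", 0xFF>>

# Every field read peels bytes off the binary with matches like these;
# a single lookup result repeats this dozens of times.
<<_type::size(3), len::size(5), rest::binary>> = data
<<value::binary-size(len), _rest::binary>> = rest

value
#=> "hello"
```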
One thing that might shave off some microseconds for you could be requesting the results as :raw. Working with raw maps instead of structs should not make a big difference when handling the results.
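For example, if I remember the lookup options correctly:

```elixir
# Full result as plain maps instead of structs:
Geolix.lookup("8.8.8.8", as: :raw)

# Or narrowed down to a single registered database:
Geolix.lookup({8, 8, 8, 8}, as: :raw, where: :city)
```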
Caching with the full IP might be worth it if there are lots of requests from the same place. Or just stuff the result into the plug session if you have something like that (a small sketch follows below). Longer term it might be okay to have an actively expiring cache entry, say 30 minutes or so, for each IP requested. That of course heavily depends on the use case, the query pattern, and the environment (enough memory for thousands of entries?)...
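A minimal sketch of the plug session idea, assuming a pipeline where the session has already been fetched and the result is small enough to serialize into it:

```elixir
defmodule MyApp.GeolixSessionPlug do
  # Sketch only: caches the lookup result per visitor in the session.
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    case get_session(conn, :geolix_result) do
      nil ->
        # First request from this session: do the real lookup once.
        result = Geolix.lookup(conn.remote_ip)
        put_session(conn, :geolix_result, result)

      _cached ->
        # Subsequent requests reuse the stored result.
        conn
    end
  end
end
```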
FWIW, with the Java and Go readers, most of the time is spent decoding the data section, not the tree. The data section on most databases is relatively small compared to the tree and it is pretty easy to cache data based on the offset, as you suggest. With the Java reader, we found that a relatively small cache of a couple MBs sped up the reader by 3x or so.
@zackehh I played around a bit with integrating a cache into the current code and came up with a "hacky" but usable way: geolix_cachehack.
There are, however, some caveats with that repo.
Modifying the database lookup meant deactivating the pooling. That is probably no big deal if nothing throws (and nothing should...), but still something to keep in mind. If pooling is necessary you could just wrap your own pool around it.
These are the times I got using a plain Agent for caching:
Your mileage will vary, but the percentages between the different strategies should be a guideline as to whether it is worth the effort of the changes. Or at least worth trying with a more suitable benchmark incorporating lots of different IPs, concurrency, and so on.
Until the changes for different adapters are completed that is probably the best you can do to lower the lookup times.
@zackehh So...
It has been "some time" since this got some attention but it might be worth it now. If you look at master there is a pretty complete adapter split in place that provides enough hooks for caching.
I have also updated geolix_cachehack with a proper caching adapter. That might even fit the needs you had if you just replace the cache with something more sophisticated than a plain Agent. I don't have any benchmarks at hand right now, but I am pretty sure the cache halved the response times.
Care to take a look after all this time?
I think the issue at hand here is "solved" now that the adapter support has been released to Hex. If anything is still unclear please do not hesitate to comment here and/or reopen this issue.
Hi @mneudert!
Sorry, I missed the previous comment asking for feedback. I'm taking a look, but it already appears that anything I require is going to be possible; I might add a PR for the sophisticated cache adapter you mentioned (if it turns out to be needed).
Thank you for your great work on this, it looks excellent :)