@janakrajchadha Just curious where this issue stood. I know you worked on this in the past, and I kind of had the sense that maybe this was done. It might be worth providing a quick status update here?
@mhucka I had worked on the Python implementations and figured out the most basic way to do this. As you may recall, this discussion was started after observing the slow response time of the PageFreezer API. As I started working on this, a few related questions came up:
I've mentioned in one of my reports that this will have to be kept on hold until we answer these questions and figure out the best way to add caching somewhere in the entire process.
From where I'm sitting:
I don't think the question of what services we are using is particularly relevant here—yes, a service or algorithm that is slow is in particular need of caching, but even a quick algorithm still downloads two files, which is a lot of time that could be skipped with a cache.
When it comes to diffing, the question of how the processing layer fits in is a solved one (diffs are generated on demand for any client by a request through the API server; we don't, for now, do any pregeneration of diffs). edgi-govdata-archiving/web-monitoring#62 is about how we might trigger the processing tools to add annotations to changes, which doesn't have anything to do with diffs. If we ever built analysis tools that actually leveraged the diff tools (none do right now), that might be slightly different, but could easily be solved by having the analysis tools request diffs the same way any client does, thereby using the same cache.
I don't really think anything needs to be kept on hold here. For the scope of caching diffs that the diff service would provide to clients, there are two clear places to do it (and we could do it in one or both):
In/in front of the API server. This would be helpful for all kinds of other requests, too! I get the impression that this issue is a bit of a barrier there.
In/in front of the processing project’s diff service. Critically, this should at least be as far up in the stack as where we choose what diff method is getting dispatched to (i.e. don't cache at the diff function level), so that the cache prevents us from ever retrieving the two versions to be diffed. This actually might be best done in front of the diff server entirely, using an Nginx plugin or something like Varnish.
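To make the dispatch-level point concrete, here's a rough sketch of what I mean (the names are hypothetical; `fetch_version`, `DIFF_FUNCTIONS`, and `diff_versions` aren't the actual processing code, and the dict stands in for whatever cache backend we'd pick):

```python
# Sketch of caching at the dispatch level (hypothetical names, not the real
# processing code). A cache hit means we never download either version and
# never run a diff function.
import urllib.request

def fetch_version(url):
    # Stand-in for however we actually retrieve a stored version's body.
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')

def html_source_diff(a_body, b_body):
    # Placeholder for one of the real diff methods.
    ...

DIFF_FUNCTIONS = {'html_source': html_source_diff}

_cache = {}  # stand-in for whatever backend we choose (in-process dict, Redis, etc.)

def diff_versions(diff_type, a_url, b_url):
    key = (diff_type, a_url, b_url)
    if key in _cache:
        return _cache[key]  # cache hit: skip the downloads and the diff entirely
    a_body = fetch_version(a_url)
    b_body = fetch_version(b_url)
    result = DIFF_FUNCTIONS[diff_type](a_body, b_body)
    _cache[key] = result
    return result
```

The important part is that the key is checked before any version bodies are fetched, not inside the individual diff functions.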
@janakrajchadha did you ever put the caching work you originally had in edgi-govdata-archiving/web-monitoring-processing#68 in a different branch? You removed it before there was ever much discussion, but IIRC, you had caching at the level of each diff function, and as noted above, it would probably be more effective at an earlier point.
I think we also talked a bit about whether the cache would be better leveraging a centralized storage across processes (since the current service runs 4) by using something like Redis or Memcached. The one you implemented, I think, used in-process memory?
I don't think the question of what services we are using is particularly relevant here
I agree with your point. One of the questions Dan and I had discussed at the time was how expensive caching would prove to be in terms of memory, and whether that would vary from one service to another. After putting more thought into it, I think it really depends on what we choose to cache, and the important information from the result of any diff service would not vary a lot.
When it comes to diffing, the question of how the processing layer fits in is a solved one
We had considered the possibility of caching all diffs (due to concerns around PF's speed) but weren't sure how computationally expensive it would be. As for processing tools leveraging the diff services, most of my work depends on using diffs, since the entire project is based on analysing changes. Once we have solved the triggering issue, I think using the same cache for diffs would probably be a good way to go.
Talking about the possible places you've suggested for caching -
In/in front of the API server - I'm not too sure about this as Dan had talked about computing diffs on the same machine instead of accessing the API. We might want to talk about this in detail. Also, I'm not aware of the timeline for the issue you've mentioned. Are we doing it sometime soon?
In/in front of the processing project’s diff service.
This actually might be best done in front of the diff server entirely
If we are computing diffs locally, this can be done in front of the local service, if I'm not wrong. I'm not sure if Nginx or Varnish will be helpful in that case (I have limited knowledge of these two).
did you ever put the caching work you originally had in edgi-govdata-archiving/web-monitoring-processing#68 in a different branch?
I have kept it in another branch on my local machine; I haven't pushed that branch yet. I had only tried it out on the function that makes calls to the PF API, as this was happening around the time our own diffing service had just come up and the initial discussion was restricted to PF's speed issues.
The one you implemented, I think, used in-process memory?
Yes, it used in-process memory and handled eviction with a Least Recently Used (LRU) cache.
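For reference, that was roughly equivalent to what Python's standard-library `functools.lru_cache` provides; a minimal sketch of the idea (with `compute_diff` as a hypothetical stand-in for the real diff call):

```python
from functools import lru_cache

def compute_diff(diff_type, a_url, b_url):
    # Hypothetical stand-in for fetching both versions and running the diff.
    ...

# maxsize bounds memory use; least recently used results are evicted first.
# The cache lives in this process's memory, so each worker process keeps its
# own separate copy.
@lru_cache(maxsize=128)
def cached_diff(diff_type, a_url, b_url):
    return compute_diff(diff_type, a_url, b_url)
```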
@Mr0grog If you can give me an idea of a good way to do this (keeping future additions in mind), I can start working on it. I'm not sure of the right direction here because of my limited knowledge of caching tools and techniques.
In/in front of the API server - I'm not too sure about this as Dan had talked about computing diffs on the same machine instead of accessing the API. We might want to talk about this in detail. Also, I'm not aware of the timeline for the issue you've mentioned. Are we doing it sometime soon?
There is no issue for this other than #64, which is an umbrella issue for a lot of ops/devops-related work. I think caching is among the lowest priority of all things we can be working on right now, so there’s no timeline or plan for it.
If we are computing diffs locally, this can be done in front of the local service, if I'm not wrong. I'm not sure if Nginx or Varnish will be helpful in that case (I have limited knowledge of these two).
I think we are talking about different things when we talk about “in front of the service” here. When I say service, I mean an always-running application that responds to requests via some API (web or otherwise), so the diff service is wm-diffing-server. Its only API is HTTP. Nginx currently sits in front of it (that is: Nginx proxies all requests to the diff service; the diff service should never be accessed directly except in development). So when I say “in front of the service,” I mean something that is accessed as a proxy to the service and which, more importantly, is generally the only way the service should be accessed [in production]. So something sitting in front of the service is both a separate application and necessarily HTTP-based. (When I say “in,” I mean in the actual Python code.)
I’m not sure of the right direction here because of my limited knowledge of caching tools and techniques.
If we want to put a cache in front of either the diff service or the monitoring API, either Nginx’s built-in caching tools or Varnish are probably the way to go. Both are reverse proxies, and given the way we are using Nginx now, both can do caching and the same load balancing work we already have Nginx doing. I’m not a deep expert on these tools, so I would probably just go ahead and figure out how to configure Nginx’s caching and not worry about Varnish.
If you want to implement your own caching inside our Python code, I’d look into leveraging either Memcached or Redis—both are mature, stable, and have great client libraries for almost every language. An in-process cache like the one you implemented is a good start, but ideally you want to move the actual cache outside your web service’s process.
You can think of Memcached like a giant hash table that you access via TCP (you generally want it running on a separate machine so that it can consume pretty much all that machine’s memory without causing your service any issues). Figure out what you want to use as a key for that hash table (e.g. the URL of the request or possibly something more concise like "{diff type}:{a hash}:{b hash}"), then just request that item from the cache. If the cache doesn’t have it, then run the algorithm and put the results back in Memcached using that same key. Redis can be used a lot like Memcached, but is much more flexible and has a lot more features. You should read up on each. Google will have plenty of info on how to do caching with either one.
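Purely as an illustration of that get-or-compute flow, here's a rough sketch using the `redis` client library (Memcached via a client like `pymemcache` would look very similar); `run_diff` is a hypothetical stand-in for dispatching to the real diff code:

```python
import hashlib
import json

import redis  # third-party client: pip install redis

cache = redis.Redis(host='localhost', port=6379)

def run_diff(diff_type, a_body, b_body):
    # Hypothetical stand-in for dispatching to the actual diff implementation.
    ...

def cached_diff(diff_type, a_body, b_body):
    # Key shaped like "{diff type}:{a hash}:{b hash}" as suggested above.
    a_hash = hashlib.sha256(a_body.encode('utf-8')).hexdigest()
    b_hash = hashlib.sha256(b_body.encode('utf-8')).hexdigest()
    key = f'{diff_type}:{a_hash}:{b_hash}'

    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: reuse the stored result

    result = run_diff(diff_type, a_body, b_body)
    cache.set(key, json.dumps(result), ex=60 * 60 * 24)  # expire after a day
    return result
```

Because the keys are content hashes plus the diff type, all of the service's worker processes talking to the same Redis instance share one cache.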
I think caching is among the lowest priority of all things we can be working on right now, so there’s no timeline or plan for it.
Sorry if that was confusing; I was talking about the issue related to rewriting the project in Python/JS, as you had mentioned it could be a minor barrier.
Also, thanks for clarifying what you mean by "in front of" the service.
So something sitting in front of the service is both a separate application and necessarily HTTP-based.
So from what I understand, this means that we will have to necessarily access the diff service through its web API, right?
Using Nginx seems like a good way to go, given that it has built-in caching tools and we already have it running for load balancing. We'll probably have to dive deeper and compare it against caching inside the Python code before making a decision.
Thank you for the super helpful information and production-related knowledge. Really appreciate that! 👐
Sorry if that was confusing; I was talking about the issue related to rewriting the project in Python/JS, as you had mentioned it could be a minor barrier.
Oh. To be clear, I don’t think the rewrite is a barrier to anything here. I meant that the fact that it is currently a Ruby app is a barrier to other people adding caching (hence the need to rewrite it). The only timeline is the timeline stated in the issue: it’s gotta happen before I leave, but unless somebody has something to say about it, there are too many other things that need to be worked on first.
So from what I understand, this means that we will have to necessarily access the diff service through its web API, right?
Well, that’s the only API the diff service has! If you want to access the contents of the diff module and not the service, then you have other options ;)
OK, there is caching (with Redis) at the DB level now (has been for a while), so I'm going to go ahead and close this.
There is a need to implement caching of the diff results from the different diffing services as it'll reduce the time it takes to access them. This will also eliminate the need to repeatedly use the API to get the same results.