Closed ttys0dev closed 1 year ago
Since you are using k8s, have you considered using nginx-ingress? If so, nginx will do the caching you are looking for. I have done it once, in the past, for a bare-metal server.
Thanks for your persistence, @ttys0dev. I think AWS's application load balancer tracks timeouts. I'll check when I get a sec, and do more research on switching the gunicorn worker model.
Many of the fibers of my being resist adding another layer to our architecture, if we can avoid it, @troglodite2 :)
Since you are using k8s, have you considered using nginx-ingress? If so, nginx will do the caching you are looking for.
Agree that's a good option, nginx has good performance and lots of flexibility for handling this sort of thing.
I'll check when I get a sec, and do more research on switching the gunicorn worker model.
This is a good stopgap but probably isn't the best long-term option. Using an nginx cache will have much better performance overall, since it doesn't rely on the relatively inefficient thread-per-connection model that non-asyncio Django uses for handling IO.
Many of the fibers of my being resist adding another layer to our architecture, if we can avoid it, @troglodite2 :)
One potential option is to migrate the Django application to an asyncio request model, which is conceptually similar to how something like nginx processes requests. This is not a trivial change, however, as you would effectively need to migrate from WSGI to ASGI and likely make other changes to avoid the synchronous codepaths.
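For what it's worth, the entrypoint side of that migration is small; the real work is auditing the synchronous codepaths behind it. A minimal sketch, assuming a hypothetical settings module path (not necessarily CourtListener's actual layout):

# asgi.py -- illustrative only; module paths are placeholders
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "cl_settings")  # placeholder path

# Returns an ASGI callable; gunicorn would then need an async-capable worker
# class (e.g. uvicorn's UvicornWorker) instead of sync/gthread workers.
application = get_asgi_application()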
So I've been hesitant to fix this issue for a few reasons:
I don't have evidence from the field that it's a real problem. For example, when I look at our 504 errors from the ALB, there are about nine of them per week, unless the server is overwhelmed (but that's another story). Nine isn't great, and I'd like for it to not be an issue, but it's not a lot, I'll say that much.
It's a risky change that's hard to test in advance. The Gunicorn docker build isn't normally part of our docker compose, so testing it takes a bit of work.
And if we do make a mistake, it's likely to crash the front end, which is scary.
So I guess I'm sort of +0 on this. It seems like a good idea, and I'm pretty convinced y'all know what you're talking about, but I'd like to either have instructions to test it myself or to have somebody say they built the image with the tweak and it worked fine.
I hope that makes sense and explains my trepidation here.
- For example, when I look at our 504 errors from the ALB, there are about nine of them per week, unless the server is overwhelmed (but that's another story). Nine isn't great, and I'd like for it to not be an issue, but it's not a lot, I'll say that much.
From my reading of the docs, a 504 error is an idle connection timeout error. That sort of timeout is, from my understanding, unrelated to the gunicorn 3 minute worker timeout issue, since the timeout there is for a non-idle connection (i.e. the upload timeout does not trigger because the connection is idle, but rather because the worker timeout terminates a non-idle connection).
- It's a risky change that's hard to test in advance. The Gunicorn docker build isn't normally part of our docker compose, so testing it takes a bit of work.
How are changes to the gunicorn docker build normally tested?
I'd like to either have instructions to test it myself or to have somebody say they built the image with the tweak and it worked fine.
I'll try and test things out locally.
From my reading of the docs a 504 error is an idle connection timeout error
I could be wrong here, but I wasn't sure, so I checked what status code it was by going to a page on CL that I know times out, and it returned a 504?
How are changes to the gunicorn docker build normally tested?
Just by hand. They're rarely changed.
I'll try and test things out locally.
That'll be great. If you can say how you tested, that'd be helpful to future reviews too.
I checked what status code it was by going to a page on CL that I know times out, and it returned a 504?
I mean, you could have a page with a 504 timeout issue, but it's probably not the same issue as the 3 minute upload timeout (since the upload is not idle when the connection times out). Does the 504 come back after 3 minutes, or does it come back earlier on that page you tested with? If the 504 error response comes back before 3 minutes, it's unlikely to be related, as that's well before the worker timeout should trigger.
OK, good call. I reproduced the issue by visiting the slow URL, and it returned a 504, but when I throttled the RECAP extension itself (in Chrome) while attempting to upload a very big document, that threw a 502.
So looking at 502 errors, there's more like 1,000/day. That seems like enough to qualify as a problem!
If you can say how you tested, that'd be helpful to future reviews too.
Did some testing by tweaking the docker compose to use gunicorn. Static assets don't seem to be set up right, but it seems to be enough to verify that basic Django app functionality works as before.
diff --git a/docker/courtlistener/docker-compose.yml b/docker/courtlistener/docker-compose.yml
index ac3e7a8f6..248769d18 100644
--- a/docker/courtlistener/docker-compose.yml
+++ b/docker/courtlistener/docker-compose.yml
@@ -62,7 +62,7 @@ services:
cl-django:
container_name: cl-django
- image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-dev}
+ image: ${DJANGO_DOCKER_IMAGE:-freelawproject/courtlistener:latest-web-prod}
depends_on:
- cl-postgresql
- cl-redis
diff --git a/docker/django/Dockerfile b/docker/django/Dockerfile
index eb0b9b961..503b46a48 100644
--- a/docker/django/Dockerfile
+++ b/docker/django/Dockerfile
@@ -111,12 +111,16 @@ CMD python /opt/courtlistener/manage.py migrate && \
#freelawproject/courtlistener:latest-web-prod
FROM python-base as web-prod
-CMD gunicorn cl_wsgi:application \
+CMD python /opt/courtlistener/manage.py migrate && \
+ python /opt/courtlistener/manage.py createcachetable && \
+ gunicorn cl_wsgi:application \
--chdir /opt/courtlistener/docker/django/wsgi-configs/ \
--user www-data \
--group www-data \
# Set high number of workers. Docs recommend 2-4× core count`
--workers ${NUM_WORKERS:-48} \
+ --worker-class gthread \
+ --threads 10 \
# Allow longer queries to solr.
--limit-request-line 6000 \
# Reset each worker once in a while
I then rebuilt the image and ran the docker compose env normally:
make image --file docker/django/Makefile -e VERSION=$(git rev-parse --short HEAD)
So looking at 502 errors, there's more like 1,000/day. That seems like enough to qualify as a problem!
Yep, that sounds a lot more plausible based on what I've seen.
Cool, thank you, this is reassuring. I did some more reading and I'm sold. Right now we use k8s with:
# Gunicorn recommends 2-4 workers per CPU so this number is tied
# directly to the CPU requests key, below.
- name: NUM_WORKERS
value: "3"
And:
resources:
requests:
cpu: 300m
limits:
cpu: 600m
Spread across a bunch of pod replicas. From my reading, it seems like we might as well keep the workers and CPU where they are and add some threads. The PR doesn't have threads set up, and I haven't researched yet what that value should be, but it seems like we should do that at the same time, right?
One other comment. The gunicorn docs and default are not great. I get it if you're stuck behind a bad default from ages ago, but the docs seem like they should be louder about recommending gthread. It feels like only upside.
I found this article useful too, when thinking about workers vs. threads: https://stackoverflow.com/questions/38425620/gunicorn-workers-and-threads
From my reading, it seems like we might as well keep the workers and CPU where they are and add some threads. The PR doesn't have threads set up, and I haven't researched yet what that value should be, but it seems like we should do that at the same time, right?
Yeah, I think that makes the most sense. We probably want a good number of threads (I've now set it to 10 threads per worker in the PR, without changing the total workers), since we need to handle a potentially large number of concurrent uploads, which only need their own threads and not their own workers (uploads are IO-bound workloads until complete).
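For reference, the PR sets these as CLI flags in the Dockerfile; the same knobs expressed as a gunicorn.conf.py would look roughly like this (a sketch using the values discussed above, not a definitive recommendation):

# gunicorn.conf.py -- sketch only; the actual change lives in the Dockerfile CMD
import os

wsgi_app = "cl_wsgi:application"
bind = "0.0.0.0:8000"

# Workers stay tied to CPU; threads soak up the IO-bound concurrent uploads.
workers = int(os.environ.get("NUM_WORKERS", 3))
worker_class = "gthread"
threads = 10

# Give slow uploads time to finish before the worker is treated as hung.
timeout = 180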
Cool. I merged it. It'll auto-deploy next time a staff member's PR gets merged (unfortunately non-staff PRs don't have the needed permissions).
Thank you for pushing on this and making sure it happened. I guess the next step is to switch all our microservices to use threads too....
Anything else to do here though?
I guess the next step is to switch all our microservices to use threads too....
Is anything needed for that or should that get picked up automatically?
Anything else to do here though?
Well, having a retry mechanism in the extension would probably be a good idea. Maybe with some sort of upload success/failure log that can be accessed by clicking on the extension icon. I think most users are unlikely to pull up the developer network/console logs, so they likely won't report issues unless there is an easy way for them to see that they have an issue.
I'm not opposed to an auto-retry mechanism, except for the thundering herds problem it could create, I guess. I'm also +1 for some sort of upload log the user can see, though that has always seemed harder. If you want to pursue these ideas, let's get issues opened in freelawproject/recap so they can be hashed out.
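On the thundering-herd point: retries with exponential backoff plus random jitter mostly sidestep it. The extension itself is JavaScript, but here's the idea sketched in Python, with made-up names and limits:

import random
import time

def upload_with_retry(do_upload, max_attempts=5, base_delay=2.0, max_delay=300.0):
    """Retry a failed upload with capped exponential backoff and full jitter,
    so clients that fail at the same moment don't all retry at the same moment."""
    for attempt in range(max_attempts):
        try:
            return do_upload()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))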
For the microservices, no, they'd need to be tweaked same as the stuff here. I think the only two we really use are doctor, and disclosure extractor.
I think we can close this, then, right?
If you want to pursue these ideas, let's get issues opened in freelawproject/recap so they can be hashed out.
Opened tracking issue #338.
I think we can close this, then, right?
I think so.
Just a heads up, this deployed yesterday, so I repeated [my test from above using "Slow 3G" throttling to upload about 30MB](https://github.com/freelawproject/courtlistener/issues/2640#issuecomment-1512425599). After about four minutes, it was still going, so I disabled throttling. We're now several hours later and it's still hanging, uncompleted.
It seems like "Slow 3G" means about 52kBps, which would take about 10 minutes to do 30MB, so something went wrong.
Meanwhile, I looked at our 502 logs in the ALB and they don't seem to have changed much. The line seems flatter, but it's not like our 502s are fixed:
(Units are 502s per hour.)
Meanwhile, I looked at our 502 logs in the ALB and they don't seem to have changed much. The line seems flatter, but it's not like our 502s are fixed.
Yeah, from the looks of it there's an overall reliability improvement and no longer a hard 3 minute timeout...but still reliability issues for larger uploads on slower connections...so we must be hitting another issue then.
How does resource usage look? Is there any memory pressure that might be killing worker processes?
How does resource usage look? Is there any memory pressure that might be killing worker processes?
I'm not sure how to answer this, actually. Memory pressure on the nodes is fine. Individual pods could have memory pressure, but if they were running out, I think they'd get killed, right, and I'd see that?
Individual pods could have memory pressure, but if they were running out, I think they'd get killed, right, and I'd see that?
Hard to say, often stuff just gets respawned automatically when getting OOM killed. What's the breakdown for how memory is provisioned?
Each gunicorn pod has 1G:
memory: 1G
The nodes are at about 49% memory utilization.
Each gunicorn pod has 1G: memory: 1G
This seems like it might be a bit on the low side. Is there any tracking of workers that get OOM killed? It might be better to use fewer pods with more memory per pod.
I'm not sure, actually. I think we'd see killed pods in k9s, right? I don't see that?
I think we'd see killed pods in k9s, right?
It might just be worker processes dying and getting restarted by gunicorn, I'm thinking.
I agree with ttys0dev. gunicorn creates workers. Workers can die and gunicorn lives on. It will restart workers that die. It is one of the ways that it handles memory leaks. Just kill off a worker when it has done enough work and spawn a new worker.
Got it. That'd be in the logs for the pod. I'll see what I can find.
Huh, I don't know what the problem could be?
[2023-05-11 18:13:57 +0000] [7] [WARNING] Worker with pid 14824 was terminated due to signal 9
[2023-05-11 18:13:57 +0000] [14852] [INFO] Booting worker with pid: 14852
[2023-05-11 18:14:42 +0000] [7] [WARNING] Worker with pid 14838 was terminated due to signal 9
[2023-05-11 18:14:42 +0000] [14866] [INFO] Booting worker with pid: 14866
Received unknown event
[2023-05-11 18:15:40 +0000] [7] [WARNING] Worker with pid 14812 was terminated due to signal 9
[2023-05-11 18:15:40 +0000] [14882] [INFO] Booting worker with pid: 14882
[2023-05-11 18:16:24 +0000] [7] [WARNING] Worker with pid 14866 was terminated due to signal 9
[2023-05-11 18:16:24 +0000] [14900] [INFO] Booting worker with pid: 14900
[2023-05-11 18:16:53 +0000] [7] [WARNING] Worker with pid 14882 was terminated due to signal 9
[2023-05-11 18:16:53 +0000] [14912] [INFO] Booting worker with pid: 14912
/opt/courtlistener/cl/api/pagination.py:23: UnorderedObjectListWarning: Pagination may yield inconsistent results with an unordered object_list: <class 'cl.favorites.models.DocketTag'> QuerySet.
  paginator = self.django_paginator_class(queryset, page_size)
[2023-05-11 18:17:28 +0000] [7] [WARNING] Worker with pid 14900 was terminated due to signal 9
[2023-05-11 18:17:28 +0000] [14928] [INFO] Booting worker with pid: 14928
Received unknown event
[2023-05-11 18:17:46 +0000] [7] [WARNING] Worker with pid 14912 was terminated due to signal 9
[2023-05-11 18:17:46 +0000] [14940] [INFO] Booting worker with pid: 14940
[2023-05-11 18:18:51 +0000] [7] [WARNING] Worker with pid 14852 was terminated due to signal 9
Signal 9 is SIGKILL. SIGKILL tells a process to stop, right now, with no ability to catch it. The only thing that keeps a process alive after it receives a SIGKILL is if it is stuck waiting on IO. (IIRC, that's pretty much the only reason I ever forcibly reboot a server: a process locked up on I/O.)
Sorry, that was supposed to be a joke. @ttys0dev has a PR that reduces our thread count that I just merged. Hopefully that'll help. We'll have to monitor, I think.
Hopefully that'll help. We'll have to monitor, I think.
What do the log files look like now? Any change?
Still seeing signal 9 kills. Pretty weird. I'd expect 1GB to be enough to serve the requests. Here's the full line from ps aux on one of the workers:
/bin/sh -c gunicorn cl_wsgi:application --chdir /opt/courtlistener/docker/django/wsgi-configs/ --user www-data --group www-data --workers ${NUM_WORKERS:-48} --worker-class gthread --threads 2 --limit-request-line 6000 --max-requests 10000 --max-requests-jitter 100 --timeout 180 --bind 0.0.0.0:8000
And for reference:
% echo ${NUM_WORKERS:-48}
3
So three workers, 2 threads. I assume that means six requests at a time, which 1GB sure feels like enough to handle? One thing I haven't noticed is the number of pods going up since this change. It's nine pods in the deployment right now.
Right now we use k8s with:
# Gunicorn recommends 2-4 workers per CPU so this number is tied
# directly to the CPU requests key, below.
- name: NUM_WORKERS
  value: "3"
And:
resources:
  requests:
    cpu: 300m
  limits:
    cpu: 600m
Spread across a bunch of pod replicas.
I'm confused here; it seems like this would mean you're hard limiting workers to 600m and not 1GB. Maybe try massively increasing the allocated resources per pod, so that you have fewer pods with more resources instead of a lot of low-resource pods (which seems like it would cause the OOM killer to be overly aggressive)?
That's the CPU limit per pod. The memory limit is:
resources:
requests:
cpu: 300m
limits:
cpu: 600m
memory: 1G
Yeah, I guess the next step is to up that memory and see if it fixes it. Memory is where things get expensive, but I can put that on my list to do. Maybe just bump it to 1.5GB.
That's the CPU limit per pod.
Hmm, apparently you shouldn't be setting a CPU limit at all.
Just looking in htop, I see about 30 gunicorn processes. That's...odd, right?
Yeah, I guess the next step is to up that memory and see if it fixes it. Memory is where things get expensive, but I can put that on my list to do. Maybe just bump it to 1.5GB.
Hmm, so apparently memory limits and requests should be set to the same value.
Just looking in htop, I see about 30 gunicorn processes. That's...odd, right?
Yeah, that does look strange. Is that just one container or multiple? Maybe celery is doing something weird, forking additional background processes off of the gunicorn workers? Also, any idea why it's showing 7.30G/15.3G at the top if memory is limited to 1GB?
Just looking in htop, I see about 30 gunicorn processes. That's...odd, right?
Oh, so it looks like htop defaults to showing both threads and processes, so I think that's normal; it looks like there are only three actual gunicorn processes (the ones colored white).
Our memory limit == our memory request. I just bumped it to 1.5G. We'll see if that helps. I think htop inside a container only shows the processes in that container. I don't see celery in there, for example.
I did notice that when looking at the pods, they were all using all their memory, so that's more evidence of that issue. I'm surprised they need so much just to run gunicorn though, that's for sure. I wonder if some dependency is pulling in way too much code (we do have a lot of dependencies).
Within a few minutes of writing that, some pods were already hitting their 1.5G limit:
So I gave 'em 2G! Let's see if that helps.
Well, it's up to 96% of that. Something ain't right here. Could it be gthread eating memory somehow? Three workers with a few threads shouldn't be going nuts like this, right?
I'm surprised they need so much just to run gunicorn though, that's for sure.
Well gunicorn itself shouldn't use much memory.
I wonder if some dependency is pulling in way too much code (we do have a lot of dependencies).
Might be a memory leak I'm thinking.
Could it be gthread eating memory somehow?
I don't think it would be that; however, if there was a preexisting memory leak... there might be some impact.
Did increasing the memory reduce the worker OOM kills?
Might be a memory leak I'm thinking.
Yeah, something is wrong. I put it at 2GB per pod and all the pods are hovering around 90+% while killing workers.
I'd be pretty surprised if we have a memory leak unless it was a gthread thing. We don't use threading or mess with memory otherwise.
Oh, so it looks like htop defaults to showing both threads and processes, so I think that's normal; it looks like there are only three actual gunicorn processes (the ones colored white).
This still seems strange to me. If we have three workers with three threads each, shouldn't that show up in htop as nine things? OTOH, maybe the reason it shows 16GB of memory in htop is because it's showing all processes/threads for all pods on the node (but it's still weird because where are the celery pods, or the doctor pods?)
Anyway, something is very wrong here now. I don't know if we had the memory issue before, but we should really see if we can get on top of it. Any ideas?
I'd be pretty surprised if we have a memory leak unless it was a gthread thing. We don't use threading or mess with memory otherwise.
I think it's fairly unlikely that gthread itself has a memory leak; it's fairly widely used, and a leak there would get fixed quickly. I'm thinking we must have some dependencies using threading somewhere. We are definitely using memory, though: by that I mean our Python objects use memory and can effectively cause memory leaks if references somehow get held when they should be dropped.
This still seems strange to me. If we have three workers with three threads each, shouldn't that show up in htop as nine things?
Yeah, I think there are more threads coming from somewhere else; my guess is some library.
OTOH, maybe the reason it shows 16GB of memory in htop is because it's showing all processes/threads for all pods on the node (but it's still weird because where are the celery pods, or the doctor pods?)
Well, memory is allocated per process; threads within the process share it.
Anyway, something is very wrong here now. I don't know if we had the memory issue before, but we should really see if we can get on top of it. Any ideas?
Well, my ASGI PR may help since it does a bit of refactoring, but it's hard to say. The Django ASGI implementation should avoid threading when possible in some request codepaths (it uses a threadpool when required, rather than all the time like gthread does). But if there's a memory leak... there's a good chance it would get hit there as well.
I made an attempt at memory profiling the Django application but wasn't able to get much of anything useful from it; it's apparently quite tricky to trace Django memory leaks in general.
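For anyone else who wants to poke at this, one low-effort starting point (just the approach I'd reach for, not necessarily what was tried above) is the stdlib tracemalloc module, diffing snapshots taken before and after exercising a suspect codepath:

# Rough memory-growth check with the stdlib tracemalloc; run it inside the
# Django process, e.g. from a management command or a debug-only view.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
before = tracemalloc.take_snapshot()

# ... exercise the suspect codepath here, e.g. replay a batch of uploads ...

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # each stat shows where allocations grew between the snapshots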
Hm, well, I guess the first step is to reproduce it. On a dev machine, we use the Django webserver. I suppose we should try using gunicorn with gthread and see if we can make memory go up?
I'm thinking we must have some dependencies using threading somewhere. We are definitely using memory, though: by that I mean our Python objects use memory and can effectively cause memory leaks if references somehow get held when they should be dropped.
I'm going to regret saying this, but I can't think of anything fancy that we're using that could cause something like this.
Hm, well, I guess the first step is to reproduce it. On a dev machine, we use the Django webserver. I suppose we should try using gunicorn with gthread and see if we can make memory go up?
I mean, from what I could tell, memory goes up with the normal Django dev webserver as well, so it doesn't seem to be gthread-specific. Although the dev server apparently leaks memory somewhat intentionally, by storing stuff like the SQL query history in a buffer.
I'm going to regret saying this, but I can't think of anything fancy that we're using that could cause something like this.
Ehhhh... with a codebase of this size and complexity, I think there are a lot of potential places we could be leaking memory, even if we're not really doing anything fancy or unusual. We're moving around and serializing/deserializing/parsing a lot of reasonably large requests/files in memory, which could in theory cause massive memory leaks if there's a bug somewhere that holds an object reference when it should be dropped.
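As a concrete (entirely made-up) example of that failure mode: any module-level container that accumulates per-request objects behaves exactly like a leak, because the references never get dropped until the worker is recycled or killed:

# Illustrative only -- not actual CourtListener code.
_seen_uploads = []  # module-level, so it lives for the life of the worker process

def handle_upload(request):
    payload = request.body  # potentially many megabytes
    # Bug: stashing a reference to the raw payload "for later" means this
    # memory can never be freed, and the worker grows until it is OOM killed.
    _seen_uploads.append(payload)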
I'm thinking we must have some dependencies using threading somewhere.
There could also just be a random bug in a dependency of ours. Once #2744 is merged, I'll try to go through all our deps and see what's still on old versions with updates available, in case there's a fix.
gunicorn has a setting for only serving so many requests per worker. Maybe we could set that to something lower and see if that helps.
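For reference, that's gunicorn's max_requests setting (the current command line already passes --max-requests 10000 with some jitter); lowering it recycles leaky workers sooner, at the cost of more frequent restarts. Values below are purely illustrative:

# gunicorn.conf.py fragment -- illustrative values, not a recommendation
max_requests = 1000         # recycle each worker after ~1000 requests
max_requests_jitter = 100   # stagger restarts so workers don't all recycle at once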
Well, I upped the pods to 3GB and they seem stable without dropping workers. I still think something is off though. We're at about 9GB of memory to serve the site right now, just consumed by gunicorn. That feels like a lot. I wish I had the stats for this from before we switched to gthread.
gunicorn has a setting for only serving so many requests per worker. Maybe we could set that to something lower and see if that helps.
I think that's more of a workaround rather than a proper fix.
Well, I upped the pods to 3GB and they seem stable without dropping workers. I still think something is off though. We're at about 9GB of memory to serve the site right now, just consumed by gunicorn. That feels like a lot. I wish I had the stats for this from before we switched to gthread.
From the profiling I was able to do, the baseline application memory usage does seem to be rather high; some of the memory usage may be due to pulling in large dependencies/libraries like numpy/scipy.
Creating an issue to track the known timeout related upload failure bugs as they can be difficult to diagnose.
If anyone thinks they are affected by this bug, please add a comment, as most users are unlikely to report these sorts of transient failures. If your upload speed is not reliably above 2.22 Mbit/s, you will definitely be affected by this issue for larger uploads, so please comment here as well if that is the case for you.
Various known issues preventing reliable uploads:
Due to the lack of a retry mechanism in the recap extension, transient connection and server-side failures can also cause permanent upload failures.
There is a known gunicorn misconfiguration issue that prevents large uploads over slower connections, because gunicorn workers are killed prematurely after 3 minutes regardless of whether there is an in-progress upload. This is due to gunicorn sync workers not being suitable for production use without a caching reverse proxy in front.
Workers/upload connections occasionally appear to get killed before the 3 minute timeout; this, coupled with the lack of a retry mechanism in recap, causes the upload to fail entirely. It's less clear why this occurs, but it appears to coincide with error messages indicating the server is overloaded, so workers being killed is likely related.
There is no obvious way for users to properly identify/debug these issues as they are primarily caused by a server side issue and are often transient. The issue can be seen in the network log of the extension.
It appears that some larger zip uploads which initially upload successfully still fail to fully process before the worker gets killed (for example, only half of the files in the zip are processed successfully on the first try).
Various potential fixes:
Put a reverse proxy in front of gunicorn that caches all uploads before sending them to gunicorn; this would ensure that slow uploads don't fail due to workers being killed.
Add an upload retry on failure mechanism to the recap extension with appropriate http response status code handling.
Reconfigure gunicorn to be less prone to workers being killed while uploads are active.
Additional recommendations to make future diagnostics easier:
Add server side monitoring which tracks upload timeouts and worker timeouts so that these issues aren't missed (a rough sketch follows below).
Add client side upload failure monitoring/reporting mechanisms to the recap extension so that upload failure rates can be tracked appropriately.
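On the server-side monitoring point, one cheap option (an assumption on my part, nothing that exists today) is to hook gunicorn's lifecycle callbacks so worker timeouts and kills show up as distinct, countable log lines:

# gunicorn.conf.py server hooks -- a sketch for surfacing worker deaths
import logging

log = logging.getLogger("gunicorn.error")

def worker_abort(worker):
    # Called in the worker when it receives SIGABRT, which is what the
    # gunicorn worker timeout sends before killing it.
    log.error("WORKER_TIMEOUT pid=%s", worker.pid)

def child_exit(server, worker):
    # Called in the master whenever a worker exits, including SIGKILL/OOM kills.
    log.error("WORKER_EXIT pid=%s", worker.pid)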