Closed AetherUnbound closed 1 year ago
This would be a great issue for a new contributor to pick up, if anyone is looking for something to get started with 😀
Hi @sarayourfriend,
I came across your project and I am interested in contributing! I am new to opensource community contributions but I have 5 years of experience working as an engineer with python, C++ and django frameworks. Can I work on this issue?
From the description, what I concluded is that we want to migrate from grequests to some other multi-threaded handling of requests, possibly requests-threads? Please share any other information which you think might be needed to work on the issue.
Thanks!
@samajain Hi! Welcome to Openverse :grin: It'd be great to have you work on this issue. Your understanding of the issue and one of the potential solutions is correct.
request-threads
could be a good issue but we don't currently use Twisted, which requests-threads
depends on :thinking: I wonder if we could also look into using Django's builtin asyncio support and using something like httpx or one of the other popular async HTTP libraries. requests-threads
hasn't been updated in a few years (including not having technical support for the latest Python versions).
If you feel comfortable doing that kind of research for the trade-offs between the different options then let us know :slightly_smiling_face: Unfortunately this issue might not be as clear cut as it seems now that I've looked into it a bit more :confused: If you'd prefer working one something more straight forward then also let me know and I can help you find something else.
Thanks for your interest in the project!
I can do the research if you want me to and share my findings.
If not this issue then I can take up something else too 🙂 Since you have been part of this project, I leave it to your call.
I'll dig up for something from the good first issues label.
We'd be more than happy to let you take this one on! If you're comfortable gathering the research and posting it here, we can work together to decide which library/path forward might be best :slightly_smiling_face:
Sure! I can do that. What kind of tradeoff aspects are you looking for? Shall I compare them based on their performance metrics or are you looking for a comparison on anything specific other than this?
I'll let @sarayourfriend chime in with any other thoughts, but I think my greatest concern would be 1) what is the maintenance burden we incur with whatever dependency we go with and 2) how much restructuring of the codebase do we need to do in order to properly integrate the library. For instance, if we do go with an async-based library, how much will we need to change to make the application support async
calls. Performance metrics would be useful but secondary I think to these factors :smile:
I'm less concerned with the maintenance burden initially than with the performance impact of using async in Django 4. The Django docs mention that using async views (which is fully supported and does not require any restructuring, apparently :thinking:) has some performance trade offs. It doesn't explain what those are, so it'd be nice to understand them. We could move the Django app to ASGI and run it with uvicorn instead of gunicorn, but that my gut tells me that's probably a bigger move than we need to solve this particular problem. It would also need to involve larger infrastructure changes that we don't necessarily have capacity for at the moment. To quote the Django docs:
Async views will still work under WSGI, but with performance penalties, and without the ability to have efficient long-running requests.
The only place we're using grequests right now is basically to fire off a bunch of HTTP requests in parallel, so the long-running requests issue isn't important. However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.
My preference would be to use the built in Django async support as it's in line with mainline Python support (and I think is generally where things are headed, mostly away from Twisted, etc). But if the performance impact is bad, then we'll necessarily need to suffer the maintenance burden of introducing a "bolted on" asynchronous library for now.
All of that to say, I'd recommend looking into what the trade offs are with using Django's built in asyncio view support under WSGI first, then moving onto seeing what other options exist if the penalties for partial async in Django WSGI app are bad.
However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.
That's a great point, thanks for mentioning it! And I'm glad that it doesn't seem like async support in Django will be too difficult from a code perspective.
@AetherUnbound @sarayourfriend Thank you for your inputs. I will do my research and get back soon!
@samajain Did you get a chance to look into this?
Given Django's recent release of 4.2 adding async to the ORM, I think it is a good time to start planning a switch over to the async Django API. I don't think this will be a trivial change to do correctly and I don't know if it can be done in stages. Would it be detrimental, for example, to delay switching to async Redis and Elasticsearch clients and Django ORM to happen after switching to async Django generally to be able to use an async http library to replace our usage of grequests?
I was reading more about Django's async support. It looks like we might be able to relatively easily get off of grequests by using Django's in-depth async_to_sync
function to run an async
version of the validate_images
function.
@WordPress/openverse-maintainers I think we could prioritize this issue higher if someone wanted to try out async_to_sync
with a regular asyncio HTTP Python library like httpx or similar.
I am all for this!
Hii @AetherUnbound I also want to contribute to fix this issue
@Prakharkarsh1 please go ahead! Feel free to comment here if you need any pointers getting started.
I took a peek at this today and moving to ASGI for our application would be mostly trivial (I think). The bulk of the work would be to remove grequests and replacing it with something like httpx or aiohttp (plus maybe AnyIO to make "grouping" the requests easier, if the libraries themselves do not natively support it).
Updating unit tests would probably take up most of the time spent on this. Due to grequests nature, the tests do a fair bit of unusual mocking to try to stay out of the way as much as possible, but it's very specific to grequests. Perhaps adding an abstraction layer here would help to simplify the tests in general.
Here is a small diff that moves the app to ASGI:
diff --git a/api/catalog/asgi.py b/api/catalog/asgi.py
new file mode 100644
index 00000000..6b3169f4
--- /dev/null
+++ b/api/catalog/asgi.py
@@ -0,0 +1,7 @@
+import os
+
+from django.core.asgi import get_asgi_application
+
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
+
+application = get_asgi_application()
diff --git a/api/catalog/settings.py b/api/catalog/settings.py
index 268ebb62..192feca2 100644
--- a/api/catalog/settings.py
+++ b/api/catalog/settings.py
@@ -236,7 +236,7 @@ TEMPLATES = [
},
]
-WSGI_APPLICATION = "catalog.wsgi.application"
+ASGI_APPLICATION = "catalog.aisgi.application"
# Database
# https://docs.djangoproject.com/en/2.0/ref/settings/#databases
diff --git a/api/catalog/wsgi.py b/api/catalog/wsgi.py
deleted file mode 100644
index f8458d06..00000000
--- a/api/catalog/wsgi.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""
-WSGI config for catalog project.
-
-It exposes the WSGI callable as a module-level variable named ``application``.
-
-For more information on this file, see
-https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/
-"""
-from gevent import monkey
-
-
-monkey.patch_all()
-import os
-
-from django.core.wsgi import get_wsgi_application
-
-from wsgi_basic_auth import BasicAuth
-
-
-os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
-
-application = get_wsgi_application()
-application = BasicAuth(application)
I've done no performance checking on this, but adding async
to view routes and switching to the async base classes (if we need to do that?) seems like it would be the only bit left to make it work fully async. ORM methods might need to be changed as well...
It doesn't look like we're using any middleware that causes Django to have to adapt each request. I checked this by setting the log level to debug and checking for the django.logger
logs about middleware adapting.
@sarayourfriend would switching to ASGI mean that we need to change the server from Gunicorn to something else?
Yes, we would switch the production server to hypercorn, uvcorn, etc. I've used hypercorn in the past and liked it, it has almost the same api as gunicorn.
It completely slipped by me that we would need DRF to also support async views for migrating to async Django to be possible. DRF still does not support async and it seems like it might not natively support it for a while (though a package could do the job).
The two alternatives listed by grequests (and mentioned in the issue description) are actually less recently maintained than grequests itself.
We could try just using Django's async_to_sync
but I think we would pay a performance penalty from spinning up an event loop.
The Sentry issue continues to record events, so this problem does still exist today.
As I see it, there are two options:
async_to_sync
and measure whether there is a difference in performance in production. Revert if there is an unacceptable performance difference or a regression otherwise.My preference is the second as (a) we'll want to move to async Django in the future anyway and (b) if we do the shimming right, it should be possible to swap it out for whatever more developed async DRF package comes out of the PR or we can just switch to async DRF when it is officially supported (if ever).
I'm going to tinker a bit with the second option as I am working on this mostly as a curiosity rather than it being a priority. I hope that is okay with folks. If anything I'm sure I'll learn something useful about our app and async generally in the process that will be applicable to a future async conversion (if one doesn't come out of this practice).
I do not think it is worth the fuss to do number 2 right now. I've tried for a bit to get the second option working and it is going to take quite a bit of work from us to convert things to async with DRF.
Trying async_to_sync
is probably the best approach here. We will want to implement some timings for the API first, however. Adding timeit around the validate_images
function and logging the timings would be a good first step here.
Drafted a PR for this long-standing issue today using aiohttp and async_to_sync. It was relatively painless (though the test rewrite I expect to be a struggle): https://github.com/WordPress/openverse-api/pull/1027
I was just re-reading my suggestion above and regarding the timeits, I don't think it's necessary... we can keep an eye on the existing request timing measurements we have and as long as those are stable this change should be safe, right? If we added more timing logs we wouldn't be in a much better place regarding identifying the root cause of a strange issue. We'll probably want to make sure that PR doesn't get released with another significant change though.
Problem
The API currently uses the
grequests
library for handling some of thehead
requests it completes when returning results (specifically invalidate_images.py
).Description
The README for the project says:
The project has also not had a release since 2020. We might want to consider using one of those proposed alternatives.
Alternatives
Leave as is and hope for the best!
Additional context
Some more context in Sentry: https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/
Implementation