WordPress / openverse-api

The Openverse API allows programmatic access to search for CC-licensed and public domain digital media.
https://api.openverse.engineering/v1
MIT License
77 stars 51 forks source link

Move off of grequests #486

Closed AetherUnbound closed 1 year ago

AetherUnbound commented 2 years ago

Problem

The API currently uses the grequests library for handling some of the head requests it completes when returning results (specifically in validate_images.py).

Description

The README for the project says:

Note: You should probably use requests-threads or requests-futures instead.

The project has also not had a release since 2020. We might want to consider using one of those proposed alternatives.

Alternatives

Leave as is and hope for the best!

Additional context

Some more context in Sentry: https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/

Implementation

sarayourfriend commented 2 years ago

This would be a great issue for a new contributor to pick up, if anyone is looking for something to get started with 😀

samajain commented 2 years ago

Hi @sarayourfriend,

I came across your project and I am interested in contributing! I am new to opensource community contributions but I have 5 years of experience working as an engineer with python, C++ and django frameworks. Can I work on this issue?

From the description, what I concluded is that we want to migrate from grequests to some other multi-threaded handling of requests, possibly requests-threads? Please share any other information which you think might be needed to work on the issue.

Thanks!

sarayourfriend commented 2 years ago

@samajain Hi! Welcome to Openverse :grin: It'd be great to have you work on this issue. Your understanding of the issue and one of the potential solutions is correct.

request-threads could be a good issue but we don't currently use Twisted, which requests-threads depends on :thinking: I wonder if we could also look into using Django's builtin asyncio support and using something like httpx or one of the other popular async HTTP libraries. requests-threads hasn't been updated in a few years (including not having technical support for the latest Python versions).

If you feel comfortable doing that kind of research for the trade-offs between the different options then let us know :slightly_smiling_face: Unfortunately this issue might not be as clear cut as it seems now that I've looked into it a bit more :confused: If you'd prefer working one something more straight forward then also let me know and I can help you find something else.

Thanks for your interest in the project!

samajain commented 2 years ago

I can do the research if you want me to and share my findings.

If not this issue then I can take up something else too 🙂 Since you have been part of this project, I leave it to your call.

I'll dig up for something from the good first issues label.

AetherUnbound commented 2 years ago

We'd be more than happy to let you take this one on! If you're comfortable gathering the research and posting it here, we can work together to decide which library/path forward might be best :slightly_smiling_face:

samajain commented 2 years ago

Sure! I can do that. What kind of tradeoff aspects are you looking for? Shall I compare them based on their performance metrics or are you looking for a comparison on anything specific other than this?

AetherUnbound commented 2 years ago

I'll let @sarayourfriend chime in with any other thoughts, but I think my greatest concern would be 1) what is the maintenance burden we incur with whatever dependency we go with and 2) how much restructuring of the codebase do we need to do in order to properly integrate the library. For instance, if we do go with an async-based library, how much will we need to change to make the application support async calls. Performance metrics would be useful but secondary I think to these factors :smile:

sarayourfriend commented 2 years ago

I'm less concerned with the maintenance burden initially than with the performance impact of using async in Django 4. The Django docs mention that using async views (which is fully supported and does not require any restructuring, apparently :thinking:) has some performance trade offs. It doesn't explain what those are, so it'd be nice to understand them. We could move the Django app to ASGI and run it with uvicorn instead of gunicorn, but that my gut tells me that's probably a bigger move than we need to solve this particular problem. It would also need to involve larger infrastructure changes that we don't necessarily have capacity for at the moment. To quote the Django docs:

Async views will still work under WSGI, but with performance penalties, and without the ability to have efficient long-running requests.

The only place we're using grequests right now is basically to fire off a bunch of HTTP requests in parallel, so the long-running requests issue isn't important. However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

My preference would be to use the built in Django async support as it's in line with mainline Python support (and I think is generally where things are headed, mostly away from Twisted, etc). But if the performance impact is bad, then we'll necessarily need to suffer the maintenance burden of introducing a "bolted on" asynchronous library for now.

All of that to say, I'd recommend looking into what the trade offs are with using Django's built in asyncio view support under WSGI first, then moving onto seeing what other options exist if the penalties for partial async in Django WSGI app are bad.

AetherUnbound commented 2 years ago

However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

That's a great point, thanks for mentioning it! And I'm glad that it doesn't seem like async support in Django will be too difficult from a code perspective.

samajain commented 2 years ago

@AetherUnbound @sarayourfriend Thank you for your inputs. I will do my research and get back soon!

sarayourfriend commented 2 years ago

@samajain Did you get a chance to look into this?

Given Django's recent release of 4.2 adding async to the ORM, I think it is a good time to start planning a switch over to the async Django API. I don't think this will be a trivial change to do correctly and I don't know if it can be done in stages. Would it be detrimental, for example, to delay switching to async Redis and Elasticsearch clients and Django ORM to happen after switching to async Django generally to be able to use an async http library to replace our usage of grequests?

sarayourfriend commented 2 years ago

I was reading more about Django's async support. It looks like we might be able to relatively easily get off of grequests by using Django's in-depth async_to_sync function to run an async version of the validate_images function.

@WordPress/openverse-maintainers I think we could prioritize this issue higher if someone wanted to try out async_to_sync with a regular asyncio HTTP Python library like httpx or similar.

AetherUnbound commented 2 years ago

I am all for this!

Prakharkarsh1 commented 2 years ago

Hii @AetherUnbound I also want to contribute to fix this issue

dhruvkb commented 2 years ago

@Prakharkarsh1 please go ahead! Feel free to comment here if you need any pointers getting started.

sarayourfriend commented 2 years ago

I took a peek at this today and moving to ASGI for our application would be mostly trivial (I think). The bulk of the work would be to remove grequests and replacing it with something like httpx or aiohttp (plus maybe AnyIO to make "grouping" the requests easier, if the libraries themselves do not natively support it).

Updating unit tests would probably take up most of the time spent on this. Due to grequests nature, the tests do a fair bit of unusual mocking to try to stay out of the way as much as possible, but it's very specific to grequests. Perhaps adding an abstraction layer here would help to simplify the tests in general.

sarayourfriend commented 2 years ago

Here is a small diff that moves the app to ASGI:

diff --git a/api/catalog/asgi.py b/api/catalog/asgi.py
new file mode 100644
index 00000000..6b3169f4
--- /dev/null
+++ b/api/catalog/asgi.py
@@ -0,0 +1,7 @@
+import os
+
+from django.core.asgi import get_asgi_application
+
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
+
+application = get_asgi_application()
diff --git a/api/catalog/settings.py b/api/catalog/settings.py
index 268ebb62..192feca2 100644
--- a/api/catalog/settings.py
+++ b/api/catalog/settings.py
@@ -236,7 +236,7 @@ TEMPLATES = [
     },
 ]

-WSGI_APPLICATION = "catalog.wsgi.application"
+ASGI_APPLICATION = "catalog.aisgi.application"

 # Database
 # https://docs.djangoproject.com/en/2.0/ref/settings/#databases
diff --git a/api/catalog/wsgi.py b/api/catalog/wsgi.py
deleted file mode 100644
index f8458d06..00000000
--- a/api/catalog/wsgi.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""
-WSGI config for catalog project.
-
-It exposes the WSGI callable as a module-level variable named ``application``.
-
-For more information on this file, see
-https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/
-"""
-from gevent import monkey
-
-
-monkey.patch_all()
-import os
-
-from django.core.wsgi import get_wsgi_application
-
-from wsgi_basic_auth import BasicAuth
-
-
-os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
-
-application = get_wsgi_application()
-application = BasicAuth(application)

I've done no performance checking on this, but adding async to view routes and switching to the async base classes (if we need to do that?) seems like it would be the only bit left to make it work fully async. ORM methods might need to be changed as well...

It doesn't look like we're using any middleware that causes Django to have to adapt each request. I checked this by setting the log level to debug and checking for the django.logger logs about middleware adapting.

dhruvkb commented 2 years ago

@sarayourfriend would switching to ASGI mean that we need to change the server from Gunicorn to something else?

sarayourfriend commented 2 years ago

Yes, we would switch the production server to hypercorn, uvcorn, etc. I've used hypercorn in the past and liked it, it has almost the same api as gunicorn.

sarayourfriend commented 2 years ago

It completely slipped by me that we would need DRF to also support async views for migrating to async Django to be possible. DRF still does not support async and it seems like it might not natively support it for a while (though a package could do the job).

The two alternatives listed by grequests (and mentioned in the issue description) are actually less recently maintained than grequests itself.

We could try just using Django's async_to_sync but I think we would pay a performance penalty from spinning up an event loop.

The Sentry issue continues to record events, so this problem does still exist today.

As I see it, there are two options:

  1. Try async_to_sync and measure whether there is a difference in performance in production. Revert if there is an unacceptable performance difference or a regression otherwise.
  2. Pull the async DRF implementation out of the PR and write our own async shim as recommended in the PR itself and dive head first into full async Django.

My preference is the second as (a) we'll want to move to async Django in the future anyway and (b) if we do the shimming right, it should be possible to swap it out for whatever more developed async DRF package comes out of the PR or we can just switch to async DRF when it is officially supported (if ever).

I'm going to tinker a bit with the second option as I am working on this mostly as a curiosity rather than it being a priority. I hope that is okay with folks. If anything I'm sure I'll learn something useful about our app and async generally in the process that will be applicable to a future async conversion (if one doesn't come out of this practice).

sarayourfriend commented 1 year ago

I do not think it is worth the fuss to do number 2 right now. I've tried for a bit to get the second option working and it is going to take quite a bit of work from us to convert things to async with DRF.

Trying async_to_sync is probably the best approach here. We will want to implement some timings for the API first, however. Adding timeit around the validate_images function and logging the timings would be a good first step here.

sarayourfriend commented 1 year ago

Drafted a PR for this long-standing issue today using aiohttp and async_to_sync. It was relatively painless (though the test rewrite I expect to be a struggle): https://github.com/WordPress/openverse-api/pull/1027

sarayourfriend commented 1 year ago

I was just re-reading my suggestion above and regarding the timeits, I don't think it's necessary... we can keep an eye on the existing request timing measurements we have and as long as those are stable this change should be safe, right? If we added more timing logs we wouldn't be in a much better place regarding identifying the root cause of a strange issue. We'll probably want to make sure that PR doesn't get released with another significant change though.