WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License
240 stars 194 forks source link

Remove client application headers from API responses #4569

Closed sarayourfriend closed 2 months ago

sarayourfriend commented 3 months ago

Problem

https://github.com/WordPress/openverse/blob/2cb8b5eb11f0abb39f79697ae2b8ddc065b7e0f7/api/api/middleware/response_headers_middleware.py#L22-L24

The code linked above adds the client application name and verification status to response headers. This is useful when we scan Nginx request logs. However, it also makes responses that would not otherwise be sensitive to the authorization header suddenly sensitive.

Thumbnails, for example, can be universally cached at the edge without issue. However, when an authorization header is present, the cache (correctly) will bypass and send the request upstream. That is only correct and necessary behaviour due to our inclusion of authorization sensitive materials in responses.

Description

If we remove these headers from responses (and instead log them to our structured API logging), we can use more aggressive caching for thumbnails that does not vary on the authorization header.

krysal commented 3 months ago

@sarayourfriend To confirm my understanding of the desired solution, will transforming the response_headers_middleware middleware into a response_logging_middleware and replacing the added headers with the following logging do the job?

logger.info(
    "application_response_verification_status",
    application_id=application.id,
    application_name=application.name,
    application_verified=application.verified,
)

Or is there a better place for logging this info? I tried extending request log metadata as the django-structlog documentation suggest but I couldn't get the application's information in conf/settings/signals.py. I thought it could be the placing of the middleware in the base config list but it doesn't seem to be that.

# File: api/conf/settings/signals.py

@receiver(signals.bind_extra_request_metadata)
def bind_application_id(request, logger, **kwargs):
    if not (hasattr(request, "auth") and hasattr(request.auth, "application")):
        return

    # This code is never reached ❌
    application = request.auth.application
    structlog.contextvars.bind_contextvars(application_id=application.id)
sarayourfriend commented 3 months ago

Your overall understanding is correct :+1: I don't think we need to use the middleware, though.

I haven't looked into the implementation of that signal, but I'm going to guess it must happen before authentication occurs, if request.auth isn't available? But it's weird that the documentation says user_id is included, which would seem to imply it happens after authentication. Maybe there's an issue with how we've ordered the middlewares?

If that approach with the signals isn't possible, then I think it would be fine to add logging somewhere around here, in the authentication class itself: https://github.com/WordPress/openverse/blob/main/api/conf/oauth2_extensions.py#L30-L32

I don't think we necessarily need to add these properties to every log line, which it looks like extending the request metadata would do. Instead, we can have a specific log event client_application_authentication (or whatever makes sense) that logs these pieces of information. Then, if we need to identify the client app for a request, we can query for that event where request_id=<the request id of the other request/event we're investigating>.