divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0
53 stars 15 forks source link

Upload requests getting responses with 404/no status set #2515

Closed divergentdave closed 5 months ago

divergentdave commented 10 months ago

In production, we are seeing occasional requests in the sum(rate(http_server_request_duration_seconds_count{http_response_status_code="404",http_route="/tasks/:task_id/reports"}[2m])) time series, at approximately one per minute. This has held steady across our entire metrics lookback period. These correlate perfectly with sum(rate(janus_aggregator_responses_total{error_code="unknown",route="/tasks/:task_id/reports"}[2m])). I don't think we explicitly set a status code of 404/Status::NOT_FOUND anywhere, so I think this indicates these connections have no status set by the time that StatusCounter::before_send() runs. Since the route is set, this must be somewhere within instrumented(api(upload::<C>)). I saw no log messages corresponding to these requests. It seems likely this is triggered by some aspect of the messiness of real internet traffic, (our unit tests check status codes) so I think the next diagnostic step is to run a portion of production traffic through a pod with particular trillium log targets turned up, so we can see what handler execution events precede this.

divergentdave commented 5 months ago

I reproduced this behavior in a test harness using trillium-api 0.2.0-rc.11. In fact, if I crank the logging level up to show trace events, I see a "HTTP/1.1 404 Not Found" response is sent out in this situation.

We use the FromConn implementation on Vec<u8> to grab the entire request body. This implementation returns None if reading the request body fails, which can be triggered by the client disconnecting too early before having sent the entire body. Since FromConn produces None, then the API handler function is not called, and the library internals provide an error handler of (), then run it, which leaves the connection unmodified. Control flow bubbles back up to trillium_opentelemetry::metrics::Metrics, which sees no status is set, and records an observation with status code 404. Finally, trillium-http likewise sees no status or body is set, and sends a 404 response.

The root cause is the FromConn implementation swallowing an I/O error. The metrics could be fixed if the error was propagated out, either by smuggling it through state to a before_send() handler method, or better yet if the FromConn implementation for Vec<u8> was replaced with a TryFromConn implementation, which allows returning a separate error handler if a value cannot be extracted. (TryFromConn was introduced later in the 0.2 branch's development) For what it's worth, panicking inside a handler prevents any metrics update from being recorded, but this is probably too noisy.

We could also replace Vec<u8> with a newtype in the API handler function signatures and make the same changes ourselves, so we can write a custom error handler that integrates nicely with janus_aggregator::aggregator::Error, which sets the error_code label on a different (counter) metric.

divergentdave commented 5 months ago

This leaves the question of what status code an error handler could set. Unlike JSON deserialization errors, this is a bit more metaphysical, as we don't expect the response status code to actually reach the client. Trillium's data model only allows the status code to be a real, actual status code with some sort of HTTP semantics. I don't think it would be worth it to change that invariant in order to represent lower level errors here, but I think I've seen other software use a value of "0" in place of a status code in such situations. I think "400 Bad Request" would make sense here, as the server has not received a complete, valid HTTP message.