divergentdave closed this issue 5 months ago
I reproduced this behavior in a test harness using trillium-api 0.2.0-rc.11. In fact, if I crank the logging level up to show trace events, I see an "HTTP/1.1 404 Not Found" response being sent out in this situation.
We use the `FromConn` implementation on `Vec<u8>` to grab the entire request body. This implementation returns `None` if reading the request body fails, which can happen when the client disconnects before having sent the entire body. Since `FromConn` produces `None`, the API handler function is not called; instead, the library internals provide an error handler of `()` and run it, which leaves the connection unmodified. Control flow bubbles back up to `trillium_opentelemetry::metrics::Metrics`, which sees that no status is set and records an observation with status code 404. Finally, `trillium-http` likewise sees that no status or body is set, and sends a 404 response.
The root cause is the `FromConn` implementation swallowing an I/O error. The metrics could be fixed if the error were propagated out, either by smuggling it through state to a `before_send()` handler method or, better yet, by replacing the `FromConn` implementation for `Vec<u8>` with a `TryFromConn` implementation, which allows returning a separate error handler when a value cannot be extracted. (`TryFromConn` was introduced later in the 0.2 branch's development.) For what it's worth, panicking inside a handler prevents any metrics update from being recorded, but that would probably be too noisy.
We could also replace `Vec<u8>` with a newtype in the API handler function signatures and make the same changes ourselves, so that we can write a custom error handler that integrates nicely with `janus_aggregator::aggregator::Error`, which sets the `error_code` label on a different (counter) metric.
This leaves the question of what status code an error handler could set. Unlike JSON deserialization errors, this is a bit more metaphysical, as we don't expect the response status code to actually reach the client. Trillium's data model only allows the status code to be a real, actual status code with some sort of HTTP semantics. I don't think it would be worth it to change that invariant in order to represent lower level errors here, but I think I've seen other software use a value of "0" in place of a status code in such situations. I think "400 Bad Request" would make sense here, as the server has not received a complete, valid HTTP message.
In production, we are seeing occasional requests in the `sum(rate(http_server_request_duration_seconds_count{http_response_status_code="404",http_route="/tasks/:task_id/reports"}[2m]))` time series, at approximately one per minute. This rate has held steady across our entire metrics lookback period. These correlate perfectly with `sum(rate(janus_aggregator_responses_total{error_code="unknown",route="/tasks/:task_id/reports"}[2m]))`. I don't think we explicitly set a status code of 404/`Status::NOT_FOUND` anywhere, so I think this indicates these connections have no status set by the time that `StatusCounter::before_send()` runs. Since the route is set, this must be happening somewhere within `instrumented(api(upload::<C>))`. I saw no log messages corresponding to these requests. It seems likely this is triggered by some aspect of the messiness of real internet traffic (our unit tests check status codes), so I think the next diagnostic step is to run a portion of production traffic through a pod with particular trillium log targets turned up, so we can see what handler execution events precede this.