epimorphics / hmlr-linked-data

A central coordination point for all of the HMLR open linked-data applications

PPD query timeout not being handled correctly #74

Closed ijdickinson closed 2 years ago

ijdickinson commented 2 years ago

The query town = Plymouth & limit = all on PPD produces results on production, but a 502 error on hmlr-dev-pres.

There are a number of issues here:

The user sees:

*(Screenshot from 2022-01-10 13-29-14)*

We may want to treat this as a coordinating issue, and add specific tickets for elements of the investigation & resolution. Hence I'm adding a whole collection of people as assignees.

ijdickinson commented 2 years ago

Discovered during QA testing #64

der commented 2 years ago

Ignoring the kibana configuration issues for now and just looking at raw logs to diagnose the timeout ...

tl;dr: it seems to be either puma or the front-end proxy; we need the ppd logs working to get more info.

A similar-looking query completed on fuseki at 12:04:24, so I assume that's the one. It took 92s. Our current production config limits are 90s to first result and 120s to full result, so this would work on production but it's pushing the limit. We have the same limits in dev.
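For reference, limits like these are typically expressed in a Fuseki assembler config as an `arq:queryTimeout` context value, in the form "first result, overall" in milliseconds. A hedged sketch (I haven't checked our actual config; the service name is illustrative):

```turtle
# Sketch only: how a 90s / 120s limit pair would look in a Fuseki config.
# arq:queryTimeout takes "timeout-to-first-result,overall-timeout" in ms.
<#service> rdf:type fuseki:Service ;
    fuseki:name "ppd" ;
    ja:context [ ja:cxtName  "arq:queryTimeout" ;
                 ja:cxtValue "90000,120000" ] .
```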

The current dev dataserver is a smaller machine than full production. We will scale up to the full machine sizes when we are ready to do performance testing, but not until everything else is ready - they are just too expensive to keep around for long. So it's possible that 92s would have been less on a production-scale machine. Nevertheless, this time the fuseki query did return 200 OK, so it wasn't fuseki timing out.

There's a corresponding proxy log entry on the dev data server which shows that it responded with a 200.

The dev data load balancer is configured with an idle timeout of 120s, which should work, and the cloudwatch graphs for the data LB don't show 5xx errors at that time, though those metrics are suspect (see below). So I believe the data chain from load balancer to data proxy to fuseki and back all worked.
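The idle timeout can be double-checked from the AWS CLI rather than the console. A sketch, with the LB ARN left as a placeholder:

```sh
# Read the idle timeout attribute of the dev data load balancer
# (<data-lb-arn> is a placeholder, not the real ARN)
aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn <data-lb-arn> \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
    --output text
```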

This is confirmed by the data-api logs, which show the query finishing at 12:04:24 and returning 132955 rows, having taken 92732ms. So it looks like the API-to-fuseki round trip worked.

The ppd app isn't generating any logs at all so I can't see what was happening between the ppd app and the data-api.

The pres proxy logs do show it returning a 502, which looks like roughly a 60s timeout.

So either the connection between the front proxy and puma is timing out or puma itself is dropping that connection after 60s. Without any ppd logs it's hard to tell.

However, the cloudwatch graphs for the hmlr-dev-pres don't show any HTTP or ELB 5XX errors either which is mysterious and unhelpful. Was the test definitely through the hmlr-dev-pres load balancer and not direct to the pres server?

ijdickinson commented 2 years ago

Was the test definitely through the hmlr-dev-pres load balancer and not direct to the pres server?

I was using http://hmlr-dev-pres.epimorphics.net as the domain, so afaik that's to the LB not directly to the server

der commented 2 years ago

Looking more closely at the apache2 httpd config in pres, I think the issue is that we are using a rewrite rule for the proxy to the ppd/puma app, because we have to rewrite the URL to remove the base /app/ppd. If it weren't for that, we could use a ProxyPass directive, which supports configuration of timeouts, whereas rewrite rules don't offer that option.
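A minimal sketch of the two forms, for comparison (paths and backend port are assumptions, not our actual config). Worth noting that ProxyPass also maps the matched prefix onto the backend root, so it may cover the base-path removal too:

```apache
# Current style: proxying via a mod_rewrite [P] rule -- no per-proxy
# timeout parameter is available on the rule itself
RewriteRule "^/app/ppd/(.*)$" "http://127.0.0.1:3000/$1" [P]

# ProxyPass alternative: strips the /app/ppd prefix when mapping to the
# backend, and accepts an explicit timeout= parameter (seconds)
ProxyPass        "/app/ppd/" "http://127.0.0.1:3000/" timeout=150
ProxyPassReverse "/app/ppd/" "http://127.0.0.1:3000/"
```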

If there's no way to change the base URL in the rails apps, then we may have to configure the proxy timeout globally instead.
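(On the base URL: Rails does have a relative-root mechanism, `config.relative_url_root` / the `RAILS_RELATIVE_URL_ROOT` environment variable, which might remove the need for the rewrite; I haven't checked whether the ppd app supports it.) If we do go global, mod_proxy's ProxyTimeout can be set at server or vhost level, falling back to the core Timeout directive otherwise. A sketch, value illustrative:

```apache
# Vhost- or server-wide fallback: applies to all proxied requests where
# no per-ProxyPass timeout is set
ProxyTimeout 150
```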

der commented 2 years ago

I was using http://hmlr-dev-pres.epimorphics.net as the domain, so afaik that's to the LB not directly to the server

Thanks for confirming. Looks like the cloudwatch metric in the AWS console missed it but the grafana cloudwatch dashboard in our own monitoring did catch it (and confirms the backend proxy was clean).

der commented 2 years ago

Changed TimeOut to 150s for the data and pres vhosts; the user query still times out after 60s.

der commented 2 years ago

Correction: the timeout changes hadn't been applied properly. Now that they are applied, this query no longer times out.
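For the record, the apply-and-verify loop that catches this kind of "change not applied" slip (a sketch; the app path and query-string mapping are placeholders, not the app's real routes):

```sh
# Validate the config, reload apache, then re-time the failing query
apachectl configtest
sudo systemctl reload apache2
time curl -s -o /dev/null -w '%{http_code}\n' \
  'http://hmlr-dev-pres.epimorphics.net/app/ppd/...'
```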