GCTC-NTGC / gc-digital-talent

GC Digital Talent is the new recruitment platform for digital and tech jobs in the Government of Canada.
https://talent.canada.ca
GNU Affero General Public License v3.0

Explore Backend Telemetry/Analysis #5442

Closed petertgiles closed 1 year ago

petertgiles commented 1 year ago

:question: Spike questions:

Possible approaches:

Timebox: 2 days

petertgiles commented 1 year ago

PostgreSQL

Nginx

Laravel

Azure AppInsights

Conclusion

petertgiles commented 1 year ago

I was able to install the Nginx Amplify agent in my local Nginx Docker container. I had to run `apt-get update --allow-releaseinfo-change`, which felt a little "cowboy" for a production server, but otherwise it was quick and easy. Looks like I have a bit more configuration to do, but I'm already getting system stats.

petertgiles commented 1 year ago

I tried adding

        // Registered in a service provider's boot() method
        DB::listen(function ($query) {
            Log::info(print_r([
                'sql' => $query->sql,
                'bindings' => $query->bindings,
                'time' => $query->time, // milliseconds
            ], true));
        });

and then ran our Cypress suite for some quick-and-dirty analysis.

Our slowest query was a 14s select count(*) as aggregate from "users" where... query with 85 (!!!) bind parameters. Second place at 13s was a simple select * from "users" where "sub" = ? and "users"."deleted_at" is null limit 1 query. I'm not sure why that one was slow. Maybe we should be adding indexes to the deleted_at columns?
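If missing indexes are the culprit, a migration along these lines could help. This is only a sketch: the composite index on `sub` and `deleted_at` targets the slow lookup above, but whether it actually helps would need an `EXPLAIN` to confirm.

```php
<?php
// Hypothetical migration: index the columns used by the slow
// `where "sub" = ? and "deleted_at" is null` lookup.
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('users', function (Blueprint $table) {
            // Composite index covering both filter columns
            $table->index(['sub', 'deleted_at']);
        });
    }

    public function down(): void
    {
        Schema::table('users', function (Blueprint $table) {
            $table->dropIndex(['sub', 'deleted_at']);
        });
    }
};
```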

I could see us writing queries above some time limit to a database in production for analysis. Maybe SQLite, or there's something we could spin up in Azure. I'm not sure how to log the GraphQL query that produced the SQL, though, which is pretty important.
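A rough sketch of that threshold idea, building on the `DB::listen` hook above; the 500 ms cutoff and the `slow_queries` log channel are invented for illustration:

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// e.g. in AppServiceProvider::boot(); threshold value is hypothetical
DB::listen(function ($query) {
    if ($query->time > 500) { // milliseconds
        // A dedicated channel could write to a file, a table, or Azure
        Log::channel('slow_queries')->warning('Slow SQL', [
            'sql' => $query->sql,
            'bindings' => $query->bindings,
            'time_ms' => $query->time,
        ]);
    }
});
```

Note this still only sees the SQL, not the GraphQL operation that produced it.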

petertgiles commented 1 year ago

There's a `whenQueryingForLongerThan` method in Laravel, too (monitoring-cumulative-query-time).
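Per the Laravel docs, that hook fires once per request when the cumulative query time crosses the threshold. A sketch of registering it, with the 500 ms threshold and log message invented:

```php
use Illuminate\Database\Connection;
use Illuminate\Database\Events\QueryExecuted;
use Illuminate\Support\Facades\DB;

// e.g. in AppServiceProvider::boot(); threshold is in milliseconds
DB::whenQueryingForLongerThan(500, function (Connection $connection, QueryExecuted $event) {
    logger()->warning('Cumulative query time exceeded 500 ms', [
        'connection' => $connection->getName(),
        'last_sql' => $event->sql,
    ]);
});
```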

An alternative approach that would capture the GraphQL queries is to use a middleware in Lighthouse, like this `AuditQueryMiddleware`:

    public function handle(Request $request, Closure $next)
    {
        $start = hrtime(true);
        $response = $next($request);
        $end = hrtime(true);

        $this->logger->info(
            'Slow query',
            [
                'request' => $request->json()->all(),
                'elapsed_time' => ($end - $start) / 1000000000, // ns to s
            ]
        );

        return $response;
    }

This doesn't capture what's happening upstream in Laravel or Nginx, but it does capture the exact GraphQL query, including the OperationName, which should make it much easier to identify the context of a slow request.
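Wiring such a middleware onto the GraphQL endpoint might look like this; a sketch assuming Lighthouse's `route.middleware` config option and the `AuditQueryMiddleware` class named above:

```php
// config/lighthouse.php (excerpt): attach the middleware to the
// GraphQL route so every query passes through it.
return [
    'route' => [
        'uri' => '/graphql',
        'middleware' => [
            \App\Http\Middleware\AuditQueryMiddleware::class,
        ],
    ],
    // ...other Lighthouse options unchanged
];
```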

petertgiles commented 1 year ago

While reading through Laravel optimization sites, I noticed some recommendations to remove unused middleware and service providers. Since we only use a small fraction of what Laravel is usually used for, there is probably some room for improvement there for us. https://geekflare.com/laravel-optimization/
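Concretely, that would mean pruning the `providers` array in `config/app.php`. A sketch only; which providers are actually safe to drop for us would need testing:

```php
// config/app.php (excerpt): an API-only app may not need every
// default provider; dropping unused ones trims boot time slightly.
'providers' => [
    Illuminate\Auth\AuthServiceProvider::class,
    Illuminate\Database\DatabaseServiceProvider::class,
    // Illuminate\Session\SessionServiceProvider::class, // unused in a stateless API?
    // Illuminate\View\ViewServiceProvider::class,       // no server-rendered views?
    App\Providers\AppServiceProvider::class,
],
```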

petertgiles commented 1 year ago

Some comments from IMTD about monitoring and Amplify:

brindasasi commented 1 year ago

Agree about AppInsights. We have analytics for the client side only; server-side measurement would be great.

brindasasi commented 1 year ago

> There's a whenQueryingForLongerThan function in Laravel, too. [...] An alternative approach that would capture the GraphQL queries is to use a middleware in Lighthouse, like the AuditQueryMiddleware. [...] This doesn't capture what's happening upstream in Laravel or Nginx but does capture the exact GraphQL query including the OperationName, which should make it much easier to identify the context of a slow request.

This can be used for some sort of alert: when it fires, that's the breaking point, and it tells us to go look at the query. It's only a partial solution, though.

I like this one, and this approach sounds familiar to me. Every query has its timings. I'm unsure about the drawbacks you mentioned, though:

    DB::listen(function ($query) {
        Log::info(print_r([
            'sql' => $query->sql,
            'bindings' => $query->bindings,
            'time' => $query->time
        ], true));
    });

Does Amplify give GraphQL request info as well?

petertgiles commented 1 year ago

> Does Amplify give GraphQL request info as well?

Not that I've seen, no. It only reports aggregated metrics like number of errors, number of slow requests, etc.: https://docs.nginx.com/nginx-amplify/metrics-metadata/

brindasasi commented 1 year ago

I think we could use both Laravel logging and Amplify, as they serve different purposes.

petertgiles commented 1 year ago

Set Up Nginx Amplify #6833
Set up Laravel slow query monitoring with request text #6834