GCTC-NTGC / gc-digital-talent

GC Digital Talent is the new recruitment platform for digital and tech jobs in the Government of Canada.
https://talent.canada.ca
GNU Affero General Public License v3.0

Explore Backend Telemetry/Analysis #5442

Closed petertgiles closed 1 year ago

petertgiles commented 1 year ago

:question: Spike questions:

Possible approaches:

Timebox: 2 days

petertgiles commented 1 year ago

PostgreSQL

Nginx

Laravel

Azure AppInsights

Conclusion

petertgiles commented 1 year ago

I was able to install the Nginx Amplify agent in my local Nginx Docker container. I had to run `apt-get update --allow-releaseinfo-change`, which felt a little "cowboy" for a production server, but otherwise it was quick and easy. Looks like I have a bit more configuration to do, but I'm already getting system stats.

petertgiles commented 1 year ago

I tried adding

        // Registered in a service provider's boot() method
        DB::listen(function ($query) {
            Log::info(print_r([
                'sql' => $query->sql,
                'bindings' => $query->bindings,
                'time' => $query->time, // milliseconds
            ], true));
        });

and then ran our Cypress suite for some quick-and-dirty analysis.

Our slowest query was a 14s select count(*) as aggregate from "users" where... query with 85 (!!!) bind parameters. Second place at 13s was a simple select * from "users" where "sub" = ? and "users"."deleted_at" is null limit 1 query. I'm not sure why that one was slow. Maybe we should be adding indexes to the deleted_at columns?
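If missing indexes are the culprit, a migration along these lines could help. This is only a sketch: the composite index on `sub` and `deleted_at` targets the slow lookup above, but whether it actually helps would need an `EXPLAIN` to confirm.

```php
<?php
// Hypothetical migration: index the columns used by the slow
// `where "sub" = ? and "deleted_at" is null` lookup.
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('users', function (Blueprint $table) {
            // Composite index covering both filter columns
            $table->index(['sub', 'deleted_at']);
        });
    }

    public function down(): void
    {
        Schema::table('users', function (Blueprint $table) {
            $table->dropIndex(['sub', 'deleted_at']);
        });
    }
};
```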

I could see us writing queries above some time limit to a database in production for analysis. Maybe SQLite, or there's something we could spin up in Azure. I'm not sure how to log the GraphQL query that produced the SQL, though, which is pretty important.
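A rough sketch of that threshold idea, building on the `DB::listen` hook above; the 500 ms cutoff and the `slow_queries` log channel are invented for illustration:

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// e.g. in AppServiceProvider::boot(); threshold value is hypothetical
DB::listen(function ($query) {
    if ($query->time > 500) { // milliseconds
        // A dedicated channel could write to a file, a table, or Azure
        Log::channel('slow_queries')->warning('Slow SQL', [
            'sql' => $query->sql,
            'bindings' => $query->bindings,
            'time_ms' => $query->time,
        ]);
    }
});
```

Note this still only sees the SQL, not the GraphQL operation that produced it.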

petertgiles commented 1 year ago

There's a `whenQueryingForLongerThan` method in Laravel, too (monitoring-cumulative-query-time).
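Per the Laravel docs, that hook fires once per request when the cumulative query time crosses the threshold. A sketch of registering it, with the 500 ms threshold and log message invented:

```php
use Illuminate\Database\Connection;
use Illuminate\Database\Events\QueryExecuted;
use Illuminate\Support\Facades\DB;

// e.g. in AppServiceProvider::boot(); threshold is in milliseconds
DB::whenQueryingForLongerThan(500, function (Connection $connection, QueryExecuted $event) {
    logger()->warning('Cumulative query time exceeded 500 ms', [
        'connection' => $connection->getName(),
        'last_sql' => $event->sql,
    ]);
});
```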

An alternative approach that would capture the GraphQL queries is to use a middleware in Lighthouse, like this `AuditQueryMiddleware`:

    public function handle(Request $request, Closure $next)
    {
        $start = hrtime(true);
        $response = $next($request);
        $end = hrtime(true);

        $this->logger->info(
            'Slow query',
            [
                'request' => $request->json()->all(),
                'elapsed_time' => ($end - $start) / 1000000000, // ns to s
            ]
        );

        return $response;
    }

This doesn't capture what's happening upstream in Laravel or Nginx, but it does capture the exact GraphQL query, including the OperationName, which should make it much easier to identify the context of a slow request.
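Wiring such a middleware onto the GraphQL endpoint might look like this; a sketch assuming Lighthouse's `route.middleware` config option and the `AuditQueryMiddleware` class named above:

```php
// config/lighthouse.php (excerpt): attach the middleware to the
// GraphQL route so every query passes through it.
return [
    'route' => [
        'uri' => '/graphql',
        'middleware' => [
            \App\Http\Middleware\AuditQueryMiddleware::class,
        ],
    ],
    // ...other Lighthouse options unchanged
];
```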

petertgiles commented 1 year ago

While reading through Laravel optimization sites, I noticed some recommendations to remove unused middleware and service providers. Since we only use a small fraction of what Laravel is usually used for, there is probably some room for improvement there for us. https://geekflare.com/laravel-optimization/
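Concretely, that would mean pruning the `providers` array in `config/app.php`. A sketch only; which providers are actually safe to drop for us would need testing:

```php
// config/app.php (excerpt): an API-only app may not need every
// default provider; dropping unused ones trims boot time slightly.
'providers' => [
    Illuminate\Auth\AuthServiceProvider::class,
    Illuminate\Database\DatabaseServiceProvider::class,
    // Illuminate\Session\SessionServiceProvider::class, // unused in a stateless API?
    // Illuminate\View\ViewServiceProvider::class,       // no server-rendered views?
    App\Providers\AppServiceProvider::class,
],
```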

petertgiles commented 1 year ago

Some comments from IMTD about monitoring and Amplify:

brindasasi commented 1 year ago

Agree about AppInsights. We have analytics for the client side only; server-side measurement would be great.

brindasasi commented 1 year ago

> There's a whenQueryingForLongerThan function in Laravel, too. [...] An alternative approach that would capture the GraphQL queries is to use a middleware in Lighthouse, like the AuditQueryMiddleware. [...] This doesn't capture what's happening upstream in Laravel or Nginx but does capture the exact GraphQL query including the OperationName, which should make it much easier to identify the context of a slow request.

This can be used for some sort of alert: when it fires, that's the breaking point, and it tells us to go look at the query. It's only a partial solution, though.

I like this one, and this approach sounds familiar to me. Every query has its timings. I'm unsure about the drawbacks you mentioned, though:

    DB::listen(function ($query) {
        Log::info(print_r([
            'sql' => $query->sql,
            'bindings' => $query->bindings,
            'time' => $query->time
        ], true));
    });

Does Amplify give GraphQL request info as well?

petertgiles commented 1 year ago

> Does Amplify give GraphQL request info as well?

Not that I've seen, no. It only reports aggregated metrics like number of errors, number of slow requests, etc.: https://docs.nginx.com/nginx-amplify/metrics-metadata/

brindasasi commented 1 year ago

I think we could use both Laravel logging and Amplify, as they serve different purposes.

petertgiles commented 1 year ago

Set Up Nginx Amplify #6833
Set up Laravel slow query monitoring with request text #6834