Redshift Compatibility - pg_stat_activity

adamprescott commented 2 years ago

Hi, Currently looking at using this library to also connect to Redshift from Lambda, Redshift is wire compatible with Postgres though a few of the system tables have slightly different and missing columns compared to what this library expects, AWS also add other tables which can be used to infer idle sessions.

If no one else is considering this, or working on it, I'm happy to add a PR for this feature/enhancement.

Mainly pg_stat_activity doesn't have all the columns this library expects and there are different ways to identify idle sessions in Redshift.

MatteoGioioso commented 2 years ago

Hello, interesting feature, but it may require a bit of refactoring first. I never used Redshift, could you please show me what is your plan so I can have an idea (what queries are needed, what needs to be modified, ect...) ? Thanks

adamprescott commented 2 years ago

Sure here are a few things I've identified:

Redshift Clusters have a hard 500 connections limit (despite what's in pg_database_info) - though there are mitigations around this via other tooling.
Can infer if we're connecting to a Redshift DB by running select version() and seeing if .includes('Redshift')
pg_stat_activity looks slightly different in Redshift, with columns datid, datname, procpid, usesysid, usename, current_query, query_start current_query and query_start aren't usually populated unless configured on. We're also missing application_name
pg_terminate_backend is a valid means of killing idle sessions
Redshit has various other tables containing connection, session, and query information

Example query for identifying idle connections, the main issue with this query is it can take 7 secs~:

with sessions as (SELECT starttime, process, u.usesysid, user_name
        FROM stv_sessions s, pg_user u
        WHERE s.user_name = u.usename
        AND u.usesysid>1
        AND process NOT IN (
            SELECT pid FROM stv_inflight WHERE userid>1
            UNION SELECT pid FROM stv_recents WHERE status != 'Done' AND userid>1
        )), queries as (
            SELECT userid, pid, max(endtime) AS last_end FROM svl_statementtext
            WHERE userid>1 AND sequence=0
            GROUP BY 1,2)

SELECT
    sessions.process AS pid,
    trim(sessions.user_name) AS user_name,
    sessions.starttime AS session_start_time,
    datediff(s, sessions.starttime, sysdate) AS session_duration_is_seconds,
    queries.last_end AS last_query_end_time,
    datediff(s, CASE WHEN queries.last_end IS NOT NULL
        THEN queries.last_end
        ELSE sessions.starttime END
    , sysdate) AS idle_duration_in_seconds
FROM sessions
        LEFT OUTER JOIN queries
            ON sessions.usesysid = queries.userid AND sessions.process = queries.pid
        WHERE (queries.last_end > sessions.starttime OR queries.last_end IS NULL)
        ORDER BY idle_duration_in_seconds DESC;

Implementation

I think there are two viable options:

Fork the repo as a new serverless-redshift project and make the changes needed to the queries
Rework the Strategy into a plugin based approach initially covering Postgres and Redshift but could be later extended by others to include, Google Cloud Spanner, CockroachDB, YugabyteDB, etc...

MatteoGioioso commented 2 years ago

Ok, I think we could proceed with the option 2. I will try to assess the viability of abstracting the queries, if it is too complex or tedious we can proceed with the first option.

Can infer if we're connecting to a Redshift DB by running select version() and seeing if .includes('Redshift')

I guess for this we could just add another option on the config, for example compatibility: 'Redshift'.

MatteoGioioso commented 2 years ago

@adamprescott I have made a draft PR, please take a look at it: https://github.com/MatteoGioioso/serverless-pg/pull/81

This is the proposed interface:

interface Plugin {
  getIdleProcessesListByMinimumTimeout(self: ServerlessClient): Promise<NodePgClientResponse<ProcessList>>;

  getIdleProcessesListOrderByDate(self: ServerlessClient): Promise<NodePgClientResponse<ProcessList>>;

  processCount(self: ServerlessClient): Promise<NodePgClientResponse<Count>>;

  killProcesses(self: ServerlessClient, pids: string[]): Promise<NodePgClientResponse<any>>;

  showMaxConnections(self: ServerlessClient): Promise<NodePgClientResponse<MaxConnections>>;
}

Which you could use it like this:

 const client = new ServerlessClient({
      user: "redshift",
      host: "localhost",
      database: "redshift",
      password: "redshift",
      port: 5439
      plugin: new RedshiftServerlessPlugin(someObject);
    });

this way you have complete freedom in what parameters you would like to give to your query. You can also tap into any configuration values using the self parameter.

Example function:

  async getIdleProcessesListByMinimumTimeout(serverlessPgSelf) {
    const query = `
        WITH processes AS (SELECT EXTRACT(EPOCH FROM (Now() - state_change)) AS idle_time,
                                  pid
                           FROM pg_stat_activity
                           WHERE usename = $1
                             AND datname = $2
                             AND state = 'idle'
                             AND application_name = $5)
        SELECT pid
        FROM processes
        WHERE idle_time > $3
            LIMIT $4;`;

    const values = [
      serverlessPgSelf._client.user,
      serverlessPgSelf._client.database,
      serverlessPgSelf._strategy.minConnIdleTimeSec,
      serverlessPgSelf._strategy.maxIdleConnectionsToKill,
      serverlessPgSelf._application_name
    ];

    return serverlessPgSelf._client.query(query, values);
  }

Let me know what you think, Thanks!

EDITED

MatteoGioioso / serverless-pg

Redshift Compatibility - pg_stat_activity #79

Implementation