drsnyder / poky

PostgreSQL key value store
MIT License
13 stars 1 forks source link

Implement sproc via stored procedure #19

Closed loganlinn closed 11 years ago

loganlinn commented 11 years ago

Instead of generating the query JVM-side, use a stored procedure for the mget query.

Primary motivation for these changes is to use a IN/ANY super-select for the query for performance reasons. There were concerns about complexity of generating queries, so this generates a stored procedure call instead (an arguable trade-off). A benefit of having this as a sproc is it can be used from different clients as well as better access for load testing.

Timestamped queries aren't working quite right yet (where parameter is Timestamp object). I am assuming this is JDBC type inference issue.

drsnyder commented 11 years ago

What about something like this:

CREATE OR REPLACE FUNCTION mget(_values poky[]) RETURNS SETOF poky AS
$$
DECLARE
BEGIN

  RETURN QUERY
  SELECT poky.*
  FROM poky
  JOIN (
    SELECT * FROM UNNEST(_values)
  ) AS in_poky ON (
    poky.bucket = in_poky.bucket AND
    poky.key = in_poky.key AND
    date_trunc('seconds', poky.modified_at) = date_trunc('seconds', in_poky.modified_at)
  );

END;
$$ LANGUAGE PLPGSQL;

It can then be queried using:

SELECT * FROM mget(ARRAY[('bucket', 'key100', NULL, NULL, '2013-05-18 16:43:23.361331+00')::poky]);

It has the odd NULL syntax on input but the body of the function is declarative SQL. We may have to iterate on the structure of the query to make sure the planner is doing what we expect but I think doing so will be very straight forward.

loganlinn commented 11 years ago

Yeah, I was looking into doing something of this direction w/ the ROW constructor, but didn't get too far. Do you know if this has an perf advantage over using EXECUTE? I can see how it would...

loganlinn commented 11 years ago

SELECT * FROM mget(ARRAY[('bucket', 'key100', NULL, NULL, '2013-05-18 16:43:23.361331+00')::poky]);

Not crazy that this apparoach requires knowledge of the column order in schema

drsnyder commented 11 years ago

Not crazy that this apparoach requires knowledge of the column order in schema

We could probably get around that by constructing an array of poky records given an array of (bucket, key, modified_at) tuples.

drsnyder commented 11 years ago

Something like this would work:

CREATE TYPE mget_row AS ( bucket text, key text, modified_at timestamptz );

CREATE OR REPLACE FUNCTION mget(_values mget_row[]) RETURNS SETOF poky AS
...
;
SELECT * FROM mget(ARRAY[('bucket', 'key100', '2013-05-18 16:43:23.361331+00')::mget_row]);
loganlinn commented 11 years ago

Are you of the mindset that knowledge modified_at should be required in order to use this interface? Although it fits our current use-cases, I think it significantly hinders its utility.

drsnyder commented 11 years ago

Are you of the mindset that knowledge modified_at should be required in order to use this interface? Although it fits our current use-cases, I think it significantly hinders its utility.

We could accept a NULL for the modified_at value. In which case any modified_at will do. What do you think about that approach?

loganlinn commented 11 years ago

Sounds good. Would you want to implement, or should I?

drsnyder commented 11 years ago

Hrm. I did some testing and for some reason, the stored procedure is ~20x slower than the query. Investigating.

drsnyder commented 11 years ago

Ok, writing the function as SQL is a lot more performant. Here is the version I currently have:

CREATE OR REPLACE FUNCTION mget(_bucket text, _values mget_param_row[]) RETURNS TABLE
(
        bucket varchar(256),
        key    varchar(1024),
        data   text,
        created_at timestamptz,
        modified_at timestamptz
) AS
$$
        WITH mget_poky AS (
                SELECT * FROM UNNEST(_values)
        ),
        poky AS (
                SELECT *
                FROM poky
                WHERE bucket = _bucket AND
                key IN (SELECT key FROM mget_poky)
        )
        SELECT poky.bucket, poky.key, poky.data, poky.created_at, poky.modified_at
        FROM poky
        JOIN mget_poky ON (
                mget_poky.key = poky.key AND
                (( mget_poky.modified_at IS NULL) OR date_trunc('seconds', poky.modified_at) = date_trunc('seconds', mget_poky.modified_at) )
        );
$$ LANGUAGE SQL STABLE CALLED ON NULL INPUT;

It's about 20x faster than the plpgsql version (over an iteration of 1000).

drsnyder commented 11 years ago

I can take a stab at adding it. I also want to try and productize the partitioning.

loganlinn commented 11 years ago

Thanks for the SQL sproc... I like that much better