Only first statement in multi-statement is traced

hlinnaka commented 2 months ago

Steps to reproduce the issue

$ psql postgres -c "select count(*) from pg_am; select count(*) from pg_class"
 count 
-------
     7
(1 row)

 count 
-------
   420
(1 row)

$ psql postgres -c "select trace_id, span_id, span_type, span_operation from pg_tracing_consume_spans"
             trace_id             |     span_id      |  span_type   |       span_operation        
----------------------------------+------------------+--------------+-----------------------------
 266759934ee85fe1100bf7a07bd43421 | 1b40b22bc5da4fa8 | Planner      | Planner
 266759934ee85fe1100bf7a07bd43421 | 88167138ea5945e5 | Executor     | ExecutorRun
 266759934ee85fe1100bf7a07bd43421 | d10fe19e7702ecd0 | SeqScan      | SeqScan on pg_am
 266759934ee85fe1100bf7a07bd43421 | 85e37a824bf3962b | Aggregate    | Aggregate
 266759934ee85fe1100bf7a07bd43421 | 7ffdd4d4338ba069 | Select query | select count(*) from pg_am;
(5 rows)

There are no spans for the second query, select count(*) from pg_class.

hlinnaka commented 2 months ago

I bumped into this while I was reading the code and trying to understand how the spans are created and closed, how the nesting levels work etc. It all feels pretty complicated. I'm not sure what exactly we should do here, but I think the current mapping to spans is not great, and even if we fix this case, we'll have a lot of problems like this.

Some problems (aside from this particular issue with multi-statement) that I've identified so far:

There is no span for parse-analyze phase. It's usually quick, but in some cases it might not be
issues #10 and #13
There is no span active when the transaction is committed. The time to flush the WAL to disk can be significant, even more so if you use synchronous replication.
How to attach a trace parent if you use prepared statements? You can have an SQL comment on the PREPARE and EXECUTE statement, but I don't think that will do the right thing. Also, with extended query protocol, there is no query text with the Bind+Execute messages, so there's nowhere to put the special SQL comment.

I'm not sure what exactly to do here, but I think the nesting level tracking and how the parse, plan, execute hook function are mapped to tracing context and spans needs some more thinking.

bonnefoa commented 1 month ago

I bumped into this while I was reading the code and trying to understand how the spans are created and closed, how the nesting levels work etc. It all feels pretty complicated.

Agreed, I'm not satisfied with the current complexity and still trying to simplify this but haven't found a good solution yet. AFAIK, there's no easy entry point where to start a span. Most queries will go through the post_parse hook first but a prepared statement won't and may start with the planner hook (directly ExecutorStart hook if it's a cached plan). Transaction block + extended protocol makes things worse since the parsing of the next statement happens while the previous statement is still ongoing (PortalDrop happens when processing the Bind message).

Though for the problem you're reporting, I wasn't able to reproduce it. I do see both statements being traced. What was the pg_tracing.sample_rate used and on which Postgres version?

psql -c "select count(*) from pg_am; select count(*) from pg_class"

psql postgres -c "select trace_id, span_id, span_type, span_operation from pg_tracing_consume_spans"
             trace_id             |     span_id      |   span_type   |                          span_operation
----------------------------------+------------------+---------------+-------------------------------------------------------------------
 b759fe638dc7e1d85357ec926ae7732c | c524f9b23fe8fe6c | Planner       | Planner
 b759fe638dc7e1d85357ec926ae7732c | e5e10e789d9e3f5a | Executor      | ExecutorRun
 b759fe638dc7e1d85357ec926ae7732c | 810690be67920433 | SeqScan       | SeqScan on pg_am
 b759fe638dc7e1d85357ec926ae7732c | 5f98c17515191298 | Aggregate     | Aggregate
 b759fe638dc7e1d85357ec926ae7732c | c4fe67771f18ec07 | Select query  | select count(*) from pg_am;
 b759fe638dc7e1d85357ec926ae7732c | 11c9351b49b872e2 | Planner       | Planner
 b759fe638dc7e1d85357ec926ae7732c | 4ada72970aee8e0f | Executor      | ExecutorRun
 b759fe638dc7e1d85357ec926ae7732c | cf47a615cc790b52 | IndexOnlyScan | IndexOnlyScan using pg_class_tblspc_relfilenode_index on pg_class
 b759fe638dc7e1d85357ec926ae7732c | 34ba1c4550f29542 | Aggregate     | Aggregate
 b759fe638dc7e1d85357ec926ae7732c | d76f6eaf42b900a3 | Select query  | select count(*) from pg_am; select count(*) from pg_class

There is no span for parse-analyze phase. It's usually quick, but in some cases it might not be

I used to generate a span for parse-analyze but it was mostly a guess by assuming that the time between stmtStartTimestamp and post_parse was spent for parse-analyze. I eventually dropped it as it added a lot of complexity and wasn't reliable. A better solution would be to have a parse-analyze hook which also could be used by other extensions like pg_stat_statements.

There is no span active when the transaction is committed. The time to flush the WAL to disk can be significant, even more so if you use synchronous replication.

Good point, I didn't realise commit happened after portal was dropped. That might be doable by using xact callbacks and creating a span between XACT_EVENT_PRE_COMMIT and XACT_EVENT_COMMIT.

How to attach a trace parent if you use prepared statements? You can have an SQL comment on the PREPARE and EXECUTE statement, but I don't think that will do the right thing.

Prepared statements executed through EXECUTE and PREPARE look ok. Tracing PREPARE will show the ProcessUtility call while tracing EXECUTE will show the prepare utility then the nested query.

/*dddbs='postgres.db',traceparent='00-00000000000000000000000000000001-0000000000000001-01'*/ PREPARE test_prepared_one_param (integer) AS SELECT * FROM pgbench_accounts where aid=\$1;
/*dddbs='postgres.db',traceparent='00-00000000000000000000000000000002-0000000000000002-01'*/ EXECUTE test_prepared_one_param(200);

WITH oldest_start AS (
    SELECT min(span_start) as min_start
    FROM pg_tracing_peek_spans
)
select
    extract(MICROSECONDS FROM age(span_start, oldest_start.min_start)) as us_start,
    extract(MICROSECONDS FROM age(span_end, oldest_start.min_start)) as us_end,
    lvl, span_type, span_operation
    FROM peek_ordered_spans, oldest_start;

 us_start | us_end | lvl |   span_type    |                                      span_operation
----------+--------+-----+----------------+-------------------------------------------------------------------------------------------
        0 |    317 |   1 | Utility query  | PREPARE test_prepared_one_param (integer) AS SELECT * FROM pgbench_accounts where aid=$1;
        9 |    317 |   2 | ProcessUtility | ProcessUtility
      277 |    317 |   3 | Select query   | PREPARE test_prepared_one_param (integer) AS SELECT * FROM pgbench_accounts where aid=$1;
      534 |   1021 |   1 | Utility query  | EXECUTE test_prepared_one_param(200);
      541 |   1021 |   2 | ProcessUtility | ProcessUtility
      565 |   1017 |   3 | Select query   | PREPARE test_prepared_one_param (integer) AS SELECT * FROM pgbench_accounts where aid=$1;
      565 |    943 |   4 | Planner        | Planner
      967 |    986 |   4 | Executor       | ExecutorRun
      969 |    985 |   5 | IndexScan      | IndexScan using pgbench_accounts_pkey on pgbench_accounts

Also, with extended query protocol, there is no query text with the Bind+Execute messages, so there's nowhere to put the special SQL comment.

That's definitely been a pain point. Early versions relied on a hack where you could pass SQLCommenter as a parameter

PREPARE test_prepared (text, integer) AS /*$1*/ SELECT $2;
EXECUTE test_prepared('dddbs=''postgres.db'',traceparent=''00-00000000000000000000000000000009-0000000000000009-01''', 1);

It worked but it definitely wasn't a great experience so I dropped it. A better alternative would be to propagate the tracecontext through GUC, either explicitly or with Jelte's proposition.

DataDog / pg_tracing

Only first statement in multi-statement is traced #12

Steps to reproduce the issue