Clarify confusing PL/pgSQL error message

citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement

https://github.com/citusdata/citus

GNU Lesser General Public License v3.0

1.06k stars 63 forks source link

Clarify confusing PL/pgSQL error message #121

Closed jasonmp85 closed 9 years ago

jasonmp85 commented 9 years ago

When a user attempts to interact with a pg_shard-distributed table using bare SQL in PL/pgSQL, they receive a confusing error message of unrecognized node type: 2100. We've had several users raise this issue and it's certainly possible additional customers just think PL/pgSQL support is broken entirely and haven't said anything.

The underlying issue is that PL/pgSQL caches plans for all static SQL appearing within functions. pg_shard uses a "custom" plan type (just an integer tag well above those used by PostgreSQL) so the caching methods fail. In the normal course of things we intercept e.g. PREPARE statements to keep users from caching plans, but PL/pgSQL bypasses that path and uses SPI's caching machinery directly.

I looked briefly into cramming the distributed plan into some other plan that can be cached and even thought about using queryId as some sort of index into a place we can actually pass along the plan during query processing, but ultimately those approaches leave a lot to be desired and felt all-around hacky.

After that I tried using a transaction callback to replace the confusing error message, but it's unsafe to call CopyErrorData if no error has occurred, so an explicit ROLLBACK statement could trigger my callback even when no error has occurred.

Finally, I learned of emit_log_hook, which would make it trivial to replace the error message, except it only executes if a message is destined for the log. If—for whatever reason—a system is configured to not send ERROR to the log destination, the hook is not invoked, leaving the misleading error in place.

So I landed upon using a combination of error_context_stack (actually a chain of handlers called during error handling) and PLpgSQL_plugin (intended to facilitate PL/pgSQL debugger and profiler development). The handler simply detects the error code and message of this misleading case and augments the error information with hints, details, and a better code and message. This solution has the benefit that it does not (yet) change our execution/caching paths and replaces all instances of the offending message rather cleanly (or so I think).

jasonmp85 commented 9 years ago

@anarazel — since I discussed this with you two weeks ago I wanted to assign this review to you. Turns out you aren't a member of our org yet (oops), so I sent you an invite.

anarazel commented 9 years ago

Thanks for the invite...

I find that approach pretty darn ugly - but you're right that it's a relatively non-invasive bandaid. I think there's a bunch of places that use slightly different messages, but I don't think any of those are in likely paths.

WRT the concrete patch, I see some issues:

This will fail horribly in a recursive plpgsql function call. As you use a single global variable for the error context stack that variable will be in two places in the stack, with a borked .previous in the outer instance.
You're not unregistering the error context. That happens to more or less work out ok because plpgsql resets the error context stack to the last value it knows, but that seems awfully fragile. Because of the above you'll need to do dynamic memory allocations anyway, so you could just unregister things in the func_end callback.
PluginFuncs should probably be const.
You're theoretically not supposed to do anything but setup additional context information in a context callback. This approach will probably not work correctly for GetErrorContextStack() callers (i.e. plpgsql itself, if you use GET STACKED DIAGNOSTICS to get the error message). That's probably enough of an edge case not to worry about for now.

I guess a revision of this will do for now.

jasonmp85 commented 9 years ago

This will fail horribly in a recursive plpgsql function call. As you use a single global variable for the error context stack that variable will be in two places in the stack, with a borked .previous in the outer instance.

Gah. I had realized this when thinking about the problem but forgot to actually deal with it in the change. I think I realized it might be sufficient to peek the stack and just not do anything if the top is already my handler, but I'll have to check. I'll address it either way.

You're not unregistering the error context. That happens to more or less work out ok because plpgsql resets the error context stack to the last value it knows, but that seems awfully fragile. Because of the above you'll need to do dynamic memory allocations anyway, so you could just unregister things in the func_end callback.

Yeah, I had inspected every site the plugin is called in 9.3/9.4 and realized it will pop the stack back for me (I have a comment to this effect), but you're right: if I have to do allocations anyways, I might as well pop the stack myself.

PluginFuncs should probably be const.

Sure.

You're theoretically not supposed to do anything but setup additional context information in a context callback.

Yeah, that was clear from the comments describing what this stack is for, but from what I could tell there weren't too many ill effects from violating that assumption.

This approach will probably not work correctly for GetErrorContextStack() callers (i.e. plpgsql itself, if you use GET STACKED DIAGNOSTICS to get the error message). That's probably enough of an edge case not to worry about for now.

You're talking about if I catch an exception in PL/pgSQL and then call GET STACKED DIAGNOSTICS in the handler, right? I'll test it to see.

jasonmp85 commented 9 years ago

Turns out PluginFuncs can't be const: the error_callback and assign_expr fields are set during runtime by plpgsql_estate_setup.

jasonmp85 commented 9 years ago

@anarazel I've addressed your feedback. Can you give this another look and tell me whether it's good for a :shipit:?

In particular:

I now dynamically allocate an ErrorContextCallback during function entrance. I am doing so in TopTransactionContext because I wasn't sure whether just calling palloc directly was OK in whatever the calling context is. Is this overcautious? At any rate it better handles functions calling other functions, so that's an improvement.
I unregister the context at function exit. This is a blind pop (i.e. if someone resets the stack to something lower than the element I added, then I'll pop that instead of returning the stack to the state it was in before my push), but every place these plugin functions are called can blindly pop safely (i.e. the caller pops after calling my function exit handler, so it's OK).
PluginFuncs cannot be const as the PL/pgSQL machinery sets some of its fields before execution

I don't actually see any ill effects of messing with non-CONTEXT fields in this handler, even when using GET STACKED DIAGNOSTICS. Perhaps I misunderstood what you mean, but consider the following (foo is a pg_shard-distributed table):

DO $sharded_sql$
  BEGIN
      DECLARE
          msg_text text;
      BEGIN
          SELECT COUNT(*) FROM foo;
      EXCEPTION
      WHEN feature_not_supported THEN 
          GET STACKED DIAGNOSTICS msg_text = MESSAGE_TEXT;
          RAISE NOTICE '%', msg_text;
      END;
  END
$sharded_sql$;
# NOTICE:  00000: cannot cache distributed plan
# LOCATION:  exec_stmt_raise, pl_exec.c:3074

As you can see the replaced message works just fine.

Is this ready to ship?

anarazel commented 9 years ago

I now dynamically allocate an ErrorContextCallback during function entrance. I am doing so in TopTransactionContext because I wasn't sure whether just calling palloc directly was OK in whatever the calling context is. Is this overcautious? At any rate it better handles functions calling other functions, so that's an improvement.

Hm. I think that should be fine.

I don't actually see any ill effects of messing with non-CONTEXT fields in this handler, even when using GET STACKED DIAGNOSTICS. Perhaps I misunderstood what you mean, but consider the following (foo is a pg_shard-distributed table):
DO $sharded_sql$
  BEGIN
      DECLARE
          msg_text text;
      BEGIN
          SELECT COUNT(*) FROM foo;
      EXCEPTION
      WHEN feature_not_supported THEN 
          GET STACKED DIAGNOSTICS msg_text = MESSAGE_TEXT;
          RAISE NOTICE '%', msg_text;
      END;
  END
$sharded_sql$;
# NOTICE:  00000: cannot cache distributed plan
# LOCATION:  exec_stmt_raise, pl_exec.c:3074

Good. I just wasn't sure it'd work. If you feal overzealous you could add the above as a regression test ;)

Hm. One more comment:

/* get current error data */
errorContext = MemoryContextSwitchTo(TopTransactionContext);
errorData = CopyErrorData();
MemoryContextSwitchTo(errorContext); +
/* if it's an error about a distributed plan, clean up the fields */
if (strcmp("unrecognized node type: 2100", errorData->message) == 0)
{
errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
errmsg("cannot cache distributed plan");
errdetail("PL/pgSQL's statement caching is unsupported by pg_shard.");
errhint("Bypass caching by using the EXECUTE keyword instead.");
}

The error context is special because it has 'memory reserves' in case of out of memory errors and such. So it'd be nicer if we didn't have to switch to a different context during error processing. But I guess we don't really have a choice in this specific case because there's no way to get error message otherwise, and out of memory errors will take a different path (due to the sqlstate check) anyway.

So this is probably the best we can do.

So yes, looks ready to me.

jasonmp85 commented 9 years ago

@anarazel, I found a "fun" bug that this change introduces:

If the user doesn't have the pg_shard extensions preloaded, it gets loaded lazily the first time a C method is referenced. Another piece of code introduced an INSERT trigger that calls a C function in pg_shard. When an INSERT to that table is made, the following happens:

The (PL/pgSQL-based) trigger is invoked
The "on function entry" callback isn't called as the extension isn't loaded
A C-based function is called by the trigger, lazily loading the extension
When the trigger exits, the "on function exit" callback is invoked (the extension is now loaded)
The backend crashes when the "on function exit" calls pfree, since the matching "on function entry" call was never made

Obviously this is something I'll look out for in the future, but it was surprising behavior. It would be nice to have some way of saying "this extension must be preloaded" and exiting out (maybe detecting whether _PG_init is being called during an open transaction? I haven't explored). Just figured I'd drop a mention in here since it took a bit to figure out why the error occurred (it showed up in Travis, and only then when running CitusDB, not for PostgreSQL 9.3 or 9.4…)

anarazel commented 9 years ago

What do you exactly mean with "preloaded"? Shared_preload_libraries? If so, that's relatively straightforward to detect. Just add something like the follwing to _PG_init():

if (!process_shared_preload_libraries_in_progress)
    ereport(ERROR,
                (errcode(ERRCODE_CONFIG_FILE_ERROR),
                              errmsg("XXX can only be loaded via shared_preload_libraries")));