Clarify the distributed functions

I realized that several users are confused by create_distributed_function and its purpose. I think the main confusion comes from this: create_distributed_table makes things faster, so let's try create_distributed_function to make functions faster. Also, they refer to this blog and probably skimming through it and conclude it makes things faster.

It'd be really nice if we can improve the documentation on this topic so that users do not try to distribute functions unnecessarily / with wrong expectations.

(below copy & paste from an internal e-mail)

What are the typical use cases?

There 3 two main use cases for distributed functions, and as far as I can tell, the function you try to distribute does not fall into any of the categories.

1) The primary use case for distributed functions with distribution_arg_name and colocate_with parameters:

It is mostly useful if you need to scale OLTP workloads that are written as stored procedures.

Say, you have the following stored procedure where all the commands are scoped to per tenant/distribution key:

CREATE FUNCTION procedure_to_pushdown (int dist_key, …)
BEGIN;
       INSERT INTO dist_table (dist_key, …) .. VALUES ($1_;
      UPDATE dist_table … WHERE dist_key = $1;
     DELETE FROM collocated_dist_table WHERE WHERE dist_key = $1;
…
COMMIT;
end function;

If you do not create distributed functions, each statement will become 1 round-trip to the worker node from the coordinator.

Instead, if you convert this function to a distributed function by providing relevant parameters, the procedure call will be routed to the correct node (e.g., we can deduct it from the distribution parameter of the function).

SELECT create_distributed_function(' procedure_to_pushdown (int)', ‘dist_key’, collocate_with:‘distributed_table’);

With that, Citus needs only 1 round-trip to the worker node per procedure call. This might save a lot of round-trips, especially if the function has lots of commands in it.

And then, you get all the benefits described in this blog

2) Secondary use-case for distributed function:

If you have very simple functions that DO NOT return rows, but can only be used in the SELECT clause. Say you have this value_plus_one() function. If you use this function on a distributed query, it’ll use this function on the shards and you’ll get an error (with some logs enabled):

SELECT value_plus_one(key) FROM distributed_table;
….
NOTICE:  issuing SELECT public.value_plus_one(a) AS value_plus_one FROM public.test_102012 test WHERE true
ERROR:  function public.value_plus_one(integer) does not exist
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
CONTEXT:  while executing command on localhost:9705

As a convenience to be able to create the functions on the workers, you can use create_distributed_function without distribution arguments:

SELECT create_distributed_function('value_plus_one(int)');

With that, the function will be available on the worker, and the error can be avoided:

select value_plus_one(a) from test;
value_plus_one
----------------
(0 rows)

Time: 127.368 ms

In other words, this is mostly equivalent for running CREATE FUNCTION .. commands on the workers, no magic happens.

3) Load balancing function calls that are co-located with reference tables

/ todo: fill up later /

citusdata / citus_docs

Clarify the distributed functions #1024

What are the typical use cases?