Support direct shard COPY

marcocitus commented 3 years ago

The process utility hook gives us some liberty to deal with non-existent names in utility commands such as COPY.

That means we could support commands that copy directly to/from shards via the coordinator, e.g.:

COPY data_102008 FROM STDIN;
COPY data_102008 TO STDOUT;

If the table does not exist, the coordinator could check whether it matches a shard name and redirect the COPY to the shard.

This would be helpful for migrations to and from Citus.

aykut-bozkurt commented 1 year ago

One of the customers asked for the scalable migration from citus to s3 buckets in csv or parquet format. Then I tried to give a prototype scalable 'copy distributed_table to program'. The udf scalable_copy_to_csv accepts a table name and an output folder to persist csv files. It basically pushes the copy <dist_table> to program 'bash s3-uploader.sh <output folder>'command on all workers and does the copying of the table's shards in parallel. Shards are copied into a sample program in which we read all the data of the shard and persist it into a csv file with the same name with the shard. After perisence to csv, we can have custom s3 upload logic(not implemented, left to customer now, it can be upload to any provider).

Here is the repo for the solution.

I wonder if we can make that solution into citus. Some important points that can be questioned:

The solution currently works only on nonpartitoned distributed tables.
It blocks the table ddl and and shards' dml by taking locks on coordinator.

gledis69 commented 1 year ago

COPY data_102008 TO '{PATH}/data_102008'; Should just work. A convenient functionality to implement for this would be: COPY data TO '{PATH}/data' WITH (COPY_SHARDS); which would then propagate COPY data_xxxx TO '{PATH}/data_xxxx'; to each worker appropriately.

Optionally, also: COPY data FROM '{PATH}/data' WITH (COPY_SHARDS); where {PATH} can be scanned for data_xxxx files and then propagate COPY data_xxxx-real FROM '{PATH}/data_xxxx'; where xxxx-real is the actual shard id which may or may not be the same as the shard id in the files. This second suggestion though is trickier since it has some implicit constraints that I can think of such as:

assuming integrity of the shard files (us validating the shards would likely defeat the purpose of improved efficiency for migrations)
the dist table needs to have the same amount of shards as shard files in PATH
for v: all values in dist column for file data_xxxx copied into data_xxxx-real shard the hash function used to distribute values to shards should send v to the data_xxxx-real shard.

onderkalaci commented 1 year ago

I'm a bit cautious of directly accessing to the shard via data_102008 or such for reasons:

Shards should mostly be invisible to the user, just an implementation detail of Citus. So, lets not put them to such central part of the api
All locking is done for the distributed tables, not shards. We need to find a good way to handle locks
Now that we have internal/external backend concepts, I also wonder if accessing a shard from an external backend would break anything?

So, if possible COPY dist_table ... WITH (only_shardid:=XXX) type of syntax looks nicer to me

citusdata / citus

Support direct shard COPY #4943