citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus
GNU Lesser General Public License v3.0
1.06k stars 63 forks source link

Add COPY utility to pg_shard #156

Closed knizhnik closed 8 years ago

knizhnik commented 8 years ago

I fixed problems and implement suggestions mentioned in your comments and answered rest of them. So if there is no reply to the comment, then I have fixed it.

marcocitus commented 8 years ago

Thanks for the improvements. It's pretty impressive what you've managed to put together in a few days. I especially like the pluggable transaction manager API. This will also make it much easier to also plug in DTM at some point.

You can use uncrustify to enforce formatting rules. You can install it from http://uncrustify.sourceforge.net/ uncrustify -c .uncrustify.cfg --no-backup src/copy.c

I'll answer all the open items tomorrow. In general, we would like to see some subproblems moved into separate functions to avoid deep nesting and make it easier to understand the code. We also have a 90 column limit. It would be even better if you can stick to the 80 character limit used by PostgreSQL without wrapping a lot of lines.

marcocitus commented 8 years ago

I've responded to the comments in PR #155 . Will be out for a few days. Happy new year!

marcocitus commented 8 years ago

Thanks! I'm running some tests and we're doing an internal demo today

A few open issues.

Regression tests fail on travis (see bottom of PR page) because of different port numbers.

Pressing Ctrl+C while running:

COPY customer_reviews FROM '/home/ec2-user/customer_reviews_1998.csv' WITH CSV;

causes a segmentation fault

Pressing Ctrl+C while running:

\COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;

doesn't roll back the transaction (should check postgres' QueryCancelPending)

The number of rows is missing from the output when copying to a distributed table:

postgres# \COPY customer_reviews_local FROM 'customer_reviews_1998.csv' WITH CSV;
COPY 589859
postgres# \COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;
COPY
marcocitus commented 8 years ago

When I try to use the 2PC transaction manager, I get a lot the following result:

postgres=# SET pg_shard.copy_transaction_manager TO '2pc';
SET
postgres=# \COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;
WARNING:  Bad result from ip-10-186-176-41.ec2.internal:5432
DETAIL:  Remote message: prepared transaction with identifier "pgshard_28544_256" does not exist
WARNING:  Failed to commit prepared transaction for shard 10127
....

It appears to be using the same transaction id for all transactions when committing.

marcocitus commented 8 years ago

In PostgreSQL's COPY, we get to see the line number when there is a parsing error.

postgres=# \COPY customer_reviews FROM 'invalid-customer_reviews_1998.csv' WITH CSV;
ERROR:  missing data for column "review_rating"
postgres=# \COPY customer_reviews_local FROM 'invalid-customer_reviews_1998.csv' WITH CSV;
ERROR:  missing data for column "review_rating"
CONTEXT:  COPY customer_reviews_local, line 589860: "ATVPDKIKX0DER,"

Could we get the same in distributed COPY?

marcocitus commented 8 years ago

I noticed that a regular user can perform COPY from/to a file or program. In PostgreSQL's copy.c there are some check to ensure only superusers can perform these. Could you add those? Ideally, we would check permissions as well.

marcocitus commented 8 years ago

I now get a segmentation fault if I cancel twice:

postgres=# \COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;
^CCancel request sent
ERROR:  COPY was canceled
Time: 5721.356 ms
postgres=# \COPY customer_reviews FROM 'customer_reviews_1998.csv' WITH CSV;
^CCancel request sent
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
Time: 671.860 ms
knizhnik commented 8 years ago

In PostgreSQL's COPY, we get to see the line number when there is a parsing error.

  1. I think that will be necessary to transfer message detail. Right now ReportRemoteError is not extracting PG_DIAG_MESSAGE_DETAIL and use primary message as message detail: errdetail("Remote message: %s", remoteMessage) Do you want me to change it?
knizhnik commented 8 years ago

Another issue is that reported line will be valid only for data redirected to this shard. There is not mapping with lines in the original file and hard to reconstruct this mapping. So may be there is no much sense in trying to copy detailed error message from replica?

marcocitus commented 8 years ago
  1. I think that will be necessary to transfer message detail.

The worker node doesn't know what the original line number in the file was. I think it would be more useful to count the line number on the master and show it in a message in the PG_CATCH block.

knizhnik commented 8 years ago

The worker node doesn't know what the original line number in the file was. I think it would be more useful to count the line number on the master and show it in a message in the PG_CATCH block.

Please find that error happens on replica, not at mast and due to asynchronous nature of PQputCopyData/PQputCopyEnd we do not know when precisely it happen

knizhnik commented 8 years ago

Can not reproduce the crash: postgres=# copy company from '/tmp/company.csv' WITH CSV; ^CCancel request sent ERROR: COPY was canceled postgres=# copy company from '/tmp/company.csv' WITH CSV; ^CCancel request sent ERROR: COPY was canceled postgres=# copy company from '/tmp/company.csv' WITH CSV; COPY 1000000 postgres=# ^C

knizhnik commented 8 years ago

postgres=# \copy company from '/tmp/company.csv' WITH CSV; ^CERROR: COPY from stdin failed: canceled by user postgres=# \copy company from '/tmp/company.csv' WITH CSV; ^CCancel request sent ERROR: COPY was canceled postgres=# \copy company from '/tmp/company.csv' WITH CSV; ^CCancel request sent ERROR: COPY was canceled postgres=# \copy company from '/tmp/company.csv' WITH CSV; COPY 1000000

marcocitus commented 8 years ago

Please find that error happens on replica, not at mast and due to asynchronous nature of PQputCopyData/PQputCopyEnd we do not know when precisely it happen

For parsing errors, doesn't NextCopyFrom error out?

knizhnik commented 8 years ago

I noticed that a regular user can perform COPY from/to a file or program. In PostgreSQL's copy.c there are some check to ensure only superusers can perform these. Could you add those? Ideally, we would check permissions as well.

Permissions will be in any case checked at replicas. Why do we need to do it at master?

marcocitus commented 8 years ago

Permissions will be in any case checked at replicas. Why do we need to do it at master?

When you do a COPY FROM 'file' that file is read on the master, and if you do COPY FROM PROGRAM, that program is executed on the master. Only superuser should be allowed to do this. The workers just get a COPY FROM STDIN, which doesn't require superuser.

marcocitus commented 8 years ago

Looks pretty good now. We're happy about the performance and behaviour.

I am seeing some issues when something fails during the COPY. If I do killall postgres on a worker node while a COPY is going on, then the master gets a segmentation fault.

I'm also seeing segmentation faults when trying to ingest many files in parallel (exhausting max_connections on the workers). I was running:

split -nl/64 customer_reviews_1998.csv chunks/
find chunks -type f | time xargs -t -n1 -P64 sh -c "psql -c \"\\COPY customer_reviews FROM '\$0' WITH CSV\""
knizhnik commented 8 years ago

Did you use the latest version? Yesterday I have fixed one reason of SIGSEGV. I can not reproduce the error: knizhnik@knizhnik:~/dtm-data/customer_reviews$ find chunks -type f | time xargs -t -n1 -P64 sh -c "psql postgres -c \"\COPY customer_reviews FROM '\$0' WITH CSV\"" sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ba sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ad sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/az sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/an sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cj sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bu sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cb sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ax sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bf sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bp sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bx sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ak sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/af sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cd sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ao sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/be sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ay sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bl sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cc sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bo sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cg sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/am sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/av sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bi sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ab sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ca sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bg sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/aj sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bk sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bc sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ag sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ah sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/as sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bj sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bb sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cl sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bv sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bz sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/aa sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bw sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/cf sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/al sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/au sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bh sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bm sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/aw sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/aq sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/br sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bs sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ck sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ae sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ar sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ci sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bd sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/at sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bq sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ch sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bt sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/by sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/bn sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ap sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ac sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ai sh -c psql postgres -c "\COPY customer_reviews FROM '$0' WITH CSV" chunks/ce COPY 18286 COPY 18267 COPY 18316 COPY 18271 COPY 18306 COPY 18212 COPY 18290 COPY 18356 COPY 18323 COPY 18434 COPY 18367 COPY 18302 COPY 18410 COPY 18293 COPY 18145 COPY 18252 COPY 18245 COPY 18202 COPY 18341 COPY 18339 COPY 18379 COPY 18207 COPY 18399 COPY 18408 COPY 18387 COPY 18265 COPY 18347 COPY 18360 COPY 18421 COPY 18341 COPY 18272 COPY 18246 COPY 18415 COPY 18228 COPY 18488 COPY 18370 COPY 18369 COPY 18382 COPY 18377 COPY 18322 COPY 18394 COPY 18353 COPY 18262 COPY 18348 COPY 18245 COPY 18247 COPY 18400 COPY 18194 COPY 18349 COPY 18332 COPY 18378 COPY 18397 COPY 18225 COPY 18362 COPY 18229 COPY 18441 COPY 18389 COPY 18360 COPY 18347 COPY 18327 COPY 18168 COPY 18295 COPY 18245 COPY 18418 0.29user 0.24system 0:04.31elapsed 12%CPU (0avgtext+0avgdata 3644maxresident)k 0inputs+0outputs (0major+17163minor)pagefaults 0swaps

marcocitus commented 8 years ago

You're right, that fixes it. Great!