cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Cannot import large (+200GB) PostgreSQL dump via nodelocal #51602

Closed: byakku closed this issue 4 years ago

byakku commented 4 years ago

Describe the problem

I'm using CockroachDB v19.2 in Docker. While importing a 215 GB dump, the import always exits with

pq: communication error: rpc error: code = Canceled desc = context canceled

https://www.cockroachlabs.com/docs/stable/import.html#known-limitation

Using the setting from the docs above does not change anything.

Setup: 3x c5.2xlarge nodes (8 vCPU, 16 GB RAM, 750 GB EBS)

To Reproduce

  1. Set up cluster
  2. Copy the large dump to each node's extern location.
  3. IMPORT PGDUMP 'nodelocal:///big_dump.sql' WITH skip_foreign_keys;
  4. Wait a few hours.
  5. The import fails with "context canceled".

Expected behavior

I expect CockroachDB to import the data successfully.

I made the PostgreSQL dump via:

pg_dump -v --no-owner --no-acl -h rds-example.com -p 5432 -U user --table public.this_huge_table --disable-dollar-quoting > dump.sql

Then I copied the file onto each machine. I'm using nodelocal for the import; the files are on the same disk as the store.
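For reference, a minimal sketch of staging the file for nodelocal (the paths here are assumptions to adjust to your Docker volume mapping; the extern directory lives under each node's store directory, and the same relative path must exist on every node for a nodelocal:/// URL to resolve):

```shell
# Assumed store layout; adjust to your deployment.
EXTERN_DIR="cockroach-data/extern"
mkdir -p "$EXTERN_DIR"

# Stand-in for the real 215 GB pg_dump output:
printf 'CREATE TABLE public.this_huge_table (id INT);\n' > dump.sql

# Repeat this copy on every node so 'nodelocal:///big_dump.sql'
# resolves on whichever node processes the import.
cp dump.sql "$EXTERN_DIR/big_dump.sql"
```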

Additional data / screenshots

During the import I can see the number of replicas increasing, but it drops right after the import crashes with the context error.

Environment:

Let me know if there is any secret feature or debug flag or something; I'll be happy to test it and provide more info if necessary.

blathers-crl[bot] commented 4 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

adityamaru commented 4 years ago

Hey @byakku, thanks for reporting this! A context cancellation can be triggered for a number of reasons, so we need a little more information before we can debug this issue. For starters, can you provide us with a debug.zip of the cluster where this import failed?

Steps on how to do this can be found at - https://www.cockroachlabs.com/docs/stable/cockroach-debug-zip.html

byakku commented 4 years ago

debug1.zip

Hello @adityamaru, I've attached the debug log. Keep in mind that I had to "sed-out" some values for obvious reasons, but I probably didn't hurt the logs. :smile:
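The scrubbing described above ("sed-out some values") can be done mechanically on the unzipped bundle before re-zipping; a sketch with a fabricated log line and an illustrative pattern (real bundle paths and sensitive fields will differ):

```shell
# Fake log entry standing in for a real one from the unzipped debug bundle:
mkdir -p debug/nodes/1
printf 'conn from 10.0.0.5 user=admin password=hunter2\n' > debug/nodes/1/cockroach.log

# "sed out" sensitive values in place (GNU sed; the pattern is illustrative):
sed -i 's/password=[^ ]*/password=REDACTED/g' debug/nodes/1/cockroach.log

cat debug/nodes/1/cockroach.log
```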

adityamaru commented 4 years ago

Hi @byakku,

I snooped through the debug.zip and observed a couple of things:

[Screenshot, 2020-07-22: list of IMPORT jobs with start and end times]

This is the list of the IMPORT jobs with their start and end times. The jobs that failed with context canceled are the PGDUMP IMPORTs in question.

As we see in the screenshot of the jobs above, all of the CSV IMPORT attempts made after the nodes had connected to one another error out with non-context-cancellation errors. My hypothesis is that there were connectivity issues between the nodes, and that these cancelled the context and failed the initial imports.

My recommendation would be to retry the PGDUMP imports on a cluster where all 3 nodes have a stable connection (the latest state of the cluster appears stable).

byakku commented 4 years ago

Hello @adityamaru

Thanks for checking that!

The thing is, we have the "great firewall" to go through, so connectivity may not be stable 24/7, though it usually is, with random latency (~200–300 ms). We cannot really do anything about that right now.

CSV imports are working much better; currently we are splitting the huge table into parts, and that seems to be working.
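The splitting approach can be sketched as follows (file names and chunk size are assumptions; the point is that with smaller files, a dropped connection fails only one chunk rather than the whole multi-hour import):

```shell
# Tiny stand-in for the exported table; the real one is ~215 GB.
seq 1 250 | sed 's/^/val/' > big_table.csv

# Split into fixed-size chunks (GNU split); 100 lines here,
# millions of lines per chunk in practice.
split -l 100 -d --additional-suffix=.csv big_table.csv chunk_

ls chunk_*.csv
# Each chunk, copied to the extern directory, is then imported
# separately, e.g.:
#   IMPORT INTO this_huge_table CSV DATA ('nodelocal:///chunk_00.csv');
```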

Is there a way to keep CockroachDB from failing the job when there are connection issues, or to increase the gRPC timeout?

adityamaru commented 4 years ago

Unfortunately, IMPORT in its current state is not very resilient to node failures. IMPORT runs as a long-running job; checkpointing progress and picking up from where it left off after a node failure is non-trivial. While we do have a lot of this progress-tracking logic checked in, there is still some work left before we can avoid marking a job as "failed" when we see certain kinds of errors (e.g. node connection failures). We are continuously improving and making it more resilient!
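Until the job framework handles this natively, a client-side workaround is to wrap each (small) import in a retry loop; a sketch, under the assumption that a failed IMPORT is rolled back before the statement is re-issued:

```shell
# Generic retry helper (POSIX sh): run a command up to N times,
# returning success on the first attempt that exits 0.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed; retrying" >&2
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage against the cluster (flags are illustrative):
#   retry 5 cockroach sql --insecure \
#     -e "IMPORT INTO this_huge_table CSV DATA ('nodelocal:///chunk_00.csv');"
```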

Hope the CSV imports are working well 🙂

adityamaru commented 4 years ago

Closing this issue, please feel free to comment/reopen if need be.