cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Cannot import large (+200GB) PostgreSQL dump via nodelocal #51602

Closed: byakku closed this issue 4 years ago

byakku commented 4 years ago

Describe the problem

I'm using CockroachDB v19.2 in Docker. While importing a 215 GB dump, the import always exits with

pq: communication error: rpc error: code = Canceled desc = context canceled

https://www.cockroachlabs.com/docs/stable/import.html#known-limitation

Using the setting from the docs above does not change anything.

Setup: 3x c5.2xlarge nodes (8 vCPU, 16 GB RAM, 750 GB EBS)

To Reproduce

  1. Set up cluster
  2. Copy the large dump to each node's extern location.
  3. IMPORT PGDUMP 'nodelocal:///big_dump.sql' WITH skip_foreign_keys;
  4. Wait a few hours.
  5. The import fails with "context canceled".

Expected behavior

I expect CockroachDB to import the data successfully.

I made the PostgreSQL dump via:

pg_dump -v --no-owner --no-acl -h rds-example.com -p 5432 -U user --table public.this_huge_table --disable-dollar-quoting > dump.sql

Then I copied the file onto each machine. I'm using nodelocal for the import; the files are on the same disk as the store.
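For reference, a minimal sketch of staging the file for nodelocal (the paths here are assumptions to adjust to your Docker volume mapping; the extern directory lives under each node's store directory, and the same relative path must exist on every node for a nodelocal:/// URL to resolve):

```shell
# Assumed store layout; adjust to your deployment.
EXTERN_DIR="cockroach-data/extern"
mkdir -p "$EXTERN_DIR"

# Stand-in for the real 215 GB pg_dump output:
printf 'CREATE TABLE public.this_huge_table (id INT);\n' > dump.sql

# Repeat this copy on every node so 'nodelocal:///big_dump.sql'
# resolves on whichever node processes the import.
cp dump.sql "$EXTERN_DIR/big_dump.sql"
```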

Additional data / screenshots

During the import I can see the number of replicas increasing, but it drops right after the import crashes with the context error.

Environment:

Let me know if there is any secret feature or debug flag or something; I'll be happy to test it and provide more info if necessary.

blathers-crl[bot] commented 4 years ago

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

adityamaru commented 4 years ago

Hey @byakku, thanks for reporting this! A context cancellation can be triggered for a number of reasons, so we need a little more information before we can debug this issue. For starters, can you provide us with a debug.zip of the cluster where this import failed?

Steps on how to do this can be found at - https://www.cockroachlabs.com/docs/stable/cockroach-debug-zip.html

byakku commented 4 years ago

debug1.zip

Hello @adityamaru, I've attached the debug log. Keep in mind that I had to "sed-out" some values for obvious reasons, but I probably didn't hurt the logs. :smile:
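The scrubbing described above ("sed-out some values") can be done mechanically on the unzipped bundle before re-zipping; a sketch with a fabricated log line and an illustrative pattern (real bundle paths and sensitive fields will differ):

```shell
# Fake log entry standing in for a real one from the unzipped debug bundle:
mkdir -p debug/nodes/1
printf 'conn from 10.0.0.5 user=admin password=hunter2\n' > debug/nodes/1/cockroach.log

# "sed out" sensitive values in place (GNU sed; the pattern is illustrative):
sed -i 's/password=[^ ]*/password=REDACTED/g' debug/nodes/1/cockroach.log

cat debug/nodes/1/cockroach.log
```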

adityamaru commented 4 years ago

Hi @byakku,

I snooped through the debug.zip and observed a couple of things:

[Screenshot, 2020-07-22: list of IMPORT jobs with start and end times]

This is the list of the IMPORT jobs with their start and end times. The jobs that failed with context canceled are the PGDUMP IMPORTs in question.

As we see in the screenshot of the jobs above, all of the CSV IMPORT attempts made after the nodes had connected to one another error out with non-context-cancellation errors. My hypothesis is that there were connectivity issues between the nodes, and that these cancelled the context and failed the initial imports.

My recommendation would be to retry the PGDUMP imports on a cluster where all 3 nodes have a stable connection (the latest state of the cluster appears stable).

byakku commented 4 years ago

Hello @adityamaru

Thanks for checking that!

The thing is, we have the "great firewall" to go through, so connectivity may not be stable 24/7, though it usually is, with random latency (~200–300 ms). We cannot really do anything about that right now.

CSV imports are working much better; currently we are splitting the huge table into parts, and that seems to be working.
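The splitting approach can be sketched as follows (file names and chunk size are assumptions; the point is that with smaller files, a dropped connection fails only one chunk rather than the whole multi-hour import):

```shell
# Tiny stand-in for the exported table; the real one is ~215 GB.
seq 1 250 | sed 's/^/val/' > big_table.csv

# Split into fixed-size chunks (GNU split); 100 lines here,
# millions of lines per chunk in practice.
split -l 100 -d --additional-suffix=.csv big_table.csv chunk_

ls chunk_*.csv
# Each chunk, copied to the extern directory, is then imported
# separately, e.g.:
#   IMPORT INTO this_huge_table CSV DATA ('nodelocal:///chunk_00.csv');
```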

Is there a way to keep CockroachDB from failing the job when there are connection issues, or to increase the gRPC timeout?

adityamaru commented 4 years ago

Unfortunately, IMPORT in its current state is not very resilient to node failures. IMPORT runs as a long-running job; checkpointing progress and picking up from where it left off after a node failure is non-trivial. While we do have a lot of this progress-tracking logic checked in, there is still some work left before we can avoid marking a job as "failed" when we see certain kinds of errors (e.g. node connection failures). We are continuously improving and making it more resilient!
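Until the job framework handles this natively, a client-side workaround is to wrap each (small) import in a retry loop; a sketch, under the assumption that a failed IMPORT is rolled back before the statement is re-issued:

```shell
# Generic retry helper (POSIX sh): run a command up to N times,
# returning success on the first attempt that exits 0.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i/$attempts failed; retrying" >&2
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage against the cluster (flags are illustrative):
#   retry 5 cockroach sql --insecure \
#     -e "IMPORT INTO this_huge_table CSV DATA ('nodelocal:///chunk_00.csv');"
```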

Hope the CSV imports are working well 🙂

adityamaru commented 4 years ago

Closing this issue, please feel free to comment/reopen if need be.