Closed byakku closed 4 years ago
Hey @byakku, thanks for reporting this!
A context cancellation can be triggered for a number of reasons, so we need a little more information before we can debug this issue. For starters, can you provide us with a debug.zip of the cluster where this import failed?
Steps on how to do this can be found at https://www.cockroachlabs.com/docs/stable/cockroach-debug-zip.html
Hello @adityamaru, I've attached the debug log. Keep in mind that I had to "sed-out" some values for obvious reasons, but I probably didn't hurt the logs. :smile:
Hi @byakku,
I snooped through the debug.zip and observed a couple of things:
This is the list of the IMPORT jobs with their start and end times. The jobs that failed with context cancelled are the PGDUMP IMPORTs in question.
Multiple PGDUMP imports were attempted between 07/19 and 07/21 on node 1. Unfortunately, node 1 only has logs from 07/20 onwards, so I looked into the IMPORTs attempted after that.
Up until ~13:23 on 07/21, which is after the last PGDUMP import job was started, all the logs on node 1 indicate that the nodes had been unable to establish a stable connection. I saw multiple log lines such as:
grpc: addrConn.createTransport failed to connect to {cockroachdb-prod-node1.example.cloud:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
grpc: addrConn.createTransport failed to connect to {cockroachdb-prod-node2.example.cloud:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
The node connections seem to have stabilized after 13:23 on 07/21.
As we see in the screenshot of the jobs above, all the CSV IMPORT attempts made after the nodes connected to one another error out with errors other than context cancellation. My hypothesis is that the connection issues between the nodes cancelled the context and failed the initial imports.
My recommendation would be to retry the PGDUMP imports on a cluster where all 3 nodes have a stable connection (the latest state of the cluster appears stable).
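To check whether a node's logs still show the dial failures quoted above before retrying, the matching lines can be tallied per target address. The following is only a rough sketch: the regular expression is modelled on the two log lines quoted in this comment, and the function name `count_connect_failures` is made up for illustration. It is not part of any CockroachDB tooling.

```python
import re
from collections import Counter

# Pattern modelled on the gRPC failure lines quoted in this thread (an assumption:
# other CockroachDB versions may format these lines differently).
FAILURE_RE = re.compile(
    r"addrConn\.createTransport failed to connect to \{(?P<addr>\S+)"
)


def count_connect_failures(lines):
    """Return a Counter mapping target address -> number of failed dial attempts."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


# The two log lines from this thread, used as sample input.
sample = [
    'grpc: addrConn.createTransport failed to connect to '
    '{cockroachdb-prod-node1.example.cloud:26257 0 <nil>}. Err :connection error: '
    'desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...',
    'grpc: addrConn.createTransport failed to connect to '
    '{cockroachdb-prod-node2.example.cloud:26257 0 <nil>}. Err :connection error: '
    'desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...',
]
```

Calling `count_connect_failures(open(path))` on a log file from the debug.zip would give a quick per-node count; a count that stops growing after a given timestamp suggests the connections have stabilized.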
Hello @adityamaru
Thanks for checking that!
The thing is, we have the "great firewall" to go through. Connectivity may not be stable 24/7, but it usually is, with latency that varies randomly (~200 - ~300ms). We cannot really do anything about it right now.
CSV imports are working way better; currently we are splitting the huge table into parts, and that seems to be working.
Is there a way to make Cockroach not fail while there are connection issues, or to increase the gRPC timeout?
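For anyone wanting to try the same workaround, splitting a large CSV export into smaller parts might look roughly like the sketch below. It is an illustration only: the function name `split_csv`, the `part_00000.csv` naming scheme, and the choice to repeat the header row in every part are all assumptions, not what was actually used here.

```python
import csv
import itertools
from pathlib import Path


def split_csv(src, out_dir, rows_per_part, has_header=True):
    """Split src into numbered part files with at most rows_per_part data rows each.

    If has_header is True, the first row of src is copied to the top of every part
    (an illustrative choice; drop it if the importer expects header-less files).
    Returns the list of part paths, in order.
    """
    if rows_per_part < 1:
        raise ValueError("rows_per_part must be at least 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None) if has_header else None
        for index in itertools.count():
            # Pull the next batch of rows without loading the whole file into memory.
            chunk = list(itertools.islice(reader, rows_per_part))
            if not chunk:
                break
            part = out_dir / f"part_{index:05d}.csv"
            with open(part, "w", newline="") as out:
                writer = csv.writer(out)
                if header is not None:
                    writer.writerow(header)
                writer.writerows(chunk)
            written.append(part)
    return written
```

Each part can then be imported as its own job, so a dropped connection costs one part rather than the whole table.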
Unfortunately, IMPORT in its current state is not very resilient to node failures. IMPORT runs as a long-running job, and checkpointing progress so that it can pick up where it left off after a node failure is non-trivial. While we do have a lot of this progress-tracking logic checked in, there is still some work needed before we can avoid marking a job as "failed" when we see certain kinds of errors (e.g. node connection failures). We are continuously improving IMPORT and making it more resilient!
Hope the CSV imports are working well 🙂
Closing this issue, please feel free to comment/reopen if need be.
Describe the problem
I'm using CockroachDB v19.2 in Docker. While importing a 215GB dump, it always exits with a "context canceled" error.
https://www.cockroachlabs.com/docs/stable/import.html#known-limitation
Using the setting from the page above does not change anything.
Setup: 3x c5.2xlarge nodes (8 vCPU, 16GB RAM, 750GB EBS)
To Reproduce
extern location.
IMPORT PGDUMP 'nodelocal:///big_dump.sql' WITH skip_foreign_keys;
context canceled
Expected behavior
I expect Cockroach to import the data successfully.
I did psql dump via:
Then I made the files available on each machine. I'm using nodelocal while importing; they are on the same disk.
Additional data / screenshots
During import I can see the number of replicas increasing, but it drops right after crashing with the context error.
Environment:
Let me know if there is any secret-feature or debug flag or something, I'll be happy to test that and provide more info if necessary.