cockroachdb / cockroach

server: Problem Ranges report timing out on user cluster #34311

Closed andreimatei closed 5 years ago

andreimatei commented 5 years ago

@matbhuvi's cluster has trouble generating that report. We seem to have a 3s timeout for generating that report, which is silly. I'm gonna increase it. Regardless, it's unclear why it times out. I've tried on a cluster we have with some reasonable amount of data, and the report was instantaneous. This is hard to debug... I was hoping I could trace the generation of the report, but I'm having lots of trouble. Opened #34310.

The only way to debug that I can think of is to try to get some goroutine stack dumps at the moment the report generation is running. @matbhuvi, would you mind trying to generate that report a few times and, immediately after you refresh the report page, switching to the debug page of a couple of other nodes and clicking on the All Goroutines link? If the results contain anything from status.go, they might be useful for us. If not, I'm not sure how to debug further, although I'll work on generally improving that code and its debuggability for the future. Thanks!

matbhuvi commented 5 years ago

I was able to capture a dump containing status.go once, after multiple retries. I have shared the log: goroutine.log

andreimatei commented 5 years ago

Thanks, will look. I wanted to ask: does it still time out after restarting the servers?

matbhuvi commented 5 years ago

> I wanted to ask: does it still time out after restarting the servers?

I just tried that out. It worked about once in 5-6 attempts. Whenever it worked, I noticed the page always showed "Connections (via Node 1)". Not sure whether that is relevant to the problem here.

Problem Ranges _ Jan31.pdf

andreimatei commented 5 years ago

Sorry for letting this sit; I've been gone skiing for a few days.

The goroutine dump you've sent doesn't tell us anything beyond the fact that the code trying to generate the report is busy iterating through the ranges (we need to iterate through all the ranges).

Can you please clarify something about the screenshot you've sent: if that's what it looks like when you consider it to be "working", what does it look like when it's not working? Because that screenshot looks like the effects of the timeout to me.