Open · tbg opened this issue 1 year ago
cc @cockroachdb/disaster-recovery
@tbg The 5min timeout is per request, not per range, and regardless of how large the range is, any given request is sent with a 16 MiB pagination size limit. Why does a 16 GiB range take longer than a 512 MiB range to read the same 16 MiB?
I updated the issue to say that the pagination should be based on bytes processed, not bytes returned - you are probably right that it's an incremental that doesn't return ~anything and so has to read the full 16 GiB. Feel free to retitle, adjust, etc!
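(Purely as an illustration, with made-up throughput numbers: if a request scans on the order of 50-100 MB/s, reading all 16 GiB takes roughly three to five-plus minutes, so a single request that has to read the whole range before it can return anything runs right up against the 5-minute per-request timeout, while a 512 MiB range is done in a few seconds.)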
Describe the problem
In https://github.com/cockroachdb/cockroach/issues/104588 we're seeing a backup fail to back up a 16 GiB range. I've learned that ExportRequest paginates on the bytes it returns (16 MiB per request); in this case it was likely an incremental that found ~nothing new and so had to scan the entire 16 GiB range before it could return, which is no bueno - very expensive.
While we don't endorse, let alone support, 16 GiB ranges, it stands to reason that backup should be able to back up ranges of any size, since ranges can sometimes grow that large through no fault of the operator.
Also, we are entertaining the idea of significantly increasing the default range size, which will likely put this issue on the menu in at least some deployments.
So we should find a way to paginate on "bytes processed" rather than "bytes returned".
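As a rough sketch of the idea (in Go; all names and limits here are made up for this issue and are not taken from the actual ExportRequest code): the export loop would track both the bytes it has added to the response and the bytes it has iterated over, and return a resume key as soon as either budget is exhausted.

```go
// Illustrative only: not the actual ExportRequest code, and all names are made up.
package export

// kv is a single key/value version seen by the export iterator.
type kv struct {
	key, value   []byte
	inTimeWindow bool // true if this version falls in the incremental's (startTime, endTime]
}

// page is the result of one paginated export request.
type page struct {
	kvs       []kv
	resumeKey []byte // non-nil if the scan stopped early and must be resumed
}

const (
	maxReturnedBytes  = 16 << 20 // today's limit: bytes added to the response
	maxProcessedBytes = 64 << 20 // proposed limit on bytes read; the actual value/knob is TBD
)

// exportPage consumes key/value versions from next() until the range is
// exhausted or either byte budget is used up.
func exportPage(next func() (kv, bool)) page {
	var p page
	var returned, processed int64
	for {
		cur, ok := next()
		if !ok {
			return p // reached the end of the range
		}
		size := int64(len(cur.key) + len(cur.value))
		processed += size
		if cur.inTimeWindow {
			p.kvs = append(p.kvs, cur)
			returned += size
		}
		// Checking only `returned` lets an incremental that matches nothing
		// scan the entire range in one request; also checking `processed`
		// bounds the work any single request can do.
		if returned >= maxReturnedBytes || processed >= maxProcessedBytes {
			p.resumeKey = cur.key
			return p
		}
	}
}
```

Whether the processed-bytes budget ends up being a byte count, a time budget, or something tied to admission control is secondary; the point is that the cost of a single request should be bounded by the work it does, not by the size of what it returns.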
To Reproduce
Presumably doing what the linked roachtest does to get the large range and then trying to back up the table will reproduce it.
Related
https://github.com/cockroachdb/cockroach/issues/103879 is about a similar issue when sending snapshots.
Jira issue: CRDB-30090