
CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

cli: download timeseries in `debug zip` #60611

Open tbg opened 3 years ago

tbg commented 3 years ago

Is your feature request related to a problem? Please describe. As outlined in https://github.com/cockroachdb/cockroach/issues/46103, we have no ability to explore the time series stored in a cluster unless it is a CC deployment or we're on a screenshare with the customer. Often, screenshots of metrics pages are requested and attached to tickets. This method of debugging is slow and incurs work on both sides.

Describe the solution you'd like `debug zip` should contain a timeseries dump covering the incident and also a good amount of buffer around the incident (to be able to compare "normal" vs "exceptional").

Describe alternatives you've considered

Additional context @vrongmeal recently landed support for specifying the time interval for ts dumps: https://github.com/cockroachdb/cockroach/pull/57481

Downloading all timeseries may be slow, especially in a geo-distributed cluster. As a quick win, @vrongmeal is looking into adding timeseries download as an opt-in feature to debug zip.

Straw man:

--timeseries-from=X and --timeseries-to=Y where X and Y can either be absolute UTC references or Go durations (for example --timeseries-from=-24h, --timeseries-to=-1h). Both default to -0h, meaning no timeseries are downloaded.

The timeseries dump implementation is not optimal. To improve it, we will need memory accounting, and we will need to work around the DistSender's lack of concurrency (a result of MaxSpanRequestKeys).

A good entry point for understanding the limitations of the current Dump implementation is

https://github.com/cockroachdb/cockroach/blob/eced6fa0660cdce024203f0eb00da6e90e9180e7/pkg/ts/server.go#L325-L344

At a glance, you can see that we're iterating over ~400 individual series. Each iteration then paginates in chunks of 1000. This has the potential to be painfully slow in general.

Jira issue: CRDB-6285

tbg commented 3 years ago

cc @vrongmeal I think @knz will have ideas here.

knz commented 3 years ago

The first reaction I have is whether we could start and make do with only a subset of the timeseries; specifically those used to compute the UI dashboards. Then we could use the existing /ts endpoint and issue N regular ts queries, one for each ts of interest.

If we agreed to do this, then the default need not be to disable the fetch; it could be enabled by default.

I don't disagree that we also want a feature to fetch all the timeseries over a configurable period, but it seems to me that we would only want to do that as a next step.

tbg commented 3 years ago

So what are you proposing concretely? Not fetching all timeseries seems ok to me, but that is supported by Dump as well. Are you suggesting just using ts.Query (i.e. the endpoint the admin ui uses)? What time range will that fetch? Will we be able to visualize the data? I don't see what the benefit here is over using Dump for which we have the existing (though bare-bones) means of visualization in ./scripts/localmetrics.

tbg commented 3 years ago

Looking at Query, it's fairly similar, so I agree that we could use it instead. In fact, I was probably a little misguided in investing as much into Dump. At first glance, it looks like we should be able to implement a reasonably good version of dump using Query?

knz commented 3 years ago

Yes, I was indeed suggesting to use Query.

And specifically, not to scan over all timeseries, but to pick a subset manually and hard-code it into `debug zip`. We can provide more flexibility later.
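A sketch of what that hard-coded subset could look like. The metric names below are illustrative examples only; the real list would be derived from the dashboard definitions:

```go
package main

import "fmt"

// zipTimeseries is a sketch of the fixed list of series `debug zip`
// would fetch by default, covering the main UI dashboards. These names
// are illustrative; the real list would come from the dashboard
// definitions.
var zipTimeseries = []string{
	"cr.node.sql.query.count",
	"cr.node.sys.rss",
	"cr.store.capacity",
	"cr.store.capacity.available",
}

func main() {
	for _, name := range zipTimeseries {
		fmt.Println(name) // issue one ts query per name
	}
}
```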

tbg commented 3 years ago

We now have cockroach debug tsdump which takes --from and --to flags, so in principle we can now add this "easily". However, I'm not sure it's fast enough for default use. The problem is that it incurs lots of sequential round trips, since it pulls each timeseries by name. I ran an experiment: I created a 12-node geo-distributed cluster, left it running for an hour or so, and then dumped the timeseries through each node in turn. As the timings below show, for some nodes this is fast - they're probably close to the leaseholders for "all ts ranges" (there's probably only one at this point) - and for others this is very slow, because they have to do several hundred round-trips to the leaseholder. I don't think the amount of data factors in too heavily here. I could probably pull a day of data without the worst numbers getting much worse, but I have not verified.

Still, this would be a very good addition to debug zip now that it's so straightforward, even if it's disabled by default. We could then advise TSEs to use a one-week period for everything that is not geo-distributed (assuming this is tolerably fast). I believe this has the potential to turbocharge certain classes of investigations, since the visualization side is now fully working and also reasonably ergonomic.

```shell
for i in $(seq 1 12); do
  time ./cockroach debug tsdump --insecure --host $(roachprod ip tobias-ui:$i --external) --format=raw > tswan.gob
  echo "^-- ${i}"
done
```

| node | real | user | sys |
|------|-----------|----------|----------|
| 1 | 0m4,020s | 0m1,026s | 0m0,264s |
| 2 | 0m4,439s | 0m1,033s | 0m0,257s |
| 3 | 0m3,927s | 0m1,050s | 0m0,239s |
| 4 | 0m3,504s | 0m1,053s | 0m0,262s |
| 5 | 2m35,235s | 0m1,292s | 0m0,343s |
| 6 | 2m33,590s | 0m1,258s | 0m0,388s |
| 7 | 2m35,870s | 0m1,281s | 0m0,419s |
| 8 | 2m34,698s | 0m1,338s | 0m0,462s |
| 9 | 3m24,830s | 0m1,438s | 0m0,493s |
| 10 | 3m21,216s | 0m1,575s | 0m0,531s |
| 11 | 3m18,436s | 0m1,601s | 0m0,537s |
| 12 | 3m21,519s | 0m1,659s | 0m0,466s |