tbg opened 3 years ago
cc @vrongmeal I think @knz will have ideas here.
The first reaction I have is whether we could start and make do with only a subset of the timeseries; specifically those used to compute the UI dashboards. Then we could use the existing `/ts` endpoint and issue N regular ts queries, one for each ts of interest.

If we agreed to do this, then the default need not be to disable the fetch; it could be enabled by default.
I don't disagree that we also want a feature to fetch all the timeseries over a configurable period, but it seems to me that we would only want to do that as a next step.
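A minimal sketch of what "N regular ts queries, one per ts of interest" could look like. The metric names and the JSON field shapes below (`start_nanos`, `end_nanos`, `queries[].name`, and the `/ts/query` path) are illustrative assumptions about the timeseries query endpoint, not verified against the actual API:

```python
# Hypothetical subset of metrics; real names would come from the UI dashboards.
DASHBOARD_METRICS = [
    "cr.node.sql.conns",
    "cr.node.sql.query.count",
    "cr.store.capacity.available",
]

def build_ts_queries(start_nanos, end_nanos, metric_names):
    """Build one request body per metric, mirroring how a client could
    issue individual queries against the /ts/query endpoint (field names
    are assumed for illustration)."""
    return [
        {
            "start_nanos": start_nanos,
            "end_nanos": end_nanos,
            "queries": [{"name": name}],
        }
        for name in metric_names
    ]

bodies = build_ts_queries(0, 3_600_000_000_000, DASHBOARD_METRICS)
# Each body would then be POSTed separately, e.g.:
#   requests.post("https://<node>:8080/ts/query", json=body)
print(len(bodies))  # one request per timeseries of interest
```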
So what are you proposing concretely? Not fetching all timeseries seems ok to me, but that is supported by `Dump` as well. Are you suggesting just using `ts.Query` (i.e. the endpoint the admin UI uses)? What time range will that fetch? Will we be able to visualize the data? I don't see what the benefit here is over using `Dump`, for which we have the existing (though bare-bones) means of visualization in `./scripts/localmetrics`.

Looking at `Query`, it's fairly similar, so I agree that we could use it instead. In fact, I was probably a little misguided investing as much into `Dump`. It looks at first glance like we should be able to implement a fairly ok version of dump using `Query`?
Yes, I was indeed suggesting to use `Query`. And specifically, not to scan over all timeseries but to pick only a subset manually and hard-code it into zip. We can provide more flexibility later.
We now have `cockroach debug tsdump`, which takes `--from` and `--to` flags, so in principle we can now add this "easily". However, I'm not sure it's fast enough for default use. The problem is that it incurs lots of sequential round trips, since it pulls each timeseries by name.

I ran an experiment by creating a 12-node geo-distributed cluster, left it running for an hour or so, and then dumped the timeseries through each node in turn. You can see that for some nodes this is fast: they're probably close to the leaseholders for "all ts ranges" (there's probably only one at this point). For others it is very slow, because they have to do several hundred round trips to the leaseholder. I don't think the amount of data factors in too heavily here; I could probably pull a day of data without the worst numbers getting much worse, but I have not verified.
Still, this would be a very good addition to `debug zip` now that it's so straightforward, even if it's disabled by default. We could then advise TSEs to use a one-week period for everything that is not geo-distributed (assuming this is tolerably fast). I believe this has the potential to turbocharge certain classes of investigations, since the visualization side is now fully working and also reasonably ergonomic.
```
$ for i in $(seq 1 12); do time ./cockroach debug tsdump --insecure --host $(roachprod ip tobias-ui:$i --external) --format=raw > tswan.gob; echo "^-- ${i}"; done

real    0m4,020s
user    0m1,026s
sys     0m0,264s
^-- 1

real    0m4,439s
user    0m1,033s
sys     0m0,257s
^-- 2

real    0m3,927s
user    0m1,050s
sys     0m0,239s
^-- 3

real    0m3,504s
user    0m1,053s
sys     0m0,262s
^-- 4

real    2m35,235s
user    0m1,292s
sys     0m0,343s
^-- 5

real    2m33,590s
user    0m1,258s
sys     0m0,388s
^-- 6

real    2m35,870s
user    0m1,281s
sys     0m0,419s
^-- 7

real    2m34,698s
user    0m1,338s
sys     0m0,462s
^-- 8

real    3m24,830s
user    0m1,438s
sys     0m0,493s
^-- 9

real    3m21,216s
user    0m1,575s
sys     0m0,531s
^-- 10

real    3m18,436s
user    0m1,601s
sys     0m0,537s
^-- 11

real    3m21,519s
user    0m1,659s
sys     0m0,466s
^-- 12
```
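A back-of-the-envelope model connects the sequential round trips to the observed wall times. The per-round-trip latencies below are assumptions (not measured in the experiment), chosen to show that latency, not data volume, dominates:

```python
SERIES_COUNT = 400  # roughly the number of timeseries names pulled one by one

def total_ms(rtt_ms, round_trips=SERIES_COUNT):
    # One sequential round trip per series name; data volume is secondary,
    # so total time is dominated by round_trips * latency.
    return rtt_ms * round_trips

# Assumed RTTs: ~10 ms near the leaseholder, ~400 ms cross-region.
print(total_ms(10))   # 4000 ms: same order as the fast nodes (~4 s)
print(total_ms(400))  # 160000 ms: same order as the slow nodes (2.5-3.5 min)
```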
**Is your feature request related to a problem? Please describe.**
As outlined in https://github.com/cockroachdb/cockroach/issues/46103, we have no ability to explore the time series stored in a cluster unless it is a CC deployment or we're on a screenshare with the customer. Often, screenshots of metrics pages are requested and attached to tickets. This method of debugging is slow and incurs work on both sides.
**Describe the solution you'd like**
`debug zip` should contain a timeseries dump covering the incident, plus a good amount of buffer around the incident (to be able to compare "normal" vs "exceptional").

**Describe alternatives you've considered**

**Additional context**
@vrongmeal recently landed support for specifying the time interval for ts dumps: https://github.com/cockroachdb/cockroach/pull/57481
Downloading all timeseries may be slow, especially in a geo-distributed cluster. As a quick win, @vrongmeal is looking into adding timeseries download as an opt-in feature to `debug zip`.

Straw man: `--timeseries-from=X` and `--timeseries-to=Y`, where `X` and `Y` can either be absolute UTC references or Go durations (for example `--timeseries-from=-24h`, `--timeseries-to=-1h`). Both default to `-0h`, meaning no timeseries are downloaded.

The timeseries dump implementation is not optimal. To make it optimal, it will need to improve memory accounting and also work around DistSender's lack of concurrency (which is a result of MaxSpanRequestKeys).
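The straw-man flag values could be interpreted along these lines. This is a sketch of the proposed semantics only, not the actual flag parsing in cockroach; the regex handles just the `h`/`m`/`s` components of Go-style durations:

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified Go-duration pattern covering forms like -24h, -1h, -0h, 90m, 30s.
_DUR = re.compile(r"^(-?)(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")

def parse_ts_flag(value, now):
    """Interpret a --timeseries-from/--timeseries-to value as either a
    Go-style duration relative to now or an absolute UTC reference."""
    m = _DUR.match(value)
    if m and value not in ("", "-"):
        sign = -1 if m.group(1) else 1
        delta = timedelta(
            hours=int(m.group(2) or 0),
            minutes=int(m.group(3) or 0),
            seconds=int(m.group(4) or 0),
        )
        return now + sign * delta
    # Fall back to an absolute UTC timestamp.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

now = datetime(2021, 1, 2, tzinfo=timezone.utc)
print(parse_ts_flag("-24h", now))  # 2021-01-01 00:00:00+00:00
print(parse_ts_flag("-0h", now))   # equals now: no timeseries downloaded
print(parse_ts_flag("2021-01-01T12:00:00Z", now))  # absolute UTC reference
```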
A good entry point for understanding the limitations of the current Dump implementation is
https://github.com/cockroachdb/cockroach/blob/eced6fa0660cdce024203f0eb00da6e90e9180e7/pkg/ts/server.go#L325-L344
At a glance, you can see that we're iterating over ~400 individual series. Each iteration then paginates in chunks of 1000 keys. This has the potential to be painfully slow in general.
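The cost of that loop can be made concrete with a small model. The 400-series and 1000-key figures come from the text above; the samples-per-series number is an assumption, and treating each sample as one key overstates the key count somewhat since samples are stored in slabs:

```python
SERIES = 400      # ~400 individual series, iterated over sequentially
PAGE_SIZE = 1000  # pagination chunk (MaxSpanRequestKeys)

def sequential_requests(samples_per_series):
    # Each series needs ceil(samples / PAGE_SIZE) paginated scans, and
    # neither the series loop nor the pagination runs concurrently.
    pages = -(-samples_per_series // PAGE_SIZE)  # ceiling division
    return SERIES * pages

# E.g. a day of 10s-resolution data is ~8640 samples per series:
print(sequential_requests(8640))  # 400 series * 9 pages = 3600 sequential scans
```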
Jira issue: CRDB-6285