tbg opened this issue 6 years ago
@RaduBerinde and perhaps @tschottdorf, do you mind giving an approximate time estimate for this? It doesn't sound like it would take too long. `SELECT .. FROM system.table_statistics` makes sense to me.
`system.table_statistics` is meant for powering query planning. It is possible that it could contain approximate disk size, but I see some impedance mismatch. For example, I believe that table only contains stats on the primary index right now, while for the admin UI I imagine you'd want to include the approximate disk size of indexes too. Also, the refresh rate of `system.table_statistics` might be too low for the admin UI. Lastly, table statistics are only going to be populated on demand (via a `CREATE STATISTICS` call) for 2.0.
Adding a `crdb_internal` virtual table (or a column to an existing table) and powering it via a fan-out `TableStats` might be more tenable in the near term.
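To sketch what that could look like (the vtable name and columns here are hypothetical, purely to illustrate the shape, not an existing interface):

```sql
-- Hypothetical crdb_internal vtable, populated via a TableStats fan-out.
SELECT table_id, table_name, approx_disk_bytes
FROM crdb_internal.table_disk_usage
WHERE table_name = 'users';
```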
@petermattis makes sense to me, although my understanding was that we don't do much in terms of documenting `crdb_internal`. @jseldess correct me if I'm wrong?
I think we are rapidly coming to the point where we have to consider exposing metrics in the admin UI also through either SQL or some other API for users trying to automate things based on metrics, so I'm a bit concerned about choosing one path here that we then have to change when we finally have time to think about how to expose these metrics in a way that is easily consumable by users in the format they want. What are your thoughts on that?
That was the basic agreement in the past, yes, @dianasaur323. @knz was involved in that decision. Could we add a SQL statement that queries that internal table?
> I think we are rapidly coming to the point where we have to consider exposing metrics in the admin UI also through either SQL or some other API for users trying to automate things based on metrics, so I'm a bit concerned about choosing one path here that we then have to change when we finally have time to think about how to expose these metrics in a way that is easily consumable by users in the format they want. What are your thoughts on that?
I'm in agreement that we should be able to access our internal metrics via SQL. Adding SQL statements for each of these would likely result in an explosion of statements. I believe the hesitance in the past to document `crdb_internal` was to give ourselves freedom to make backwards-incompatible changes. But we can either revisit that decision or design another set of virtual tables for which we promise to maintain compatibility.
In this case, should we just do the `crdb_internal` virtual table approach for this specific metric and revisit when more people start asking for programmatic access to these metrics?
I think the virtual table makes sense for now. Querying this programmatically is generally a little dangerous too because the work done in computing these quantities isn't trivial, so we shouldn't exactly advertise it.
But this also constrains us somewhat: if we make this a virtual table with strawman schema `(table_id, approx_size_bytes)`, will any query to this table compute the stats for all tables? We should avoid that.
If that restriction makes things difficult, a more straightforward way is to make that a function (so that it can only be invoked for individual tableIDs).
With a little bit of work we can plumb a filter to the virtual table generator function so it only computes it for one table if there is a `WHERE table_id = x` clause.
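For illustration, the filtered form would fan out for just the one table (using the same hypothetical vtable as above):

```sql
-- With the filter plumbed through, only table 53's stats are computed,
-- rather than one fan-out per table in the cluster.
SELECT approx_disk_bytes
FROM crdb_internal.table_disk_usage
WHERE table_id = 53;
```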
@vivekmenezes do you mind seeing if anyone would have time to do this? It would be nice, since it is a customer request.
https://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE
The entire page is a list of the various admin related functions that we could support. I'm in favor of us figuring out what we want to support and building a project around it rather than doing a one off feature request.
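For reference, a few of the size functions from that page (these are the Postgres built-ins; CockroachDB did not implement them at the time of this discussion):

```sql
-- Postgres disk-size built-ins (not CockroachDB functions).
SELECT pg_table_size('users');           -- one table, excluding indexes
SELECT pg_total_relation_size('users');  -- table plus indexes and TOAST
SELECT pg_database_size('mydb');         -- an entire database
```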
@vivekmenezes I think we should leverage the flexibility we have in `crdb_internal` to iterate quickly, and stabilize things once they've been in for a while (moving them out of `crdb_internal` or coming up with some other scheme). It's hard to communicate to a user that we're not going to provide necessary functionality because we want to polish first.
:+1: on publicizing a set of vtables that we're willing to expose in docs and to users. I would, however, use two separate namespaces -- perhaps `crdb_internal` for those we want to expose and `crdb_experimental` as a "staging area" for features we're not sure about yet.
I'd suggest migrating most of those we have in `crdb_internal` to `crdb_experimental`, with the exception of the few that users want and that we can document.
Also I think it's OK to have some redundancy between the namespaces.
@knz that sounds like a good approach. cc @jseldess so that he is aware that we may have to account for time to document this
I'd keep `crdb_internal` for the experimental/fast-moving stuff and maybe `crdb_system` for things that we want users to be able to rely on.
Just want to put a soft -1 on introducing another top-level virtual table namespace until we nail down what's going on with our schemas vs databases.
UI tools show `crdb_internal` and `system` as top-level databases and schemas that are present in all databases.
Perhaps we could use this moment to discuss the UX around adding a top-level database that contains several schemas for internal purposes. For example, the `crdb` database could contain `internal`, `system` (our current `system` db), and some other schema that represents this reliable external interface.
I agree that we should have that larger discussion -- the things we expose to users are getting pretty spread out across different interfaces. That being said, I think it's time for someone to take ownership of this and come up with a more complete proposal.
Anyone against implementing `pg_table_size(regclass)` for this issue?
+1 to that suggestion - I've seen several PG admin UI tools that use that builtin.
The API call we have right now gives the approximate total replicated size of a table, and it needs to fan out across the cluster and also scan from the meta ranges. Hooking this up to `pg_table_size` is possibly something we could regret.
I imagine a system tracking table/database statistics and putting it in one place, and pg_table_size() looking up the stats table.
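A rough sketch of that idea, with an entirely made-up table name and schema (nothing like this exists today):

```sql
-- Hypothetical: a periodically refreshed size-stats table that a cheap
-- pg_table_size() builtin could read, instead of fanning out on every call.
CREATE TABLE system.table_size_stats (
    table_id     INT8 PRIMARY KEY,
    approx_bytes INT8 NOT NULL,
    updated_at   TIMESTAMPTZ NOT NULL
);
```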
It seems like this project is expanding a bit in scope. How long do you think this would take to implement, and is it still reasonable to get this done by 2.0?
I doubt anyone's going to get to it anytime soon
Let's do this in 2.1 then. Moving milestone.
https://forum.cockroachlabs.com/t/index-size-on-disk/1519 is somewhat related.
When evaluating CRDB (and comparing it with other databases), as well as getting a handle on primary key choices to best suit one's use cases, it's really useful to know how growth in table rows affects the growth of indexes.
We have a request on Gitter from @sosedoff, who is developing pgweb.
I need a programmatic way to determine disk usage of each database as well. From what I can see this is still not resolved. I was originally looking for `TABLESPACE`, but with the distributed nature I see why that would be difficult to manage.
Any ETA on the DB size issue?
Is there an alternate way to estimate size from SQL?
^ @awoods187
Zendesk ticket #5125 has been linked to this issue.
@tbg isn't this possible (but slow) with something like the following?

```sql
select sum(range_size) from crdb_internal.ranges where database_name = 'foo' and table_name = 'bar';
```
@jordanlewis the query above does not provide any output on my v19.2.5 clusters.
What version were you testing this with?
On 19.2.5, `defaultdb` is reported as 321.0 KiB by the Admin UI, while the query below returns 731 KiB. `range_size` is `(crdb_internal.range_stats(start_key)->>'key_bytes')::INT8 + (crdb_internal.range_stats(start_key)->>'val_bytes')::INT8 AS range_size`. What is the difference in the data captured between the Admin UI and `range_size`?

```sql
> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'defaultdb';
  ?column?
+----------+
   731.563
```
From one of my production clusters:

```
user@host:26257/defaultdb> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'defaultdb';
  ?column?
+----------+
   NULL
(1 row)

Time: 31.687465ms
```
Version:

```
# cockroach version
Build Tag:    v19.2.5
Build Time:   2020/03/16 18:27:12
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.12.12
C Compiler:   gcc 6.3.0
Build SHA-1:  4f36d0c62435596ca103454e113ebe8e55f005de
Build Type:   release
```
@charl it could show `NULL` if there are no tables in `defaultdb`:

```
> show tables from defaultdb;
  table_name
+------------+
(0 rows)
```
@robert-s-lee, gotcha.

```
user@host:26257/oneconfig> select sum(range_size)/1000 from crdb_internal.ranges where database_name = 'oneconfig';
   ?column?
+-------------+
  3697358.846
(1 row)

Time: 439.516065ms
```
Note that anything based off `crdb_internal.ranges` performs a cluster-wide query and is thus relatively expensive to compute, both in time and in network resources.
Also, anything based off `crdb_internal` accumulates all the rows in RAM before returning anything to the client, so if you have many ranges you can run out of RAM.
So all in all, the solutions discussed so far are band-aids and no substitute for a properly designed feature able to report table sizes without such a large expense in resources.
Hi there, any chance the PR will be merged?
As of https://github.com/cockroachdb/cockroach/pull/20627, the admin UI computes the approximate on-disk size of tables. However, it makes sense to expose this information programmatically via SQL. All that's needed for this is to find a place to expose it, and to populate it from `(*adminServer).TableStats`. This call isn't extremely cheap (it has fan-out and needs to scan a subset of the meta entries), but caching could be added as necessary.

Perhaps there is a connection to be made with #20323. It seems reasonable that approximate on-disk size for a table would be exposed in `SELECT .. FROM system.table_statistics` (cc @RaduBerinde).

@dianasaur323 for triage.
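Purely illustrative, the `approx_disk_size_bytes` column below is hypothetical (not part of the actual `system.table_statistics` schema), and the other column names are only an approximation of that schema:

```sql
-- Hypothetical query, if an approximate-size column were added.
SELECT "tableID", "rowCount", approx_disk_size_bytes
FROM system.table_statistics;
```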
Jira issue: CRDB-5915
Epic: CRDB-24527