dimitri / pgcopydb

Copy a Postgres database to a target Postgres server (pg_dump | pg_restore on steroids)

Add sizes to Totals output table #258

Open aborschev opened 1 year ago

aborschev commented 1 year ago

It would be nice to add size information to the "Totals" table shown below.

This is a low-priority wish, just for convenience.

                                          Step   Connection    Duration   Concurrency
 ---------------------------------------------   ----------  ----------  ------------
                                   Dump Schema       source       6s045             1
                                Prepare Schema       target       2s292             1
 COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)         both       3h28m         4 + 4
                             COPY (cumulative)         both      12h01m             4
                    Large Objects (cumulative)         both        24ms             1
        CREATE INDEX, CONSTRAINTS (cumulative)       target      21h36m             4
                               Finalize Schema       target       1h51m             1
 ---------------------------------------------   ----------  ----------  ------------
                     Total Wall Clock Duration         both       5h19m         4 + 4
 ---------------------------------------------   ----------  ----------  ------------
dimitri commented 1 year ago

That's a good comment, thanks. Most of the data mentioned is not available yet in pgcopydb though, and requires extra setup and computation to fetch.

It's not trivial to guess the exact impact of computing all those operations. As the timings are already captured, I could also add a command that reports a summary of the average index build times (and rates), and I suspect that adding buffer-size capture and computation to our COPY code would not slow it down too much.
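Purely as an illustration of what such a summary might compute, here is a sketch in Python over hypothetical per-index timing records (the names, durations, and sizes below are made up; only the idea of averaging captured build times and deriving a rate comes from the comment above):

```python
# Hypothetical per-index records captured during a run:
# (index name, build duration in seconds, index size in bytes on the target,
#  e.g. as returned by pg_relation_size()).
index_timings = [
    ("orders_pkey",          512.0, 3_200_000_000),
    ("orders_customer_idx",  840.0, 4_100_000_000),
    ("lines_order_idx",     1290.0, 6_700_000_000),
]

total_seconds = sum(seconds for _, seconds, _ in index_timings)
total_bytes = sum(size for _, _, size in index_timings)

print(f"indexes built      : {len(index_timings)}")
print(f"average build time : {total_seconds / len(index_timings):.0f}s")
print(f"average build rate : {total_bytes / total_seconds / 1024**2:.1f} MiB/s")
```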

Contributions are welcome, with an eye to doing the easy bits (in terms of performance impact) first!

aborschev commented 1 year ago

Hmm, your proposal of counting bytes during copy-in/copy-out is precise, but it looks too complicated. I meant a much simpler way: accumulate pg_relation_size() for each copied table, divide by the time (the cumulative COPY metric), and get an average data transfer rate. One figure for the whole copying session would be enough here. Same logic with indexes.

P.S. The practical question behind this: if we have, for example, a 1 Gbit network, what percentage of the bandwidth have we utilized?
Do we need to increase or decrease the number of copy jobs to utilize the bandwidth while avoiding contention?
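A minimal sketch of that back-of-the-envelope calculation, assuming the 12h01m cumulative COPY duration from the summary table above, and summing pg_relation_size() over the ordinary user tables on the source as a stand-in for "each copied table"; the connection string is a placeholder:

```python
import psycopg2

SOURCE_DSN = "host=source dbname=app"          # placeholder connection string
COPY_CUMULATIVE_SECONDS = 12 * 3600 + 1 * 60   # 12h01m from the summary table

with psycopg2.connect(SOURCE_DSN) as conn, conn.cursor() as cur:
    # Sum the on-disk size of every ordinary table outside system schemas,
    # as an approximation of "each copied table".
    cur.execute("""
        SELECT coalesce(sum(pg_relation_size(c.oid)), 0)
          FROM pg_class c
          JOIN pg_namespace n ON n.oid = c.relnamespace
         WHERE c.relkind = 'r'
           AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    """)
    total_bytes = cur.fetchone()[0]

mbit_per_sec = total_bytes * 8 / COPY_CUMULATIVE_SECONDS / 1e6
print(f"average transfer rate : {mbit_per_sec:.0f} Mbit/s")
print(f"1 Gbit/s link usage   : {mbit_per_sec / 1000:.1%} (rough estimate)")
```

As the next comment points out, the on-disk size is not what actually travels over the wire, so the resulting percentage is a rough estimate at best.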

dimitri commented 1 year ago

But it does not work that way. The number of bytes sent over the network is not the same as pg_relation_size() on the source, because of bloat (tuples on disk that are not visible to our transaction). And the size sent over the network also includes network protocol metadata, whereas pg_relation_size() on the target includes storage metadata of the target Postgres version. So the transfer rate can NOT be computed from the on-disk size of the relation, which is going to be different on source and target even with the exact same data set.
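To see the bloat point concretely, here is a small sketch against a throwaway database (the DSN is a placeholder): after a DELETE, the dead tuples still count toward pg_relation_size(), even though a fresh COPY would only transfer the rows that remain visible.

```python
import psycopg2

with psycopg2.connect("dbname=scratch") as conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS bloat_demo")
    cur.execute("""
        CREATE TABLE bloat_demo AS
        SELECT g AS id, md5(g::text) AS payload
          FROM generate_series(1, 100000) AS g
    """)
    cur.execute("SELECT pg_relation_size('bloat_demo')")
    size_before = cur.fetchone()[0]

    # Delete 90% of the rows: the dead tuples stay on disk until VACUUM
    # reclaims them, so the relation size does not change.
    cur.execute("DELETE FROM bloat_demo WHERE id > 10000")
    cur.execute("SELECT pg_relation_size('bloat_demo')")
    size_after = cur.fetchone()[0]

print(f"size before DELETE : {size_before} bytes")
print(f"size after DELETE  : {size_after} bytes (unchanged, yet only 10% of rows remain visible)")
```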

Then on top of that, because of the concurrency aspects of pgcopydb, if your question is one of bandwidth capacity, it becomes very hard to answer it with numbers computed by individual processes. You'd be better off with OS-level tooling here, such as dstat for instance.
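For what it's worth, here is a tiny Python sketch of that kind of OS-level measurement on Linux, sampling the byte counters in /proc/net/dev (the interface name is a placeholder; dstat or sar report the same numbers with no code needed):

```python
import time

IFACE = "eth0"   # placeholder: the interface carrying the pgcopydb traffic

def rx_tx_bytes(iface):
    # In /proc/net/dev, the fields after "iface:" start with received bytes;
    # transmitted bytes is the 9th field.
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

rx0, tx0 = rx_tx_bytes(IFACE)
time.sleep(5)
rx1, tx1 = rx_tx_bytes(IFACE)

mbit_per_sec = ((rx1 - rx0) + (tx1 - tx0)) * 8 / 5 / 1e6
print(f"{IFACE}: {mbit_per_sec:.0f} Mbit/s over the last 5 seconds")
```

Sampling while the COPY phase is running shows directly how close the concurrent COPY processes come to saturating the link.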

hanefi commented 11 months ago

#503 contains a small portion of the changes proposed here.