borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Add instrumentation to borg create? #2209

Closed textshell closed 2 years ago

textshell commented 7 years ago

Just a thought from an irc discussion about finding backup performance problems. If we had some option that outputs more detailed statistics about potentially performance relevant details it might be a lot easier to diagnose performance problems.

Ideas:


:moneybag: there is a bounty for this

enkore commented 7 years ago

bytes sent to remote
bytes read from remote

I actually wrote that a few days ago. #2210

=> Separate borg.output. logger + cmdline option for these?

textshell commented 7 years ago

Magic ;)

I was thinking this would be a new option similar to --summary, but with performance-focused information: still more like a report than just some logging lines. Just using a logger might be easier, though. I expect some of the data collection could have a performance impact. (Or maybe not? E.g. querying the current time on Linux is just a memory read.)

enkore commented 7 years ago

Direct tracing (via sys.setprofile) is really quite expensive. Adding ~two extra function calls to encrypt/decrypt/id_hash maybe doesn't have such a big impact, but perhaps it would be better to implement it directly (not via @decorator magic or extra methods).
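For a sense of scale, here is a small illustrative benchmark (not borg code) showing what a profiling hook costs when a small function is called in a tight loop:

```python
import sys
import time

def f(x):
    return x + 1

def work():
    # A tight loop of small function calls, the worst case for call tracing.
    s = 0
    for _ in range(50_000):
        s = f(s)
    return s

def bench():
    start = time.perf_counter()
    work()
    return time.perf_counter() - start

plain = bench()

events = [0]

def profiler(frame, event, arg):
    # Invoked for every call/return (and C call) event while tracing is active.
    events[0] += 1

sys.setprofile(profiler)
traced = bench()
sys.setprofile(None)

print(f"plain: {plain:.4f}s, traced: {traced:.4f}s, events: {events[0]}")
```

The traced run pays the profiler callback on every one of the ~100k call/return events, which is exactly why direct, hand-placed instrumentation is the cheaper path here.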

textshell commented 7 years ago

I didn't consider sys.setprofile a valid path for lightweight stats. I agree that such performance code should be kept lean and not too magic. But using an extra method would ease implementation: something like this would be fairly maintainable and easy to check for correct usage.

    performance_stats.set_section("chunking")
    ...
    performance_stats.set_section("id_hash")
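A minimal sketch of what such a performance_stats object could look like (the class and its attributes are hypothetical, following the snippet above; time between set_section() calls is attributed to the previously active section):

```python
import time

class PerformanceStats:
    """Hypothetical helper matching the set_section() usage sketched above."""

    def __init__(self):
        self.totals = {}        # section name -> accumulated seconds
        self._section = None
        self._started = None

    def set_section(self, name):
        # Attribute elapsed time to the section that was active until now,
        # then start the clock for the new section. monotonic avoids
        # wall-clock jumps from NTP adjustments.
        now = time.monotonic()
        if self._section is not None:
            self.totals[self._section] = (
                self.totals.get(self._section, 0.0) + (now - self._started))
        self._section = name
        self._started = now

    def finish(self):
        # Close out the currently active section.
        self.set_section(None)
```

With this, each section transition costs just one monotonic-clock read, which fits the "lean, not too magic" constraint discussed above.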

textshell commented 7 years ago

More synthetic tests are covered by #1788, I think.

enkore commented 7 years ago
RemoteRepository: 139 bytes sent, 46427 bytes received, 5 messages sent
Resource Usage (relative): 5% CPU, 0% input, 94% network
Resource Usage (absolute): 0.18 CPU, 0.0 input, 3.2 network, 3.5 wall clock

Inspired by a screenshot I saw of Veeam that tells you similar numbers and even has a "Bottleneck" item in its summary.

(However, the "network" part includes the startup time of the remote server -- so me entering an SSH password as well -- but that's easy to remove.)

The "input" isn't populated yet. For the remote code it's very simple, just a few spots, but "input" happens in very many places, so I suspect even adding ~two function calls per place might impact performance. Perhaps do it rather coarse-grained, via the existing backup_io blocks.
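One way to do that coarse-grained timing would be a context manager wrapped around the existing backup_io-style blocks rather than around each low-level call. A sketch (names are assumptions, not borg's actual API):

```python
import time
from contextlib import contextmanager

# Accumulated seconds per category, e.g. "input" for filesystem reads.
timings = {"input": 0.0}

@contextmanager
def timed(category):
    # One clock read on entry and one on exit per block, so the overhead
    # stays constant no matter how much I/O happens inside the block.
    start = time.monotonic()
    try:
        yield
    finally:
        timings[category] = timings.get(category, 0.0) + (time.monotonic() - start)
```

Usage would look like `with timed("input"): data = fd.read(chunk_size)`, paying the timing cost once per block instead of once per call.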

Resource Usage (relative): 33% CPU, 68% input, 9% repository, 0% cache sync
Resource Usage (absolute): 20.87s CPU, 42.24s input, 5.92s repository, 62.37s wall clock, 0.00s cache sync
Resource Usage: bottleneck: input

(Actual "borg create"... CPU is inclusive, the others aren't -- it's from getrusage. So if cache sync runs -> 99% cache sync, but also 99% CPU.)

Resource Usage (relative): 34% CPU, 53% input, 32% repository, 0% cache sync
Resource Usage (absolute): 6775.28s CPU, 10520.41s input, 6427.79s repository, 19934.58s wall clock, 0.00s cache sync
Resource Usage: bottleneck: input
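The relative lines in the samples above are straightforward to derive from the absolute timings. A formatting-only sketch (category names taken from the sample output; picking the largest category as the "bottleneck" is an assumed heuristic):

```python
def format_usage(wall_clock, **categories):
    """Render 'Resource Usage' style lines from absolute seconds.

    Percentages are each category relative to wall-clock time (they need
    not sum to 100, since e.g. CPU overlaps the other categories); the
    bottleneck is simply the category with the most time.
    """
    rel = {name: 100.0 * secs / wall_clock for name, secs in categories.items()}
    bottleneck = max(categories, key=categories.get)
    relative = ", ".join(f"{pct:.0f}% {name}" for name, pct in rel.items())
    absolute = ", ".join(f"{secs:.2f}s {name}" for name, secs in categories.items())
    return (f"Resource Usage (relative): {relative}\n"
            f"Resource Usage (absolute): {absolute}, {wall_clock:.2f}s wall clock\n"
            f"Resource Usage: bottleneck: {bottleneck}")
```

Feeding in the absolute numbers from the first "borg create" sample reproduces its relative line.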
textshell commented 7 years ago

Rename "Resource Usage" to "Time used in" or something like that? Currently I think this sounds like some kind of magic.

enkore commented 7 years ago

Yeah, it's just the class name for now (the diff is messy, so no PR yet). Naming is hard :)

fadedreamz commented 3 years ago

@ThomasWaldmann I would like to give it a shot, can you kindly assign it to me?

ThomasWaldmann commented 3 years ago

@fadedreamz you're assigned (you need to "start working on this" on bountysource yourself).

After you had a look at the relevant code, maybe update the ticket with your plans.

fadedreamz commented 3 years ago

@ThomasWaldmann thanks, will do. Is there any other documentation apart from the docs folder to understand the high-level design?

ThomasWaldmann commented 3 years ago

There might be some more insights distributed over the issues here and on the mailing list, but the docs, esp. the internals docs, are maybe the best read.

If you're searching for a starting point: it's in src/borg/archiver.py, e.g. do_create implements borg create.

fadedreamz commented 3 years ago

@ThomasWaldmann would like to confirm my findings with you.

1. The process_file function has all the status for a file that is being processed.

2. The chunk_processor function can be instrumented for the following -

3. For the following, I still haven't explored them (but maybe they are in RemoteRepository)

4. No comment yet on

5. I haven't looked into process_dir/pipe/dev/dir_with_fd/fifo/symlink in FilesystemObjectProcessors. Should they also be considered for statistics?

Let me know your thoughts/comments on the findings. TIA

ThomasWaldmann commented 3 years ago

1. Yes, but maybe increase the respective counter at the same place where these status chars are set, rather than deriving it from the status char.

2. Sounds right.

3./4. Remote stuff is in RemoteRepository, yes. Crypto and compression are in one nested call, IIRC (and not in remote.py).

5. Maybe stats about file types are not too interesting. Usually most stuff will be regular files and dirs, and if people are backing up block devices, they know it already.

In general it would be good to add as little code as possible for this (some code parts are already rather complex and we should try not to make readability even worse). We already have stats - can we add some attributes in there?

Please also check the comments in this ticket, some work has already been done / some ideas were evaluated.

ThomasWaldmann commented 3 years ago

For a start, maybe just do the stuff that uses simple counters.

francoo98 commented 2 years ago

Hello. I've been thinking about how to implement this so I forked the repository and pushed three short commits.

In the first one, I added a dictionary to fso.stats and added a line to FileSystemObjectProcessor.process_file to update the dictionary once a file has been processed. This dictionary has keys corresponding to the file statuses A, U and M, and its values are counters.

For the chunking, I added a variable to FileSystemObjectProcessor.stats called chunking_time to accumulate the time spent in chunking.

Finally, I measure the time spent in hashing by calling the time.time method before and after the id_hash method.

At the moment, the printing of these new variables is done at the end of the Archive.create_inner method, but I'm thinking about extending Statistics.__add__ to make it aggregate these new variables too.

Now I am working on counting the bytes sent to and received from the remote. For this, I think it is enough to use the tx_bytes and rx_bytes attributes of RemoteRepository.
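Sketched roughly, the changes described above might look like this (illustrative only, not the actual PR code; the attribute names follow the comment and the status-char meanings are assumptions):

```python
import time
from collections import Counter

class Statistics:
    """Sketch of the stats additions described above."""

    def __init__(self):
        # File status counters: 'A' added, 'M' modified, 'U' unchanged.
        self.files_stats = Counter()
        self.chunking_time = 0.0
        self.hashing_time = 0.0

    def __add__(self, other):
        # Aggregating two stats objects must also merge the new fields.
        result = Statistics()
        result.files_stats = self.files_stats + other.files_stats
        result.chunking_time = self.chunking_time + other.chunking_time
        result.hashing_time = self.hashing_time + other.hashing_time
        return result

def timed_id_hash(stats, id_hash, data):
    # Time spent hashing, measured exactly as described: a clock read
    # before and after the id_hash call.
    start = time.time()
    result = id_hash(data)
    stats.hashing_time += time.time() - start
    return result
```

Keeping everything on the existing Statistics object (rather than a new collector class) matches the earlier request to add as little code as possible.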

ThomasWaldmann commented 2 years ago

@francoo98 sounds good, but code review is easier if you just open a pull request against master branch.

francoo98 commented 2 years ago

> @francoo98 sounds good, but code review is easier if you just open a pull request against master branch.

Ok, I opened a PR. Can I get this assigned?

ThomasWaldmann commented 2 years ago

Fixed by #6991. Thanks to @francoo98!

ThomasWaldmann commented 2 years ago

@francoo98 looks like bountysource has technical troubles (again). :-(

I clicked on "accept", but the site does not register my acceptance.

But there is also a timed mechanism, so if no one rejects, it will pay out the bounty to you. Alternatively, we could contact bountysource support if you do not want to wait for that.

francoo98 commented 2 years ago

I'd like to contact support; otherwise it will take 2 weeks for the website to pay me.

ThomasWaldmann commented 2 years ago

@francoo98 ok, then just write an email to their support and point them to the bounty and to this github page.

Maybe they could finally fix the issue that one can't accept bounties; that has happened before.

ThomasWaldmann commented 2 years ago

@francoo98 the chunking_time test sometimes fails on github CI.

ThomasWaldmann commented 2 years ago

@hexagonrecursion can you check whether you can accept, so @francoo98 gets this bounty?

francoo98 commented 2 years ago

I can't find the failing workflow. Can you link it?

ThomasWaldmann commented 2 years ago

It was the chunking_time > 0.0 test.

francoo98 commented 2 years ago

I understand, but I don't see it failing anywhere. I checked github actions and everything seems fine, and my pull request also says everything is fine.

ThomasWaldmann commented 2 years ago

Well, I've seen it, so please fix.

ThomasWaldmann commented 2 years ago

@francoo98 bounty "paid out", thanks to @bountysource support! \o/

ThomasWaldmann commented 2 years ago

@bountysource btw, I found out why I could not accept the bounty:

So it seems that your site has compatibility issues with Safari.