borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

Add instrumentation to borg create? #2209

Closed textshell closed 2 years ago

textshell commented 7 years ago

Just a thought from an irc discussion about finding backup performance problems. If we had some option that outputs more detailed statistics about potentially performance relevant details it might be a lot easier to diagnose performance problems.

Ideas:


:moneybag: there is a bounty for this

enkore commented 7 years ago

bytes sent to remote
bytes read from remote

I actually wrote that a few days ago. #2210

=> Separate borg.output. logger + cmdline option for these?

textshell commented 7 years ago

Magic ;)

I was thinking this would be a new option similar to --summary, but with performance-focused information: still more like a report than just some logging lines. Just using a logger might be easier, though. I expect some of the data collection could have a performance impact. (Or maybe not? E.g. querying the current time on Linux is just a memory read.)

enkore commented 7 years ago

Direct tracing (via sys.setprofile) is really quite expensive. Adding ~two extra function calls to encrypt/decrypt/id_hash maybe doesn't have such a big impact, but perhaps it would be better to implement it directly (not via @decorator magic or extra methods).
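For a sense of scale, here is a small illustrative benchmark (not borg code) showing what a profiling hook costs when a small function is called in a tight loop:

```python
import sys
import time

def f(x):
    return x + 1

def work():
    # A tight loop of small function calls, the worst case for call tracing.
    s = 0
    for _ in range(50_000):
        s = f(s)
    return s

def bench():
    start = time.perf_counter()
    work()
    return time.perf_counter() - start

plain = bench()

events = [0]

def profiler(frame, event, arg):
    # Invoked for every call/return (and C call) event while tracing is active.
    events[0] += 1

sys.setprofile(profiler)
traced = bench()
sys.setprofile(None)

print(f"plain: {plain:.4f}s, traced: {traced:.4f}s, events: {events[0]}")
```

The traced run pays the profiler callback on every one of the ~100k call/return events, which is exactly why direct, hand-placed instrumentation is the cheaper path here.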

textshell commented 7 years ago

I didn't consider sys.setprofile a valid path for lightweight stats. I agree that such performance code should be kept lean and not too magic. But using an extra method would ease implementation: something like this would be fairly maintainable and easy to check for correct usage.

    performance_stats.set_section("chunking")
    ...
    performance_stats.set_section("id_hash")
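A minimal sketch of what such a performance_stats object could look like (the class and its attributes are hypothetical, following the snippet above; time between set_section() calls is attributed to the previously active section):

```python
import time

class PerformanceStats:
    """Hypothetical helper matching the set_section() usage sketched above."""

    def __init__(self):
        self.totals = {}        # section name -> accumulated seconds
        self._section = None
        self._started = None

    def set_section(self, name):
        # Attribute elapsed time to the section that was active until now,
        # then start the clock for the new section. monotonic avoids
        # wall-clock jumps from NTP adjustments.
        now = time.monotonic()
        if self._section is not None:
            self.totals[self._section] = (
                self.totals.get(self._section, 0.0) + (now - self._started))
        self._section = name
        self._started = now

    def finish(self):
        # Close out the currently active section.
        self.set_section(None)
```

With this, each section transition costs just one monotonic-clock read, which fits the "lean, not too magic" constraint discussed above.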

textshell commented 7 years ago

More synthetic tests are covered by #1788, I think.

enkore commented 7 years ago
RemoteRepository: 139 bytes sent, 46427 bytes received, 5 messages sent
Resource Usage (relative): 5% CPU, 0% input, 94% network
Resource Usage (absolute): 0.18 CPU, 0.0 input, 3.2 network, 3.5 wall clock

Inspired by a screenshot I saw of Veeam that tells you similar numbers and even has a "Bottleneck" item in its summary.

(However, the "network" part includes the startup time of the remote server -- so me entering an SSH password as well -- but that's easy to remove.)

The "input" isn't populated yet. For the remote code it's very simple, just a few spots, but "input" happens in very many places, so I suspect even adding ~two function calls per place might impact performance. Perhaps do it rather coarse-grained, via the existing backup_io blocks.
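One way to do that coarse-grained timing would be a context manager wrapped around the existing backup_io-style blocks rather than around each low-level call. A sketch (names are assumptions, not borg's actual API):

```python
import time
from contextlib import contextmanager

# Accumulated seconds per category, e.g. "input" for filesystem reads.
timings = {"input": 0.0}

@contextmanager
def timed(category):
    # One clock read on entry and one on exit per block, so the overhead
    # stays constant no matter how much I/O happens inside the block.
    start = time.monotonic()
    try:
        yield
    finally:
        timings[category] = timings.get(category, 0.0) + (time.monotonic() - start)
```

Usage would look like `with timed("input"): data = fd.read(chunk_size)`, paying the timing cost once per block instead of once per call.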

Resource Usage (relative): 33% CPU, 68% input, 9% repository, 0% cache sync
Resource Usage (absolute): 20.87s CPU, 42.24s input, 5.92s repository, 62.37s wall clock, 0.00s cache sync
Resource Usage: bottleneck: input

(Actual "borg create"... CPU is inclusive, the others aren't -- it's from getrusage. So if cache sync runs -> 99% cache sync, but also 99% CPU.)

Resource Usage (relative): 34% CPU, 53% input, 32% repository, 0% cache sync
Resource Usage (absolute): 6775.28s CPU, 10520.41s input, 6427.79s repository, 19934.58s wall clock, 0.00s cache sync
Resource Usage: bottleneck: input
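The relative lines in the samples above are straightforward to derive from the absolute timings. A formatting-only sketch (category names taken from the sample output; picking the largest category as the "bottleneck" is an assumed heuristic):

```python
def format_usage(wall_clock, **categories):
    """Render 'Resource Usage' style lines from absolute seconds.

    Percentages are each category relative to wall-clock time (they need
    not sum to 100, since e.g. CPU overlaps the other categories); the
    bottleneck is simply the category with the most time.
    """
    rel = {name: 100.0 * secs / wall_clock for name, secs in categories.items()}
    bottleneck = max(categories, key=categories.get)
    relative = ", ".join(f"{pct:.0f}% {name}" for name, pct in rel.items())
    absolute = ", ".join(f"{secs:.2f}s {name}" for name, secs in categories.items())
    return (f"Resource Usage (relative): {relative}\n"
            f"Resource Usage (absolute): {absolute}, {wall_clock:.2f}s wall clock\n"
            f"Resource Usage: bottleneck: {bottleneck}")
```

Feeding in the absolute numbers from the first "borg create" sample reproduces its relative line.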
textshell commented 7 years ago

Rename "Resource Usage" to "Time used in" or something like that? Currently I think this sounds like some kind of magic.

enkore commented 7 years ago

Yeah, it's just the class name for now (the diff is messy, so no PR yet). Naming is hard :)

fadedreamz commented 3 years ago

@ThomasWaldmann I would like to give it a shot, can you kindly assign it to me?

ThomasWaldmann commented 3 years ago

@fadedreamz you're assigned (you need to "start working on this" on bountysource yourself).

After you had a look at the relevant code, maybe update the ticket with your plans.

fadedreamz commented 3 years ago

@ThomasWaldmann thanks, will do. Is there any other documentation apart from the docs folder to understand the high-level design?

ThomasWaldmann commented 3 years ago

There might be some more insights distributed over the issues here and on the mailing list, but the docs, esp. the internals docs, are maybe the best read.

If you're searching for a starting point: it's in src/borg/archiver.py, e.g. do_create implements borg create.

fadedreamz commented 3 years ago

@ThomasWaldmann would like to confirm my findings with you.

1. The process_file function has all the status for a file that is being processed.

2. The chunk_processor function can be instrumented for the following -

3. For the following, I still haven't explored them (but maybe they are in RemoteRepository)

4. No comment yet on

5. I haven't looked into process_dir/pipe/dev/dir_with_fd/fifo/symlink in FilesystemObjectProcessors. Should they also be considered for statistics?

Let me know your thoughts/comments on the findings. TIA

ThomasWaldmann commented 3 years ago

1. Yes, but maybe increase the respective counter at the same place where these status chars are set, rather than deriving it from the status char.

2. Sounds right.

3./4. Remote stuff is in RemoteRepository, yes. Crypto and compression are in one nested call, IIRC (and not in remote.py).

5. Maybe stats about file types are not too interesting. Usually most stuff will be regular files and dirs, and if people are backing up block devices, they know it already.

In general it would be good to add as little code as possible for this (some code parts are already rather complex and we should try not to make readability even worse). We already have stats - can we add some attributes in there?

Please also check the comments in this ticket, some work has already been done / some ideas were evaluated.

ThomasWaldmann commented 3 years ago

For a start, maybe just do the stuff that uses simple counters.

francoo98 commented 2 years ago

Hello. I've been thinking about how to implement this so I forked the repository and pushed three short commits.

In the first one, I added a dictionary to fso.stats and added a line to FileSystemObjectProcessor.process_file to update the dictionary once a file has been processed. This dictionary has keys corresponding to the file statuses A, U and M, and its values are counters.

For the chunking, I added a variable to FileSystemObjectProcessor.stats called chunking_time to accumulate the time spent in chunking.

Finally, I measure the time spent in hashing by calling the time.time method before and after the id_hash method.

At the moment, the printing of these new variables is done at the end of the Archive.create_inner method, but I'm thinking about extending Statistics.__add__ to make it aggregate these new variables too.

Now I am working on counting the bytes sent to and received from the remote. For this, I think it is enough to use the tx_bytes and rx_bytes attributes of RemoteRepository.
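Sketched roughly, the changes described above might look like this (illustrative only, not the actual PR code; the attribute names follow the comment and the status-char meanings are assumptions):

```python
import time
from collections import Counter

class Statistics:
    """Sketch of the stats additions described above."""

    def __init__(self):
        # File status counters: 'A' added, 'M' modified, 'U' unchanged.
        self.files_stats = Counter()
        self.chunking_time = 0.0
        self.hashing_time = 0.0

    def __add__(self, other):
        # Aggregating two stats objects must also merge the new fields.
        result = Statistics()
        result.files_stats = self.files_stats + other.files_stats
        result.chunking_time = self.chunking_time + other.chunking_time
        result.hashing_time = self.hashing_time + other.hashing_time
        return result

def timed_id_hash(stats, id_hash, data):
    # Time spent hashing, measured exactly as described: a clock read
    # before and after the id_hash call.
    start = time.time()
    result = id_hash(data)
    stats.hashing_time += time.time() - start
    return result
```

Keeping everything on the existing Statistics object (rather than a new collector class) matches the earlier request to add as little code as possible.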

ThomasWaldmann commented 2 years ago

@francoo98 sounds good, but code review is easier if you just open a pull request against master branch.

francoo98 commented 2 years ago

> @francoo98 sounds good, but code review is easier if you just open a pull request against master branch.

Ok, I opened a PR. Can I get this assigned?

ThomasWaldmann commented 2 years ago

Fixed by #6991. Thanks to @francoo98!

ThomasWaldmann commented 2 years ago

@francoo98 looks like bountysource has technical troubles (again). :-(

I clicked on "accept", but the site does not register my acceptance.

But there is also a timed mechanism, so if no one rejects, it will pay out the bounty to you. Alternatively, we could contact bountysource support if you do not want to wait for that.

francoo98 commented 2 years ago

I'd like to contact support; otherwise it will take 2 weeks for the website to pay me.

ThomasWaldmann commented 2 years ago

@francoo98 ok, then just write an email to their support and point them to the bounty and to this github page.

Maybe they could finally fix the issue that one can't accept bounties; that has happened before.

ThomasWaldmann commented 2 years ago

@francoo98 the chunking_time test sometimes fails on github CI.

ThomasWaldmann commented 2 years ago

@hexagonrecursion can you check whether you can accept, so @francoo98 gets this bounty?

francoo98 commented 2 years ago

I can't find the failing workflow. Can you link it?

ThomasWaldmann commented 2 years ago

It was the chunking_time > 0.0 test.

francoo98 commented 2 years ago

I understand, but I don't see it failing anywhere. I checked github actions and everything seems fine, and my pull request also says everything is fine.

ThomasWaldmann commented 2 years ago

Well, I've seen it, so please fix.

ThomasWaldmann commented 2 years ago

@francoo98 bounty "paid out", thanks to @bountysource support! \o/

ThomasWaldmann commented 2 years ago

@bountysource btw, I found out why I could not accept the bounty:

So it seems that your site has compatibility issues with Safari.