Open galli-leo opened 3 years ago
@galli-leo tag andreafioraldi once you are confident the PR is done and ready :)
@galli-leo hey great to see your effort here! that gives me hope that this module is still useful.
@vanhauser-thc Yes definitely. For qemu I am getting a speedup of up to 500x, which makes even intermittent kernel panics worth it 😅. Although I haven't found any more of those, I have encountered some stalls with one of my patches. Currently trying to diagnose these. It also seems that some pages might not get properly restored, which I am investigating as well (test3.c reproduces this).
@galli-leo Are you still planning to work on this? I am working on a PR that fully implements the snapshotting of file descriptors and I have based it off this one, since it fixes a lot of build issues. I wanted to know if it will get merged soon or if I should just submit mine as is.
@galli-leo Are you still planning to work on this? I am working on a PR that fully implements the snapshotting of file descriptors and I have based it off this one, since it fixes a lot of build issues. I wanted to know if it will get merged soon or if I should just submit mine as is.
If you need it and @galli-leo stopped working on it, I can merge as is. The module is broken anyway so even is a bit less broken with this PR is better.
No no, there are a few things that are unfinished, from what I can see. If he really stopped working on this, I will just pick up from where he left and submit a new draft PR.
@galli-leo I am not completely sure what were your issues since you did not discuss them extensively. However, I think your problems were due to the snapshot_page::in_dirty_list
member not being initialized in add_snapshot_page
. Doing that resolves all the warnings.
No no, there are a few things that are unfinished, from what I can see. If he really stopped working on this, I will just pick up from where he left and submit a new draft PR.
Hey sorry for not responding earlier. I don't really have time to finish it at the moment (exams and other stuff), so feel free to pick up from here. The major issue I currently still have, is that the first page is not snapshot correctly. make test
should show that immediately (the binary does some stuff and then one page is not snapshot correctly). See test3.c
for what it does exactly.
Opening this PR as a work in progress, so that we can discuss some aspects. Specifically, I would like to see if my performance improvements also help in the other test cases. How were those run?
I have tried to squeeze more performance out of it, but at the moment it seems like there is not any obvious slow parts.
I also for now removed the snapshotting of file descriptors, since I at least don't need that.
Fixed Issues
20
15
19
14
Bug Fixes
get_cpu_var / put_cpu_var
At least for my kernel, get_cpu_var / put_cpu_var need to match up exactly. Meaning, if you have two get_cpu_var, you need to have two put_cpu_var.
blocklist / allowlist memory leak
At least AFAICT those lists where never freed.
clean_memory_snapshot: use after free of snapshot_page
Freeing the snapshot_page while iterating over the hash table means that we use the free'd element when iterating. This caused crashes for me. It's now fixed :)
There's probably more I am forgetting.
Performance Improvements
keep list of dirty pages
By keeping a list of dirty pages, we don't need to iterate over all pages when restoring a snapshot. For me this had >50% performance improvement, might be more depending on the target.
There's probably also more I am forgetting.
Maintainability improvements
remove need for patching syscall table
Using
ftrace_helper.h
, we can hook syscalls easily.switch to ftrace hooking
TODO here, we should be able to fully use
kallsyms_lookup_name
for the unexported symbols and hence no need forlookup_symbols.py
.