paulkermann opened 1 year ago
It sounds very useful and easy to implement, but we do not have the manpower right now to implement it as a priority feature. Do you know enough about cle to implement it by yourself? We can offer help as needed!
Sadly I don't know much about cle. I am willing to try though.
I tried using `add_backer` in a new `Backend` class (which I also registered with `register_backend`) to add a class that inherits from `Clemory`, but that did not really work for me. Do you perhaps have any hints or tips as to what I need to implement to make this work?
Maybe you can already do what you want! It seems to me that `Clemory.add_backer()` supports an `mmap`-type `data` parameter. Can you `mmap` the large file to memory and then add it as a backer to the existing `Clemory` in the loader?
My process address space is not large enough; the dump is really big. I need to be able to "hook" the read functionality so that only needed memory is read. mmapping sadly does not solve my problem.
What file format are you doing this for?
Currently this is a minidump. However I want that to be abstracted away so I could use it from live process memory.
The perfect abstraction for me is a `read` function that I can implement however I want.
So it looks from my incredibly brief interrogation that the library we use to parse minidump files does in fact support the interface that you're looking for. What you will need to do is to create a `LazyMinidumpSegment` class which implements the `Clemory` interface, and then in the loop in `cle/backends/minidump/__init__.py:61`, instead of saying `data = segment.read(...)` and `self.memory.add_backer(..., data)`, you should say `lazy_segment = LazyMinidumpSegment(segment, self._mdf.file_handle)` and `self.memory.add_backer(..., lazy_segment)`. You may also want to hook an LRU cache of some sort into the loop so that you can re-lazify these segments as you experience memory pressure.

Be warned that because you are storing a file descriptor, this will leak file descriptor references (cle is entirely designed to have zero file descriptors open after loading is done; there used to be a `close` method but we made it obsolete) and will not be thread safe.
In terms of live process memory, you probably want to look into symbion. I'm not a huge fan of its design, but there are a lot of people using it.
I don't want to specify segments beforehand. I want one large segment that starts at address 0 with a size of 0xffffffffffffffff, where every read lets me run custom code. It needs to be a new `Backend`.
oh. uh, I guess that's technically something we can do, though it will mess a huge amount of the static analysis up which assumes that it can enumerate a list of mapped addresses. Let me put something together for you.
@rhelmot that will be incredible, thank you so much!
I don't need static analysis; I just need the angr state to be able to use `call_state` and run code from input I specify. Thanks in advance :)
Take a look at this! https://github.com/angr/cle/compare/feat/lazy
I have not tested it even in making sure it imports, but it should be the right framework for what you want to do.
I'm not really sure how to use this. I understand I need to create a class that implements the `LazyBackend` thingy. I have tried doing the thing below:

```python
class my_lazy(cle.backends.lazy.LazyBackend):
    def __init__(self, *args, **kwargs):
        super().__init__("", archinfo.arch_amd64.ArchAMD64(), 0, **kwargs)

    def _load_data(self, addr, size):
        return b"\x01" * size

register_backend("lazy", my_lazy)
```
Also, it looks like at https://github.com/angr/cle/blob/a77bcdc9eaca684826cd8792d980c11ba42220ae/cle/memory.py#L354 it should be `isinstance(backer, Clemory)` and not `type(backer) is Clemory`.
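The distinction matters because an exact-type check rejects subclasses, which is precisely what a user-defined lazy backer would be. A minimal illustration with stand-in classes (not cle's real ones):

```python
class Clemory:                # stand-in for cle's Clemory
    pass

class LazyClemory(Clemory):   # a user subclass, like the lazy backer here
    pass

backer = LazyClemory()

# The exact-type check fails for subclasses...
assert (type(backer) is Clemory) is False
# ...while isinstance accepts them, which is what add_backer needs
# if subclassed backers are supposed to work.
assert isinstance(backer, Clemory)
```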
When I use this code

```python
stream = BytesIO(b"\x55" * 0x2000)
proj = angr.Project(stream, main_opts={"backend": "lazy"})
state = proj.factory.call_state(0x1337)
print(state.memory.load(0x1337, 10))
```

it does not work (I would expect the load to return a concrete value). It looks like even the `\x55` bytes are not returned.
Logs:

```
INFO    | 2022-09-22 11:05:13,863 | angr.project | Loading binary from stream
DEBUG   | 2022-09-22 11:05:13,863 | cle.loader | ... loading with <class '__main__.my_lazy'>
INFO    | 2022-09-22 11:05:13,864 | cle.loader | Linking
INFO    | 2022-09-22 11:05:13,864 | cle.loader | Mapping at 0x0
INFO    | 2022-09-22 11:05:13,864 | cle.loader | Linking cle##externs
INFO    | 2022-09-22 11:05:13,865 | cle.loader | Mapping cle##externs at 0x8000000000000000
DEBUG   | 2022-09-22 11:05:13,865 | angr.project | hooking 0x8000000000000000 with <SimProcedure CallReturn>
DEBUG   | 2022-09-22 11:05:13,865 | angr.project | hooking 0x8000000000000008 with <SimProcedure UnresolvableJumpTarget>
DEBUG   | 2022-09-22 11:05:13,865 | angr.project | hooking 0x8000000000000010 with <SimProcedure UnresolvableCallTarget>
WARNING | 2022-09-22 11:05:13,865 | angr.calling_conventions | Guessing call prototype. Please specify prototype.
DEBUG   | 2022-09-22 11:05:13,866 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0xb8, 8, Iend_LE) = <BV64 0x1337>
DEBUG   | 2022-09-22 11:05:13,867 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0xb8, 8, Iend_LE) = <BV64 0x1337>
DEBUG   | 2022-09-22 11:05:13,867 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7ffffffffff0000>
DEBUG   | 2022-09-22 11:05:13,867 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7ffffffffff0000>
DEBUG   | 2022-09-22 11:05:13,867 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7ffffffffff0000>
DEBUG   | 2022-09-22 11:05:13,867 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7fffffffffefff8>
DEBUG   | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7fffffffffefff8>
DEBUG   | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0x30, 8, Iend_LE) = <BV64 0x7fffffffffefff8>
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | The program is accessing memory with an unspecified value. This could indicate unwanted behavior.
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | angr will cope with this by generating an unconstrained symbolic variable and continuing. You can resolve this :
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | 1) setting a value to the initial state
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | 2) adding the state option ZERO_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to make unknown regions hold null
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | 3) adding the state option SYMBOL_FILL_UNCONSTRAINED_{MEMORY,REGISTERS}, to suppress these messages.
DEBUG   | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | reg.load(0xb8, 8, Iend_LE) = <BV64 0x1337>
WARNING | 2022-09-22 11:05:13,868 | angr.storage.memory_mixins.default_filler_mixin | Filling memory at 0x1337 with 10 unconstrained bytes referenced from 0x1337 (offset 0x1337 in main binary (0x13)
DEBUG   | 2022-09-22 11:05:13,868 | angr.state_plugins.solver | Creating new unconstrained BV named mem_1337
DEBUG   | 2022-09-22 11:05:13,869 | angr.storage.memory_mixins.paged_memory.paged_memory_mixin | mem.load(0x1337, 10, Iend_BE) = <BV80 mem_1337_1_80{UNINITIALIZED}>
<BV80 mem_1337_1_80{UNINITIALIZED}>
```
I would be glad to have more assistance :)
I was really hoping you would be able to take it from here... nonetheless, I have pushed more changes to the branch such that your example now works.
Seems like the `_load_data` function gets called, but with an unexpected address (0x7fffffffffef000).
I have added

```python
bla = state.memory.load(0x5000, 10)
print(bla)
```
It appears to enter an infinite loop at line 139, with `real_start` being 0x500 and `real_end` being 0x7fffffffffef000 (from before), so `real_end - real_start + self.resident` becomes huge, more than `_real_max_resident`.
Also, is there a way to make the whole 64-bit address space available (and not just the lower half)? I have tried replacing

```python
memory_map = [(0, 2**(self.arch.bits - 1))]
```

with

```python
memory_map = [(0, 2**(self.arch.bits))]
```

but that did not really work.

Thanks for the big help so far!!
There is probably a bug somewhere in the stuff I wrote. Feel free to make whatever changes you need to make it work; I am out of cycles to help with this.
You will have a very bad time getting angr to accept you mapping the entire 64-bit address space. angr needs to map some additional object files into free slots of the memory map in order to support things like call_state and simprocedures, and will complain very loudly if it can't do that.
Hey, I worked on it a bit in my branch here. Could you check it out and share your opinion? I also provided `test_lazy.py`, which tests an example backend.
Some thoughts:
Are you looking to contribute this back upstream at some point?
I do want to contribute this upstream, yes.

1) Yeah, I have the problem of fitting all the addresses into host memory, but the current implementation lets angr "lazily" fetch whatever memory it needs at that point, so not all of the memory is loaded at once; evicting does not really benefit me.
2) Sure, I have added it back.
3) How do I need to change the backers to make it work properly again?
Hey, I want a backend class which will allow me to let the cle engine (and angr) get the needed bytes lazily. This way I will be able to have a sparse address space (similar to the minidump) without loading all of it beforehand. I want this backend to have one parameter, a `read` function with address and size parameters. When angr or cle wants to read memory, it will read from the cached copy if it exists; otherwise the `read` function will be called for that address and a cached copy will be created. When angr (cle) wants to write somewhere, the original bytes will be read into a cached copy and then the write operation will happen. If the cache already exists, angr will just write there.

I have a huge minidump file, and loading all the segments at initialization causes an out-of-memory error for my Python. In the end I want to use the angr engine with this backend and use the common normal functionality (like `explore` and such).

Thanks in advance and looking forward to hearing your opinion on this :)
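The read-through/copy-on-access semantics described above can be sketched as a small page cache wrapped around an arbitrary `read(addr, size)` callback. The names here (`LazyMemory`, the page size) are hypothetical illustration, not part of cle or the branch under discussion:

```python
PAGE = 0x1000

class LazyMemory:
    """Sketch of the described semantics: reads fill the cache from the
    user-supplied read(addr, size) callback; writes first pull the original
    page into the cache, then modify the cached copy."""

    def __init__(self, read):
        self._read = read          # user-supplied read(addr, size) -> bytes
        self._cache = {}           # page base address -> bytearray

    def _page(self, base):
        # On a miss, call out to the user's read function exactly once per page.
        if base not in self._cache:
            self._cache[base] = bytearray(self._read(base, PAGE))
        return self._cache[base]

    def load(self, addr, size):
        out = bytearray()
        while size:
            base = addr & ~(PAGE - 1)
            off = addr - base
            chunk = self._page(base)[off:off + size]
            out += chunk
            addr += len(chunk)
            size -= len(chunk)
        return bytes(out)

    def store(self, addr, data):
        while data:
            base = addr & ~(PAGE - 1)
            off = addr - base
            n = min(len(data), PAGE - off)
            self._page(base)[off:off + n] = data[:n]  # write to the cached copy
            addr += n
            data = data[n:]
```

Because `store` goes through `_page`, a write to an uncached page first pulls the original bytes into the cache and then overwrites just the written range, matching the behavior requested above.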