Closed flanfly closed 7 years ago
@flanfly I think you're working on something else right now, do you want me to start on this (I'm ok with it)?
I'll have a rolling set of questions, which I'll add occasionally to this issue:
ControlFlowTarget::Failed
; if so, if disassembly fails, what should we do, stop disassembly of the function by returning an error, yes?Meh, I don't like the current solution. It's probably better to have a separate list of disassembly failures and use smth like ControlFlowTarget::Indirect(Rvalue::Undefined)
instead.
Yea, agreed. How best to make this work is another issue tho
EDIT: I see your api proposal makes entry point non-optional; so what I said below is kind of moot :laughing:. Note, doing this will require larger changes to pipeline, and some other modules I think.
More questions/concerns:
region.name()
value? Can I remove it?Function::disassemble
or Function::new
that performs all disassembly immediately. I will play around with it though, might be more trouble than worth.The changes however are shaping up nice I think, lots of optional and match code removed, which is all a good thing imho.
It seems functions were initially introduced sort of as a placeholder for unresolved functions with names and starts; but really, that role is already satisfied by CallTarget::Todo(Rvalue::Constant
, so I don't think the argument for allowing lazy construction of Function
and then disassemble later is necessary, as CallTarget::Todo
satisifies this condition nicely.
How do i always stumble on these awful heisenbug :/; so I just introduced/exposed a bug that isn't deterministic in the tests:
Sometimes find_by_entry
test fails, sometimes not:
failures:
---- function::tests::entry_split stdout ----
thread 'function::tests::entry_split' panicked at 'assertion failed: `(left == right)`
left: `2`,
right: `3`', function.rs:1055
---- program::tests::find_by_entry stdout ----
thread 'program::tests::find_by_entry' panicked at 'assertion failed: `(left == right)`
left: `Some(AdjacencyListVertexDescriptor(0))`,
right: `Some(AdjacencyListVertexDescriptor(1))`', program.rs:236
failures:
function::tests::entry_split
program::tests::find_by_entry
failures:
---- function::tests::entry_split stdout ----
thread 'function::tests::entry_split' panicked at 'assertion failed: `(left == right)`
left: `2`,
right: `3`', function.rs:1055
failures:
function::tests::entry_split
@flanfly please see this comment; this was a big gotcha for me, didn't realize you needed to call ssa_convertion
after creating and disassembling a function, and I think it's because of the section I've pointed to, but I'm not sure: https://github.com/das-labor/panopticon/blob/36347905c59d3d61939f065064204e7d6b47b4e0/data-flow/src/ssa.rs#L105
@m4b raised the following issues on Gitter: right now mnemonic, and basic block are really simple and clean, abstract design. I like this. For the Suggested change, I really don't like the range references/indexes inside of Mnemonics and statements and basic blocks. It tightly couples mnemonics and basic blocks to the containing function (essentially the indexes are used because we don't want to copy the mnemonics, and basic blocks out of the cfg, but we can't store references in the function struct to members of itself) There number of issues this will cause I believe. So for example, I think you're right about display with and I removed it in another patch and put it in CLI where it probably belongs. It will be harder to do this because mnemonic will still have to be exposed publicly, but then really its internals which are only meaningful when used with a Function are also present I believe that 90% of issues will be fixed by leaving the cfg as the sole source of truth, and exposing iterators over the basic block
Another example of this added complexity is that if one modifies the cfg, say during disassembly, either the modifier or the modifying method now has to synchronize all of the ranges in every container that has indexes, which adds a whole boatload of complexity and potential bugs *disassembly and decompilation too Otoh, if we just expose iterators instead of slices (which only exist to iterate over afaics), then the implementation will be simpler, no synchronization is required (because all iterations are over the internal cfg), and the cfg is the "single source of truth" for the functions body Lastly if for some reason the user needs a list of mnemonics, with iterator design they can just collect it. Consequently I think the design right now is a bit of an over optimization and for both API use and simpler semantics, exposing iterators will be 90% of what we want, if not everything
We're almost there.
@flanfly
I think we need to make Function::new
absorb disassemble
, and have it be the only way to create a function as you described. Without doing this, I don't think it's possible to make entry_point
be a non-optional BasicBlock
(which is a good thing, it forces certain invariants, and removes a lot of redundant checks, etc. Also, afaics, all subsequent functions that take Function
, like ssa_convertion
, etc., require their to be a Resolved
entry point anyway, aka, a bb.). This will force Function::new
to have a Result<Self>
return value
I'm not sure of the best way to fix MnemonicOrError
; I think we want a robust parser now, and allow errors, but I don't know the best way to do it, input appreciated.
I'm not sure this is the PR that gets us "easy" mutable iterations over every statement in every mnemonic in every basic_block in a function. I believe this is actually very hard (it basically requires a mutable reference to the cfg to allow calling cfg.vertex_label_mut
and then also an immutable (or mutable) iterator over the vertexes themselves, which isn't possible without doing unsafe stuff, or other major hacks in the graph algos trait, iiuc. (i'm not a big graph theory guy, so this stuff in particular is hard for me :P). Consequently, I think it's up to the user to make mutable updates to the cfg using cfg_mut
There's some other minor points, but I think the api essentially looks like this now:
struct Function {
pub name: String,
uuid: Uuid,
entry_point: ControlFlowTargetRef,
pub cflow_graph: ControlFlowGraph, // TODO: need to remove pub, and force everyone to use getter, will be safer and easier to audit code
size: usize,
pub start: u64, // should probably remove this and just add a getter
}
impl Function {
// disassembly
pub fn new(start: u64, region: &Region, name, Option<String>) -> Function
{} // needs to be a result
pub fn disassemble<A>(&mut self,region: &Region) {} // slated to be removed
// getter
pub fn entry_point(&self) -> &ControlFlowTarget {} // Working on making this a BasicBlock
pub fn cfg(&self) -> &ControlFlowGraph {}
pub fn uuid(&self) -> &Uuid {}
// iters
// pub fn statements_iter(&self) -> Iterator<Item=&Statement> {} // not available directly right now
// pub fn mnemonics(&self) -> Iterator<Item=&Mnemonic> {} // not available directly right now
pub fn basic_blocks(&self) -> Iterator<Item=&BasicBlock> {}
// and various miscellaneous helpers
}
Refactors have allowed much nicer and clearer iteration, e.g.:
/// Returns all call targets.
pub fn collect_calls(&self) -> Vec<Rvalue> {
let mut ret = Vec::new();
for bb in self.basic_blocks() {
for statement in bb.statements() {
match statement {
&Statement { op: Operation::Call(ref t), .. } => ret.push(t.clone()),
_ => ()
}
}
}
debug!("collected calls: {:?}", ret);
ret
}
and more sensible return values + nicer code for certain functions, e.g.:
/// Returns the basic block that contains `a`.
pub fn find_basic_block_at(&self, a: u64) -> Option<&BasicBlock> {
self.basic_blocks().find(|&bb| bb.area.start <= a && bb.area.end > a)
}
To give a clearer example of the mutable graph iterator problem, this code is great example illustrating some of the issues:
/// Returns the function with UUID `a`.
pub fn find_function_by_uuid_mut<'a>(&'a mut self, a: &Uuid) -> Option<&'a mut Function> {
let ct = self.call_graph
.vertices()
.find(
|&x| match self.call_graph.vertex_label(x) {
Some(&CallTarget::Concrete(ref s)) => s.uuid() == *a,
_ => false,
}
);
if ct.is_none() {
return None;
}
match self.call_graph.vertex_label_mut(ct.unwrap()) {
Some(&mut CallTarget::Concrete(ref mut s)) => Some(s),
_ => None,
}
}
notice it first iterates the call graph, gets the vertex descriptor, then because we've let go of the immutable borrow, we now use that descriptor to get a mutable reference to a label, and then perform our update.
This will be tricky whether we expose iterators over the call graph, or whether we use a combination of vectors + a call graph with indexes into the vector...
So before I forget, I've added a size value to function. I think it'll be really cool to return the current size value in the function error type, as this will give an idea of how many bytes have been tried/"used" in case of checking how much of the binaries text section was disassembled vs errors encountered, etc. would be a neat metric I think
Thanks! That improves the API alot. Still, I think in the long run we need to get rid of BasicBlock and Mnemonic as independent types. When I look at the code we have now, none of these two are used independently. While they're part of my mental model I don't see the need to build the library around them. The cache misses caused by scattering RREIL code across the heap really slows down disassembly and analysis. I think an ideal Function type would hold all RREIL code in a single Vec w/ secondary indices marking where basic blocks and mnemonics begin. Then, we could provide iterators that generate proxy BasicBlock types on the fly.
@flanfly ya, I believe you're right; the upside of this refactor is that I think it will make transitioning to a no BasicBlock and Mnemonics much easier, since they're typically accessed now by using the iterators, instead of iterating the Cfg directly.
I believe this is done :tada:
Agreed that I think the next approach is to change the basic block and have as single vector or storage mechanism, so that, as you say, they aren't "sprayed all over the heap"
this should be much easier to do now that function internals aren't exposed as much
Now that Panopticon has grown, we have a better understanding how Function instances are used. Observations:
To make things simpler we put everything into Function and represent Mnemonic and BasicBlock instances by ranges. Also, no "empty" Function instance can exist. The only way to create a Function is to disassemble code.
/cc @m4b