New Function API - Githubissues

flanfly commented 7 years ago

Now that Panopticon has grown, we have a better understanding how Function instances are used. Observations:

Typical tasks are getting the entry point address and iterating all instructions are tedious b/c you need to iterate the cfg directly,
Functions w/o entry points are non-existing
BasicBlock and Mnemonic instances aren't used much outside of Function.

To make things simpler we put everything into Function and represent Mnemonic and BasicBlock instances by ranges. Also, no "empty" Function instance can exist. The only way to create a Function is to disassemble code.

struct BasicBlock {
    statements: Range<StatementIndex,StatementIndex>,
    mnemonics: Range<MnemonicIndex,MnemonicIndex>,
    area: Range<u64,u64>,
    vertex: ControlFlowRef,
}

struct Mnemonic {
    area: Range<u64,u64>,
    opcode: Cow<'static,str>,
    operands: Vec<Rvalue>,
    format_string: Vec<MnemonicFormatToken>,
}

enum ControlFlowTarget {
    BasicBlock(BasicBlockIndex),
    Indirect(Rvalue),
}

struct BasicBlockIndex {
    index: usize
};

struct MnemonicIndex {
    index: usize
};

struct StatementIndex {
    index: usize
};

type ControlFlowGraph = StableGraph<ControlFlowTarget,Guard>;

struct Function {
    pub name: Cow<'static,str>,
    uuid: Uuid,
    entry_point: BasicBlockIndex,
    statements: Vec<Statement>,
    basic_blocks: Vec<BasicBlock>,
    mnemonics: Vec<Mnemonics>,
    cflow_graph: ControlFlowGraph,
}

impl Function {
    // disassembly
    pub fn new(start: u64, region: &Region, name; Option<Cow<'static,str>>) -> Result<Function>
    {}
    pub fn continue(&mut self,region: &Region) -> Result<bool> {}

    // getter
    pub fn entry_point(&self) -> BasicBlockIndex {}
    pub fn statements(&self) -> &[Statement] {}
    pub fn mnemonics(&self) -> &[Mnemonics] {}
    pub fn basic_blocks(&self) -> &[BasicBlock] {}
    pub fn cflow_graph(&self) -> &ControlFlowGraph {}
    pub fn uuid(&self) -> &Uuid {}

    // iters
    pub fn statements_iter(&self) -> Iterator<Item=&Statement> {}
    pub fn mnemonics_iter(&self) -> Iterator<Item=(&Mnemonic,&[&Statement])> {}
    pub fn basic_blocks_iter(&self) -> Iterator<Item=(&BasicBlock,&[&Mnemonic])> {}
    pub fn statements_iter_mut(&mut self) -> Iterator<Item=&mut Statement> {}
    pub fn mnemonics_iter_mut(&mut self) -> Iterator<Item=(&mut Mnemonic,&[&mut Statement])> {}

    // aux
    pub fn first_address(&self) -> u64 {}
    pub fn has_unresolved_jumps(&self) -> bool {}
    pub fn resolve_jump(&self,Rvalue,u64) -> Result<bool> {}

    // insert
    pub fn insert_statement(stmt: Statement, mnemonic: MnemonicIndex,
        basic_block: BasicBlockIndex, index: usize) -> Result<()> {}
    pub fn insert_pseudo_mnemonic(stmts: &[Statement], opcode: Cow<'static,str>,
        basic_block: BasicBlock, index: usize) -> Result<()> {}
}

/cc @m4b

m4b commented 7 years ago

@flanfly I think you're working on something else right now, do you want me to start on this (I'm ok with it)?

m4b commented 7 years ago

I'll have a rolling set of questions, which I'll add occasionally to this issue:

Do you want to remove ControlFlowTarget::Failed; if so, if disassembly fails, what should we do, stop disassembly of the function by returning an error, yes?

flanfly commented 7 years ago

Meh, I don't like the current solution. It's probably better to have a separate list of disassembly failures and use smth like ControlFlowTarget::Indirect(Rvalue::Undefined) instead.

m4b commented 7 years ago

Yea, agreed. How best to make this work is another issue tho

m4b commented 7 years ago

EDIT: I see your api proposal makes entry point non-optional; so what I said below is kind of moot :laughing:. Note, doing this will require larger changes to pipeline, and some other modules I think.

More questions/concerns:

~~Why does function take (and store) the region.name() value? Can I remove it?~~
Exposing an optional entry point is strange; I can see why it was done, but I think that with new refactor we have the chance to remove this, and greatly reduce api friction (optionals and results in structs are a pain to deal with). If the function fails to disassemble it's entry point (e.g., the start location passed to it), then basically the entire function disassembly is broken, and should be considered a failure. I think we can fix this by only exposing a Function::disassemble or Function::new that performs all disassembly immediately. I will play around with it though, might be more trouble than worth.

The changes however are shaping up nice I think, lots of optional and match code removed, which is all a good thing imho.

It seems functions were initially introduced sort of as a placeholder for unresolved functions with names and starts; but really, that role is already satisfied by CallTarget::Todo(Rvalue::Constant, so I don't think the argument for allowing lazy construction of Function and then disassemble later is necessary, as CallTarget::Todo satisifies this condition nicely.

m4b commented 7 years ago

How do i always stumble on these awful heisenbug :/; so I just introduced/exposed a bug that isn't deterministic in the tests:

Sometimes find_by_entry test fails, sometimes not:

failures:

---- function::tests::entry_split stdout ----
    thread 'function::tests::entry_split' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `3`', function.rs:1055

---- program::tests::find_by_entry stdout ----
    thread 'program::tests::find_by_entry' panicked at 'assertion failed: `(left == right)`
  left: `Some(AdjacencyListVertexDescriptor(0))`,
 right: `Some(AdjacencyListVertexDescriptor(1))`', program.rs:236

failures:
    function::tests::entry_split
    program::tests::find_by_entry

failures:

---- function::tests::entry_split stdout ----
    thread 'function::tests::entry_split' panicked at 'assertion failed: `(left == right)`
  left: `2`,
 right: `3`', function.rs:1055

failures:
    function::tests::entry_split

m4b commented 7 years ago

@flanfly please see this comment; this was a big gotcha for me, didn't realize you needed to call ssa_convertion after creating and disassembling a function, and I think it's because of the section I've pointed to, but I'm not sure: https://github.com/das-labor/panopticon/blob/36347905c59d3d61939f065064204e7d6b47b4e0/data-flow/src/ssa.rs#L105

flanfly commented 7 years ago

@m4b raised the following issues on Gitter: right now mnemonic, and basic block are really simple and clean, abstract design. I like this. For the Suggested change, I really don't like the range references/indexes inside of Mnemonics and statements and basic blocks. It tightly couples mnemonics and basic blocks to the containing function (essentially the indexes are used because we don't want to copy the mnemonics, and basic blocks out of the cfg, but we can't store references in the function struct to members of itself) There number of issues this will cause I believe. So for example, I think you're right about display with and I removed it in another patch and put it in CLI where it probably belongs. It will be harder to do this because mnemonic will still have to be exposed publicly, but then really its internals which are only meaningful when used with a Function are also present I believe that 90% of issues will be fixed by leaving the cfg as the sole source of truth, and exposing iterators over the basic block

Another example of this added complexity is that if one modifies the cfg, say during disassembly, either the modifier or the modifying method now has to synchronize all of the ranges in every container that has indexes, which adds a whole boatload of complexity and potential bugs *disassembly and decompilation too Otoh, if we just expose iterators instead of slices (which only exist to iterate over afaics), then the implementation will be simpler, no synchronization is required (because all iterations are over the internal cfg), and the cfg is the "single source of truth" for the functions body Lastly if for some reason the user needs a list of mnemonics, with iterator design they can just collect it. Consequently I think the design right now is a bit of an over optimization and for both API use and simpler semantics, exposing iterators will be 90% of what we want, if not everything

m4b commented 7 years ago

We're almost there.

@flanfly

I think we need to make Function::new absorb disassemble, and have it be the only way to create a function as you described. Without doing this, I don't think it's possible to make entry_point be a non-optional BasicBlock (which is a good thing, it forces certain invariants, and removes a lot of redundant checks, etc. Also, afaics, all subsequent functions that take Function, like ssa_convertion, etc., require their to be a Resolved entry point anyway, aka, a bb.). This will force Function::new to have a Result<Self> return value
I'm not sure of the best way to fix MnemonicOrError; I think we want a robust parser now, and allow errors, but I don't know the best way to do it, input appreciated.
I'm not sure this is the PR that gets us "easy" mutable iterations over every statement in every mnemonic in every basic_block in a function. I believe this is actually very hard (it basically requires a mutable reference to the cfg to allow calling cfg.vertex_label_mut and then also an immutable (or mutable) iterator over the vertexes themselves, which isn't possible without doing unsafe stuff, or other major hacks in the graph algos trait, iiuc. (i'm not a big graph theory guy, so this stuff in particular is hard for me :P). Consequently, I think it's up to the user to make mutable updates to the cfg using cfg_mut

There's some other minor points, but I think the api essentially looks like this now:

struct Function {
    pub name: String,
    uuid: Uuid,
    entry_point: ControlFlowTargetRef,
    pub cflow_graph: ControlFlowGraph, // TODO: need to remove pub, and force everyone to use getter, will be safer and easier to audit code
    size: usize,
    pub start: u64, // should probably remove this and just add a getter
}

impl Function {
    // disassembly
    pub fn new(start: u64, region: &Region, name, Option<String>) -> Function
    {} // needs to be a result
    pub fn disassemble<A>(&mut self,region: &Region) {} // slated to be removed

    // getter
    pub fn entry_point(&self) -> &ControlFlowTarget {} // Working on making this a BasicBlock
    pub fn cfg(&self) -> &ControlFlowGraph {}
    pub fn uuid(&self) -> &Uuid {}

    // iters
    // pub fn statements_iter(&self) -> Iterator<Item=&Statement> {} // not available directly right now
    // pub fn mnemonics(&self) -> Iterator<Item=&Mnemonic> {} // not available directly right now
    pub fn basic_blocks(&self) -> Iterator<Item=&BasicBlock> {}

   // and various miscellaneous helpers
}

Refactors have allowed much nicer and clearer iteration, e.g.:

    /// Returns all call targets.
    pub fn collect_calls(&self) -> Vec<Rvalue> {
        let mut ret = Vec::new();
        for bb in self.basic_blocks() {
            for statement in bb.statements() {
                match statement {
                    &Statement { op: Operation::Call(ref t), .. } => ret.push(t.clone()),
                    _ => ()
                }
            }
        }
        debug!("collected calls: {:?}", ret);
        ret
    }

and more sensible return values + nicer code for certain functions, e.g.:

    /// Returns the basic block that contains `a`.
    pub fn find_basic_block_at(&self, a: u64) -> Option<&BasicBlock> {
        self.basic_blocks().find(|&bb| bb.area.start <= a && bb.area.end > a)
    }

m4b commented 7 years ago

To give a clearer example of the mutable graph iterator problem, this code is great example illustrating some of the issues:

    /// Returns the function with UUID `a`.
    pub fn find_function_by_uuid_mut<'a>(&'a mut self, a: &Uuid) -> Option<&'a mut Function> {
        let ct = self.call_graph
            .vertices()
            .find(
                |&x| match self.call_graph.vertex_label(x) {
                    Some(&CallTarget::Concrete(ref s)) => s.uuid() == *a,
                    _ => false,
                }
            );

        if ct.is_none() {
            return None;
        }

        match self.call_graph.vertex_label_mut(ct.unwrap()) {
            Some(&mut CallTarget::Concrete(ref mut s)) => Some(s),
            _ => None,
        }
    }

notice it first iterates the call graph, gets the vertex descriptor, then because we've let go of the immutable borrow, we now use that descriptor to get a mutable reference to a label, and then perform our update.

This will be tricky whether we expose iterators over the call graph, or whether we use a combination of vectors + a call graph with indexes into the vector...

m4b commented 7 years ago

So before I forget, I've added a size value to function. I think it'll be really cool to return the current size value in the function error type, as this will give an idea of how many bytes have been tried/"used" in case of checking how much of the binaries text section was disassembled vs errors encountered, etc. would be a neat metric I think

flanfly commented 7 years ago

Thanks! That improves the API alot. Still, I think in the long run we need to get rid of BasicBlock and Mnemonic as independent types. When I look at the code we have now, none of these two are used independently. While they're part of my mental model I don't see the need to build the library around them. The cache misses caused by scattering RREIL code across the heap really slows down disassembly and analysis. I think an ideal Function type would hold all RREIL code in a single Vec w/ secondary indices marking where basic blocks and mnemonics begin. Then, we could provide iterators that generate proxy BasicBlock types on the fly.

m4b commented 7 years ago

@flanfly ya, I believe you're right; the upside of this refactor is that I think it will make transitioning to a no BasicBlock and Mnemonics much easier, since they're typically accessed now by using the iterators, instead of iterating the Cfg directly.

m4b commented 7 years ago

I believe this is done :tada:

Agreed that I think the next approach is to change the basic block and have as single vector or storage mechanism, so that, as you say, they aren't "sprayed all over the heap"

this should be much easier to do now that function internals aren't exposed as much

das-labor / panopticon

New Function API #308