das-labor / panopticon

A libre cross-platform disassembler.
https://panopticon.re
GNU General Public License v3.0
1.43k stars 78 forks source link

memory range interval tree multiset #323

Open m4b opened 7 years ago

m4b commented 7 years ago

I believe it is necessary to have a new data structure in project/program which maps virtual memory ranges to values.

Getting the right design down will be crucial, but I think it will premeditate many new features.

Design

I propose a new object, as generic as possible, which:

  1. Uses an interval tree for the backing data structure.
  2. Uses some kind of mechanism which allows defined, but arbitrary values (perhaps some kind of extendable tagged union, or a key-value store which is initialized on interval tree creation) to be inserted into the backing store.

Here are some use cases:

Values

A key-value list could be a nice interesting approach; but it will likely be complex implementation, and I'd have to think about it.

E.g., an extremely rough sketch with basic enum which we use seems fastest approach:

pub enum Permission {
  Read,
  ReadWrite,
  Execute,
  .. bla bla
}

pub enum MemoryInfo {
  Sector { name: String, permissions: Permission },
}
let itree = IntervalTree::new::<MemoryInfo>();
for section in elf.section {
  itree.insert(Interval::new(section.start, section.end),  MemoryInfo::Sector { name: section.name, section.permissions() });
}

Unfortunately this approach, besides perhaps the interval tree data structure itself, won't be as extensible. imho. Or rather it will be specific to panopticon itself.

While this is fine, I think we can do better (just not sure how yet ;))

The reason I'd not want it specific to panopticon is that I don't think it's necessary, but for example, I want to add interval based section/segment/symbol coloring to bingrep. I've actively been working on this; it would use goblin to fill up the ranges with metadata; but this is exactly what panopticon needs too.

I added this functionality to rdr via something I called bytecoverage; basically its a stupid name for tagging ranges with information (super revolutionary, I know). https://github.com/m4b/rdr/blob/master/lib/utils/ByteCoverage.ml

The problem is the backing data structures were just so badly implemented, the intervals never worked quite right, sometimes trampled other intervals, etc., so I'd like to really nail the data structures down before serious work on the values.

The added benefit of having it be a separate crate is that I could implement the byte ranges as say, goblin_coverage, which given an interval tree, inserts into it interval data specific to binary segments, symbols, etc.

Then, when panopticon uses the crate goblin_coverage or bingrep, or whoever, they'd all benefit from the "single source of truth", which means more testing, more bug reports, and better all around service.

I'd really love some feedback/brainstorming about how to get a good data structure that:

  1. is an interval tree
  2. the implementation is as efficient as we can get it, algorithmically speaking
  3. the values of the tree are generic
  4. our shared implementation uses a defined generic value, but which is extendable perhaps statically before construction, but which could benefit from another crate inserting this "generic" value into the tree also

So it sounds like 2 crates:

  1. an interval tree crate (we may be able to find a good working one)
  2. a crate which exposes a generic value, and as a basic service, can initialize its reified tree given say a goblin object, with the basic segment memory mappings for that object
flanfly commented 7 years ago

Have you seen this: https://github.com/theban/interval-tree? I don't know if @eqv still maintains it, but I think it's a good starting point.

flanfly commented 7 years ago

The memory section feature you describe overlaps a bit w/ the Layer/Region idea that's already implemented. I'm not happy with it's design, tho. So, the interval tree may subsume them somehow.

m4b commented 7 years ago

I'll check out layer/region; I'm mostly concerned about how to extensibly add values to the memory range data structure; probably enum is best. Specifically, the approach that BAP takes is very interesting: http://binaryanalysisplatform.github.io/memory

theban interval tree i've seen in the past, it looks good, has contains and range which are crucial:

    for (i,pair) in t.range(34, 36).enumerate() {
        //[...]
    }

It doesn't appear to build in the multiset ability, might have to write a thin wrapper; which would probably go in line with having a datatype the wrapper exposes for the insertable values?

eqv commented 7 years ago

Hey, I wrote theban/interval-tree exactly for this purpose. If you want to use it, feel free! If there are any problems, I would gladly help.