Open ELF's from Memory - Githubissues

rpm5099 commented 1 year ago

Could you add a change for elftools.elf.elffile.ELFFile to accept bytes, bytearray, and memoryview types in addition to filename/file handle? Any ELF already in memory currently has to be wrapped in a [useless] class that makes bytes look like an IO object, converted into a BytesIO object (which duplicates the data in memory), or unnecessarily written to disk. Thanks.

sevaa commented 1 year ago

Perversely, if we ever get to address #479, this will provide a pathway to this one.

That said - how bad is duplication in memory, exactly?

rpm5099 commented 1 year ago

I have not done actual measurements, but at a minimum converting memoryview to bytes duplicates the contents in memory of whatever is in the memoryview, plus I'm sure some Python overhead associated with the bytes class. As an alternative it is possible to wrap memoryview or bytes in a class that makes it behave like a file handle so that you don't have to modify the other routines that accept it (i.e. so it has read, tell, seek etc. methods that do the equivalent). If things are modified to accept bytes, then simply allowing memoryview in any type checking and treating it the same as bytes usually works fine.

This makes it much more convenient when passing data in memory, such as a file downloaded or extracted within Python. This isn't going to help with #479; all it would do is prevent duplication of the original 914MB ELF in memory before things got started.
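For anyone who wants numbers rather than estimates, the cost of the copy can be measured with the standard tracemalloc module. The following is a minimal sketch, not a benchmark of pyelftools: the helper name peak_allocation and the 1 MB size are made up for illustration, and a zero-filled bytearray stands in for a real ELF image.

```python
import tracemalloc

def peak_allocation(fn):
    """Return the peak number of bytes Python allocated while fn() ran."""
    tracemalloc.start()
    try:
        result = fn()  # keep the result alive until the peak is sampled
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak

SIZE = 1_000_000
data = bytearray(SIZE)   # stand-in for an ELF image already in memory
view = memoryview(data)

copy_cost = peak_allocation(lambda: bytes(view))      # full copy of the buffer
slice_cost = peak_allocation(lambda: view[100:SIZE])  # new view, same memory

print(f"bytes(view): {copy_cost} bytes, view slice: {slice_cost} bytes")
```

The first figure should be at least the size of the buffer, while the second should be the size of a small memoryview object regardless of how large the slice is.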

sevaa commented 1 year ago

The docs state that BytesIO accepts any object with Buffer protocol - memoryview included. Have you tried constructing a BytesIO around a memoryview and feeding it to the ELFFile constructor? It should work.

However, once you get to DWARF parsing (and you are after that, aren't you?), the first thing pyelftools does is read all debug related sections into bytes objects. So that is duplication, if only of the DWARF related sections. You can work around that too, if you construct a BytesIO around a memoryview for each section and call the DWARFInfo constructor directly. I do that for non-ELF files in DWEX - see here, in the Mach-O and PE handling code. Except I construct my BytesIO objects around bytes.

This portion is relatively fragile between pyelftools' versions - as the DWARF standard grows more sections, we add more arguments to the DWARFInfo constructor. No way around that.

For some guidance how to pull DWARF sections from ELF, see get_dwarf_info() in pyelftools' elftools/elf/elffile.py. Supporting compressed sections will be a problem (gzipped data are not seekable), but if your binaries don't have them, you can ignore that scenario.

EDIT: tested the BytesIO scenario, it works:

from io import BytesIO
from elftools.elf.elffile import ELFFile

with open("test\\testfiles_for_readelf\\dwarf_lineprogramv5.elf", "rb") as f:
    b = f.read(1000000)

with ELFFile(BytesIO(memoryview(b))) as elf:
    di = elf.get_dwarf_info()
    CUs = list(di.iter_CUs()) # Hello DWARF
    exit(0)

Where pyelftools could make your life easier, we can add a check for already in-memory streams (if isinstance(stream, BytesIO) basically) to the section retrieval logic to avoid data duplication there. As for the compressed sections, we'll still have to read and decompress those. But asking to be relieved of the terrible burden of having to construct an extraneous BytesIO is, frankly, a bit too much.
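A rough sketch of what such a check might look like follows. This is not existing pyelftools code: the helper name in_memory_section_view is hypothetical, and the only library call relied on is BytesIO.getbuffer(), which returns a memoryview over the stream's contents. Whether CPython copies anything internally on the first getbuffer() call is an implementation detail not measured here.

```python
from io import BytesIO

def in_memory_section_view(stream, offset, size):
    """Hypothetical helper: return a memoryview of stream[offset:offset+size]
    when the stream is a BytesIO, or None so the caller can fall back to
    the usual seek()/read() path."""
    if not isinstance(stream, BytesIO):
        return None
    # getbuffer() exposes the stream's internal buffer; slicing the
    # resulting memoryview does not create a per-section bytes object.
    return stream.getbuffer()[offset:offset + size]

stream = BytesIO(b"\x7fELF" + b"section-bytes" + b"trailer")
view = in_memory_section_view(stream, 4, 13)
print(bytes(view))
view.release()  # the BytesIO cannot be resized or closed while views exist
```

The rest of pyelftools would still need bytes in places, so a view like this would only help where the consumer accepts any bytes-like object.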

OBTW, the ELFFile constructor doesn't accept a filename. It accepts a stream object.

rpm5099 commented 1 year ago

> The docs state that BytesIO accepts any object with Buffer protocol - memoryview included. Have you tried constructing a BytesIO around a memoryview and feeding it to the ELFFile constructor? It should work.

Right, but it duplicates the memory. We seem to be focusing on the less important reason for accepting memory buffers (bytes/memoryview) which is being able to pass files directly from other applications within python without having to io wrap them. I was just providing some options for ways it could be done without duplicating memory. The memory/speed advantage only affects higher volumes. I didn't mean by my suggestion that all memory allocation should be eliminated throughout the processing. I will point out though that since memoryviews use pointers, slicing them into new objects doesn't duplicate the memory.
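The point about memoryview slices can be checked directly. In this small illustration (the buffer contents are invented), a slice of a view keeps tracking the underlying bytearray, while a bytes copy is a frozen snapshot:

```python
data = bytearray(b"\x7fELF" + bytes(12))   # made-up stand-in for file contents
view = memoryview(data)

header = view[:4]            # a slice of a memoryview is another view, not a copy
snapshot = bytes(view[:4])   # bytes() copies the four bytes out

data[0] = 0x00               # change the underlying buffer in place

print(bytes(header))         # reflects the change made through `data`
print(snapshot)              # still holds the original four bytes
print(header.obj is data)    # the slice still refers to the same bytearray
```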

Sticking with io throughout certainly gives the ability to manage how much of the file ends up in memory if it's a file handle to disk, but it seems here that most/all of the file ends up in memory anyway, right?

sevaa commented 1 year ago

There are two levels of memory duplication, in a way.

First, when ELFTools constructs a DWARFInfo, it reads every DWARF section from the underlying stream (typically a file, in your case I guess not) into a bytes-like object and immediately writes that into a BytesIO object with no explicit backing store; assuming the intermediate buffer is ditched immediately after. Optionally, it decompresses the gzipped sections on that step. If the underlying stream is a file, there is no duplication there except fleetingly, and no more than one section at a time.

Second, as the library parses, there is aggressive caching of parsed objects - on the DWARFInfo level and on the CU level. That is harder to avoid.

The first bit could have been replaced by some kind of slicing-in-place logic if the original stream is a memory one. The unbuffered, read-only version of BytesIO is not that hard to write. But again, if the sections are compressed, it's all for naught. seek is all over the place; parsing without seeking is possible in theory, but a massive rewrite in practice.

Give me the bigger picture please. Are you generating ELF files in memory, or reading them to memory and then parsing? Are you running into out-of-memory conditions in practice and think duplication is to blame? Because "let's read the file into memory and reuse that buffer as much as possible" is not the same as "let's not use a ton of memory". The former is one approach to the latter, but not the only one. Notably, it's the subject of #479 :) I have an alternative approach to #479 in mind; it will be slow and I/O-heavy but cheaper on memory.

rpm5099 commented 1 year ago

This is just about being able to pass bytes/memoryview in addition to filename or io object. Yes, to both generating and reading them to memory. Parsing larger amounts of files with high performance is not the goal of this project, but there's a pretty straightforward way to allow bytes/memoryview that doesn't require duplication of the file data and memory overhead and copying happening right at the start.

When I get a chance I'll make a branch with those changes and you can take a look and see what you think.

sevaa commented 1 year ago

If you write a read-only stream class (with read, seek, and tell) backed by a memoryview, and pass that to ELFFile() it will address a good portion of your concerns. No changes to pyelftools necessary.

EDIT: here is one. Works with pyelftools, I've checked. Makes copies on reading, 'cause pyelftools expects, here and there, the result of read() to be bytes, and memoryview slices are not.

import os

class memoryviewstream():
    def __init__(self, mv):
        self.memview = mv
        self.pos = 0

    def tell(self):
        return self.pos

    def seek(self, offset, whence = os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.pos = offset        
        elif whence == os.SEEK_CUR:
            self.pos += offset
        elif whence == os.SEEK_END:
            self.pos = len(self.memview) + offset
        else:
            raise ValueError()

    def read(self, length):
        n = len(self.memview)
        if self.pos >= n:
            r = b''
            length = 0
        else:
            if self.pos + length > n:
                length = n - self.pos
            r = bytes(self.memview[self.pos:self.pos + length])
        self.pos += length
        return r

    def close(self):
        self.memview = None

Some of the edge case behaviors of BytesIO were deliberately mimicked here.
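The comment does not list which edge cases are meant. Presumably they include short reads at the end of the data and seeking past the end, which is an assumption on this editor's part. For reference, this is how io.BytesIO itself behaves in those situations:

```python
import os
from io import BytesIO

s = BytesIO(b"abcdef")

s.seek(4)
tail = s.read(10)         # asks for 10 bytes, only 2 remain: short read
at_end = s.read(1)        # reading at the end yields empty bytes, no error

s.seek(100)               # seeking past the end is allowed
past_end = s.read(1)      # reading there also yields empty bytes
pos_after = s.tell()      # the position is left where seek() put it

s.seek(-2, os.SEEK_END)   # negative offsets count back from the end
from_end = s.read(2)

print(tail, at_end, past_end, pos_after, from_end)
```

The read() method of the class above follows the same pattern: it clamps the length to what is left and returns b'' without moving the position once the position is at or beyond the end.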

This could be further enhanced if the section data retrieval logic in ELFFile would slice instead of reading with copying (absent compression), but that would be a change (or a monkeypatch) to pyelftools proper.

Relocations are another thing that will get in the way of plain slicing. By default, ELFFile applies the relocations...

rpm5099 commented 1 year ago

Yes, that's pretty much what I was talking about. The suggestion is to make that part of pyelftools so that each individual user doesn't have to come up with that class in order to pass bytes or memoryview.

sevaa commented 1 year ago

@eliben: what do you think of this? I don't want to mess too much with the existing ELFFile, but I could slap together a subclass of ELFFile that would take a Buffer-like object.

Relocation would be a bit of a challenge - you can relocate a bytearray in place but not bytes.
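To illustrate the bytearray versus bytes distinction, here is a simplified sketch in which "relocating" is reduced to overwriting one little-endian 32-bit word at an offset. The helper patch_u32 is hypothetical and does not reflect how pyelftools' RelocationHandler works internally.

```python
import struct

def patch_u32(buf, offset, value):
    """Hypothetical stand-in for applying one relocation: overwrite a
    little-endian 32-bit word at `offset`, in place."""
    struct.pack_into("<I", buf, offset, value)

mutable = bytearray(8)
patch_u32(mutable, 4, 0xDEADBEEF)       # works: bytearray is writable

try:
    patch_u32(bytes(8), 4, 0xDEADBEEF)  # bytes is immutable
    patched_bytes = True
except TypeError:
    patched_bytes = False

print(mutable.hex(), patched_bytes)
```

With an immutable buffer, the relocated section would have to be copied into something writable first, which brings the duplication back for every section that has relocations.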

Won't be my first priority.

eliben commented 1 year ago

If this is orthogonal to the current code, it makes more sense to me to add it as an example rather than as a new exported class in pyelftools.

sevaa commented 1 year ago

Okay, here is a further optimized attempt:

import os

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationHandler
from elftools.dwarf.dwarfinfo import DebugSectionDescriptor
from elftools.common.exceptions import ELFError

class bufferstream():
    def __init__(self, buf):
        self.buffer = buf
        self.pos = 0

    def tell(self):
        return self.pos

    def seek(self, offset, whence = os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.pos = offset        
        elif whence == os.SEEK_CUR:
            self.pos += offset
        elif whence == os.SEEK_END:
            self.pos = len(self.buffer) + offset
        else:
            raise ValueError()

    def read(self, length):
        return bytes(self.slice(self.pos, length))

    def slice(self, pos, length):
        n = len(self.buffer)
        if pos >= n:
            r = b''
            length = 0
        else:
            if pos + length > n:
                length = n - pos
            r = self.buffer[pos:pos + length]
        self.pos = pos + length
        return r

    def close(self):
        self.buffer = None

class ELFFileBuffer(ELFFile):
    """ Creation: the constructor accepts a Buffer-like object - bytes, bytearray, or memoryview
    """
    def __init__(self, buffer):
        ELFFile.__init__(self, bufferstream(buffer))

    def _read_dwarf_section(self, section, relocate_dwarf_sections):
        if section.header['sh_type'] == 'SHT_NOBITS':
            section_stream = bufferstream(b'\0'*section.data_size)
        elif not section.compressed:
            section_stream = bufferstream(section.stream.slice(section['sh_offset'], section._decompressed_size))
        else:
            section_stream = bufferstream(section.data()) # Reuse the decompression from the base

        if relocate_dwarf_sections:
            reloc_handler = RelocationHandler(self)
            reloc_section = reloc_handler.find_relocations_for_section(section)
            if reloc_section is not None:
                raise ELFError("DWARF relocations are not supported")
                #reloc_handler.apply_section_relocations(
                #        section_stream, reloc_section)

        return DebugSectionDescriptor(
                stream=section_stream,
                name=section.name,
                global_offset=section['sh_offset'],
                size=section.data_size,
                address=section['sh_addr'])

ELFFileBuffer takes a bytes/bytearray/memoryview. For now, I've shorted out relocations. @rpm5099: see if swapping ELFFile for ELFFileBuffer makes a difference in your memory use situation.

rpm5099 commented 1 year ago

This looks good. Here is a version of it that I use sometimes, though IIRC it gave me a few exceptions related to UTF conversions when used with the current version of the library.

import io

class MemoryIO(io.RawIOBase):
    __slots__ = ('buffer', 'offset')

    def __init__(self, buffer):
        self.buffer = memoryview(buffer).toreadonly()
        self.offset = 0

    def __getattr__(self, name):
            if not hasattr(self.buffer, name):
                raise AttributeError(f"Neither MemoryIO nor memoryview has an attribute called {name}")
            return getattr(self.buffer, name)

    def __getitem__(self, *args, **kwargs):
        return self.buffer.__getitem__(*args, **kwargs)

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.offset = offset
        elif whence == io.SEEK_CUR:
            self.offset = self.offset + offset
        elif whence == io.SEEK_END:
            self.offset = len(self.buffer) + offset

        if self.offset < 0:
            self.offset = 0
        elif self.offset > len(self.buffer):
            self.offset = len(self.buffer)

        return self.offset

    def tell(self):
        return self.offset

    def write(self, *args):
        raise io.UnsupportedOperation('write')

    def readable(self):
        return self.offset != len(self.buffer)

    def read(self, size=-1):
        if size == -1:
            size = len(self.buffer) - self.offset
        elif len(self.buffer) - size < self.offset:
            size = len(self.buffer) - self.offset

        offset = self.offset
        self.offset = offset + size

        return self.buffer[offset:offset + size]

    def readline(self, size=-1):
        noffset = len(self.buffer)

        for i in range(self.offset, len(self.buffer)):
            if self.buffer[i] == 0x0a:
                noffset = i+1
                break  # stop at the first newline, not the last one

        result = self.buffer[self.offset:noffset]
        self.offset = noffset

        return result

    def readall(self):
        return self.read()

    def readinto(self, b):
        # Copy no more than the caller's buffer can hold
        nbytes = min(len(b), len(self.buffer) - self.offset)
        b[:nbytes] = self.buffer[self.offset:self.offset + nbytes]

        self.offset += nbytes
        return nbytes

sevaa commented 1 year ago

A buffer slice is not functionally equivalent to bytes. It doesn't have decode or find, which pyelftools expects. Your version of read returns a buffer slice.

You can spend some time building an in-place equivalent (that will work with all pyelftools' internals), or wrap that slice in bytes(). The latter will, most likely, create a copy.
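The gap between a memoryview slice and bytes is easy to see in isolation. The sketch below uses made-up section-name data and only standard-library calls; it also shows that str() with an explicit encoding accepts any bytes-like object, which is one way to decode without going through bytes first. Whether that fits pyelftools' internals is not something this example establishes.

```python
raw = bytearray(b".debug_info\x00.debug_line\x00")   # invented string-table data
piece = memoryview(raw)[:11]             # view over b".debug_info", no copy

has_decode = hasattr(piece, "decode")    # memoryview has no decode()
has_find = hasattr(piece, "find")        # and no find()

as_bytes = bytes(piece)                  # copies 11 bytes; piece.tobytes() is equivalent
name = as_bytes.decode("ascii")
nul_at = bytes(memoryview(raw)).find(b"\x00")   # find() needs a bytes copy as well

# str() accepts any bytes-like object when an encoding is given, so the
# intermediate bytes object can be skipped for decoding.
name_direct = str(piece, "ascii")

print(has_decode, has_find, name, nul_at, name_direct)
```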

rpm5099 commented 1 year ago

What I sent was not customized for elftools, which is why I suggested that this class should be a part of the library rather than left up to each user to figure out.

You'll have to look at the decode calls. A slice of a memoryview is a memoryview, if you want bytes from that call .tobytes(), and yes it will duplicate that portion of memory. Here's a find function:

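    def find(self, pattern):
        # re accepts any bytes-like object, memoryview included (needs `import re`);
        # escape the pattern so it is matched literally, as bytes.find() does
        m = re.search(re.escape(pattern), self.buffer)
        if m:
            return m.start()
        return -1  # bytes.find() signals "not found" with -1, not None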

sevaa commented 1 year ago

Have you tried ELFFileBuffer from my snippet with an eye on memory usage? There is nothing in that fragment that will significantly benefit from first-party-level integration with pyelftools. Deriving your classes from library ones is a legitimate, if somewhat fragile, technique. What I'm saying is, I don't think your problem is pressing enough to justify modifying pyelftools over, and @eliben seems to be on the same page. Especially since I suspect your concerns can be alleviated with a rather short subclass that can live in your project.

rpm5099 commented 1 year ago

Except for accepting bytes/memoryview without the user having to figure out how to generate a wrapper class to allow for it. Ok, sounds good, I think we've gone in circles enough here.

eliben / pyelftools

Open ELF's from Memory #481