Closed rpm5099 closed 1 year ago
Perversely, if we ever get to address #479, this will provide a pathway to this one.
That said - how bad is duplication in memory, exactly?
I have not done actual measurements, but at a minimum converting a `memoryview` to `bytes` duplicates the contents of the `memoryview` in memory, plus I'm sure there is some Python overhead associated with the `bytes` class. As an alternative, it is possible to wrap a `memoryview` or `bytes` in a class that makes it behave like a file handle, so that you don't have to modify the other routines that accept it (i.e. so it has `read`, `tell`, `seek`, etc. methods that do the equivalent). If things are modified to accept `bytes`, then simply allowing `memoryview` in any type checking and treating it the same as `bytes` usually works fine.
This makes it much more convenient when passing data in memory, such as a file downloaded or extracted within Python. This isn't going to help with #479; all it would do is prevent duplication of the original 914 MB ELF in memory before things got started.
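Since no measurements were taken above, here is a minimal sketch of how the cost could be checked with the standard `tracemalloc` module. The 8 MiB `bytearray` is only a stand-in for an ELF image held in memory; the names and sizes are assumptions chosen for illustration.

```python
import tracemalloc

# Stand-in for an ELF image already held in memory (size is arbitrary).
payload = bytearray(8 * 1024 * 1024)
view = memoryview(payload)

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()

copied = bytes(view)                     # allocates a second full-size buffer
after_copy, _ = tracemalloc.get_traced_memory()

window = view[1024:1024 * 1024]          # new memoryview object, same storage
after_slice, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

copy_cost = after_copy - baseline        # roughly the size of the payload
slice_cost = after_slice - after_copy    # only the small view object itself
print(f"bytes(view): {copy_cost} bytes, view slice: {slice_cost} bytes")
```

On CPython the first figure should be about the size of the payload, while the second should be the size of one small `memoryview` object, regardless of how large the slice is.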
The docs state that `BytesIO` accepts any object with the buffer protocol - `memoryview` included. Have you tried constructing a `BytesIO` around a `memoryview` and feeding it to the `ELFFile` constructor? It should work.
However, once you get to DWARF parsing (and you are after that, aren't you?), the first thing pyelftools does is read all debug-related sections into `bytes` objects. So that is duplication, if only of the DWARF-related sections. You can work around that too, if you construct a `BytesIO` around a `memoryview` for each section and call the `DWARFInfo` constructor directly. I do that for non-ELF files in DWEX - see here, in the Mach-O and PE handling code. Except I construct my `BytesIO` objects around `bytes`.
This portion is relatively fragile between pyelftools versions - as the DWARF standard grows more sections, we add more arguments to the `DWARFInfo` constructor. No way around that.
For some guidance on how to pull DWARF sections from ELF, see `get_dwarf_info()` in pyelftools' elftools/elf/elffile.py. Supporting compressed sections will be a problem (gzipped data are not seekable), but if your binaries don't have them, you can ignore that scenario.
EDIT: tested the BytesIO scenario, it works:
```python
from io import BytesIO
from elftools.elf.elffile import ELFFile

with open("test\\testfiles_for_readelf\\dwarf_lineprogramv5.elf", "rb") as f:
    b = f.read(1000000)
with ELFFile(BytesIO(memoryview(b))) as elf:
    di = elf.get_dwarf_info()
    CUs = list(di.iter_CUs()) # Hello DWARF
exit(0)
```
Where pyelftools could make your life easier: we can add a check for already in-memory streams (`if isinstance(stream, BytesIO)`, basically) to the section retrieval logic to avoid data duplication there. As for the compressed sections, we'll still have to read and decompress those. But asking to be relieved of the terrible burden of having to construct an extraneous `BytesIO` is, frankly, a bit too much.
OBTW, the `ELFFile` constructor doesn't accept a filename. It accepts a stream object.
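The check for in-memory streams mentioned in this comment could look roughly like the sketch below. `section_view` is a hypothetical helper, not part of pyelftools; it relies only on the standard `BytesIO.getbuffer()` method, which exposes the stream's contents without copying them.

```python
import io

def section_view(stream, offset, size):
    """Return `size` bytes at `offset`, avoiding a copy for BytesIO streams.

    Hypothetical helper for illustration; not a pyelftools function.
    """
    if isinstance(stream, io.BytesIO):
        # getbuffer() is a view onto the BytesIO contents, not a copy.
        return stream.getbuffer()[offset:offset + size]
    # Any other stream: fall back to a regular (copying) read.
    stream.seek(offset)
    return stream.read(size)

# A fake 64-byte header followed by some section-like payload.
stream = io.BytesIO(b"\x7fELF" + bytes(60) + b".debug_info payload")
chunk = section_view(stream, 64, 11)
print(bytes(chunk))
```

One caveat with this approach: as long as a view obtained from `getbuffer()` is alive, the `BytesIO` object cannot be resized or closed, so the caller has to manage the lifetime of the returned slices.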
> The docs state that BytesIO accepts any object with Buffer protocol - memoryview included. Have you tried constructing a BytesIO around a memoryview and feeding it to the ELFFile constructor? It should work.

Right, but it duplicates the memory. We seem to be focusing on the less important reason for accepting memory buffers (bytes/memoryview), which is being able to pass files directly from other applications within Python without having to `io`-wrap them. I was just providing some options for ways it could be done without duplicating memory. The memory/speed advantage only matters at higher volumes. I didn't mean by my suggestion that all memory allocation should be eliminated throughout the processing. I will point out, though, that since memoryviews use pointers, slicing them into new objects doesn't duplicate the memory.
Sticking with `io` throughout certainly gives the ability to manage how much of the file ends up in memory if it's a file handle to disk, but it seems here that most or all of the file ends up in memory anyway, right?
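The two claims in this comment (a `BytesIO` built around a `memoryview` holds its own copy, while a `memoryview` slice shares storage) can be checked with a short standard-library sketch. The 16-byte `bytearray` is just a stand-in for an in-memory ELF image.

```python
from io import BytesIO

backing = bytearray(b"\x7fELF" + bytes(12))   # stand-in for an in-memory ELF
view = memoryview(backing)

stream = BytesIO(view)        # BytesIO takes its own copy of the data
window = view[0:4]            # a slice is a new view onto the same storage

backing[0] = 0x00             # modify the original buffer afterwards

stream_head = stream.read(4)  # still the old bytes: the stream holds a copy
window_head = bytes(window)   # sees the change: the slice shares memory
print(stream_head, window_head)
```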
There are two levels of memory duplication, in a way.
First, when `ELFFile` constructs a `DWARFInfo`, it reads every DWARF section from the underlying stream (typically a file, in your case I guess not) into a `bytes`-like object and immediately writes that into a `BytesIO` object with no explicit backing store; I assume the intermediate buffer is ditched immediately after. Optionally, it decompresses the gzipped sections at that step. If the underlying stream is a file, there is no duplication there except fleetingly, and no more than one section at a time.
Second, as the library parses, there is aggressive caching of parsed objects - on the `DWARFInfo` level and on the CU level. That is harder to avoid.
The first bit could have been replaced by some kind of slicing-in-place logic if the original stream is a memory one. The unbuffered, read-only version of `BytesIO` is not that hard to write. But again, if the sections are compressed, it's all for naught. `seek` is all over the place; parsing without seeking is possible in theory, but a massive rewrite in practice.
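To illustrate why compression defeats slicing in place, here is a small sketch using the standard `zlib` module as a stand-in for whatever compression a section uses. The 8-byte fake header and the payload are invented for the example.

```python
import zlib

# Stand-in for a compressed debug section stored inside an in-memory file.
original = b"DW_AT_name\x00" * 1000
file_image = memoryview(b"HEADER--" + zlib.compress(original))

compressed = file_image[8:]               # zero-copy slice of the stored bytes
inflated = zlib.decompress(compressed)    # always a freshly allocated bytes object

print(len(compressed), len(inflated))
```

The slice of the stored data costs nothing extra, but the decompressed content has to live in a new buffer, so at least one full copy of each compressed section is unavoidable.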
Give me the bigger picture please. Are you generating ELF files in memory, or reading them to memory and then parsing? Are you running into out-of-memory conditions in practice and think duplication is to blame? Because "let's read the file into memory and reuse that buffer as much as possible" is not the same as "let's not use a ton of memory". The former is one approach to the latter, but not the only one. Notably, it's the subject of #479 :) I have an alternative approach to #479 in mind; it will be slow and I/O-heavy but cheaper on memory.
This is just about being able to pass bytes/memoryview in addition to filename or io object. Yes, to both generating and reading them to memory. Parsing larger amounts of files with high performance is not the goal of this project, but there's a pretty straightforward way to allow bytes/memoryview that doesn't require duplication of the file data and memory overhead and copying happening right at the start.
When I get a chance I'll make a branch with those changes and you can take a look and see what you think.
If you write a read-only stream class (with `read`, `seek`, and `tell`) backed by a `memoryview`, and pass that to `ELFFile()`, it will address a good portion of your concerns. No changes to pyelftools necessary.
EDIT: here is one. It works with pyelftools, I've checked. It makes copies on reading, because pyelftools expects, here and there, the result of `read()` to be `bytes`, and memoryview slices are not.
```python
import os

class memoryviewstream():
    def __init__(self, mv):
        self.memview = mv
        self.pos = 0

    def tell(self):
        return self.pos

    def seek(self, offset, whence = os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.pos = offset
        elif whence == os.SEEK_CUR:
            self.pos += offset
        elif whence == os.SEEK_END:
            self.pos = len(self.memview) + offset
        else:
            raise ValueError()

    def read(self, length):
        n = len(self.memview)
        if self.pos >= n:
            # At or past the end: return empty bytes, do not move
            r = b''
            length = 0
        else:
            # Clamp short reads near the end of the buffer
            if self.pos + length > n:
                length = n - self.pos
            # bytes() makes a copy; pyelftools expects bytes from read()
            r = bytes(self.memview[self.pos:self.pos + length])
        self.pos += length
        return r

    def close(self):
        self.memview = None
```
Some of the edge case behaviors of `BytesIO` were deliberately mimicked here.
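For reference, these are the kinds of `BytesIO` edge cases a replacement stream would need to reproduce. This sketch exercises only the standard library class, not the stream class from this comment.

```python
import os
from io import BytesIO

s = BytesIO(b"0123456789")

s.seek(8)
short_read = s.read(5)        # fewer bytes than requested near the end
at_end = s.read(1)            # reading at the end yields empty bytes

s.seek(100)                   # seeking past the end is allowed
past_end_pos = s.tell()
past_end = s.read(4)          # and reading there yields empty bytes too

s.seek(-3, os.SEEK_END)       # offsets relative to the end work as for files
tail = s.read(10)

print(short_read, at_end, past_end_pos, past_end, tail)
```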
This could be further enhanced if the section data retrieval logic in `ELFFile` would slice instead of reading with copying (absent compression), but that would be a change (or a monkeypatch) to pyelftools proper.
Relocations are another thing that will get in the way of plain slicing. By default, ELFFile applies the relocations...
Yes, that's pretty much what I was talking about. The suggestion is to make that part of pyelftools so that each individual user doesn't have to come up with that class in order to pass bytes or memoryview.
@eliben: what do you think of this? I don't want to mess too much with the existing `ELFFile`, but I could slap together a subclass of `ELFFile` that would take a `Buffer`-like object.
Relocation would be a bit of a challenge - you can relocate a `bytearray` in place but not `bytes`.
Won't be my first priority.
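The point about relocating a `bytearray` but not `bytes` comes down to whether the backing buffer is writable. A minimal sketch, using a made-up 32-bit little-endian patch at offset 4 rather than a real relocation:

```python
import struct

# Writable backing store: a relocation-style patch can be applied in place.
image = bytearray(16)
view = memoryview(image)
struct.pack_into("<I", view, 4, 0x1000)    # write a 32-bit LE value at offset 4
patched = bytes(image[4:8])

# Read-only backing store: the same kind of write is rejected.
frozen = memoryview(bytes(16))
try:
    frozen[4:8] = struct.pack("<I", 0x1000)
    in_place_ok = True
except TypeError:
    in_place_ok = False

print(patched, in_place_ok)
```

With a read-only buffer, applying relocations would therefore need a writable copy of at least the affected section.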
If this is orthogonal to the current code, it makes more sense to me to add it as an example rather than as a new exported class in pyelftools.
Okay, here is a further optimized attempt:
```python
import os

from elftools.elf.elffile import ELFFile
from elftools.elf.relocation import RelocationHandler
from elftools.dwarf.dwarfinfo import DebugSectionDescriptor
from elftools.common.exceptions import ELFError

class bufferstream():
    def __init__(self, buf):
        self.buffer = buf
        self.pos = 0

    def tell(self):
        return self.pos

    def seek(self, offset, whence = os.SEEK_SET):
        if whence == os.SEEK_SET:
            self.pos = offset
        elif whence == os.SEEK_CUR:
            self.pos += offset
        elif whence == os.SEEK_END:
            self.pos = len(self.buffer) + offset
        else:
            raise ValueError()

    def read(self, length):
        # Copying read: pyelftools expects bytes here
        return bytes(self.slice(self.pos, length))

    def slice(self, pos, length):
        # Non-copying read: returns a slice of the underlying buffer
        n = len(self.buffer)
        if pos >= n:
            r = b''
            length = 0
        else:
            if pos + length > n:
                length = n - pos
            r = self.buffer[pos:pos + length]
        self.pos = pos + length
        return r

    def close(self):
        self.buffer = None

class ELFFileBuffer(ELFFile):
    """ Creation: the constructor accepts a Buffer-like object - bytes, bytearray, or memoryview
    """
    def __init__(self, buffer):
        ELFFile.__init__(self, bufferstream(buffer))

    def _read_dwarf_section(self, section, relocate_dwarf_sections):
        if section.header['sh_type'] == 'SHT_NOBITS':
            # Size comes from the section, not from the ELFFile object
            section_stream = bufferstream(b'\0'*section.data_size)
        elif not section.compressed:
            section_stream = bufferstream(section.stream.slice(section['sh_offset'], section._decompressed_size))
        else:
            # Reuse the decompression from the base; still wrapped in a descriptor below
            section_stream = bufferstream(section.data())

        if relocate_dwarf_sections:
            reloc_handler = RelocationHandler(self)
            reloc_section = reloc_handler.find_relocations_for_section(section)
            if reloc_section is not None:
                raise ELFError("DWARF relocations are not supported")
                #reloc_handler.apply_section_relocations(
                #    section_stream, reloc_section)

        return DebugSectionDescriptor(
            stream=section_stream,
            name=section.name,
            global_offset=section['sh_offset'],
            size=section.data_size,
            address=section['sh_addr'])
```
`ELFFileBuffer` takes a `bytes`/`bytearray`/`memoryview`. For now, I've shorted out relocations. @rpm5099: see if swapping `ELFFile` for `ELFFileBuffer` makes a difference in your memory use situation.
This looks good. Here is a version of it that I use sometimes, but IIRC it did give me a few exceptions related to UTF conversions when used with the current version of the library.
```python
import io

class MemoryIO(io.RawIOBase):
    __slots__ = ('buffer', 'offset')

    def __init__(self, buffer):
        self.buffer = memoryview(buffer).toreadonly()
        self.offset = 0

    def __getattr__(self, name):
        if not hasattr(self.buffer, name):
            raise AttributeError(f"Neither MemoryIO nor memoryview has an attribute called {name}")
        return getattr(self.buffer, name)

    def __getitem__(self, *args, **kwargs):
        return self.buffer.__getitem__(*args, **kwargs)

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.offset = offset
        elif whence == io.SEEK_CUR:
            self.offset = self.offset + offset
        elif whence == io.SEEK_END:
            self.offset = len(self.buffer) + offset
        # Clamp the position to the bounds of the buffer
        if self.offset < 0:
            self.offset = 0
        elif self.offset > len(self.buffer):
            self.offset = len(self.buffer)
        return self.offset

    def tell(self):
        return self.offset

    def write(self, *args):
        raise io.UnsupportedOperation('write')

    def readable(self):
        return self.offset != len(self.buffer)

    def read(self, size=-1):
        if size == -1:
            size = len(self.buffer) - self.offset
        elif len(self.buffer) - size < self.offset:
            size = len(self.buffer) - self.offset
        offset = self.offset
        self.offset = offset + size
        # Returns a memoryview slice, not bytes
        return self.buffer[offset:offset + size]

    def readline(self, size=-1):
        noffset = len(self.buffer)
        for i in range(self.offset, len(self.buffer)):
            if self.buffer[i] == 0x0a:
                noffset = i+1
                break  # stop at the first newline, not the last one
        result = self.buffer[self.offset:noffset]
        self.offset = noffset
        return result

    def readall(self):
        return self.read()

    def readinto(self, b):
        # Never write more than the destination can hold
        nbytes = min(len(b), len(self.buffer) - self.offset)
        b[:nbytes] = self.buffer[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return nbytes
```
A buffer slice is not functionally equivalent to `bytes`. It doesn't have `decode` or `find`, which pyelftools expects. Your version of `read` returns a buffer slice.
You can spend some time building an in-place equivalent (that will work with all pyelftools' internals), or wrap that slice in `bytes()`. The latter will, most likely, create a copy.
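The difference described here can be seen with a short sketch. `cstring_at` is a hypothetical helper written for illustration, not a pyelftools function; it searches the `memoryview` directly (the `re` module accepts bytes-like objects) and copies only the bytes of the one string being decoded.

```python
import re

# A string-table-like buffer with NUL-terminated names.
table = memoryview(b"main\x00helper\x00")

has_decode = hasattr(table, "decode")   # memoryview has no decode()
has_find = hasattr(table, "find")       # nor find()

_NUL = re.compile(b"\x00")

def cstring_at(view, start):
    """Decode the NUL-terminated string at `start`, copying only that string.

    Hypothetical helper for illustration; not a pyelftools function.
    """
    match = _NUL.search(view, start)    # searches the buffer without copying it
    end = match.start() if match else len(view)
    return bytes(view[start:end]).decode("utf-8")

first = cstring_at(table, 0)
second = cstring_at(table, 5)
print(has_decode, has_find, first, second)
```

This keeps the copy proportional to the string being read rather than to the section, at the cost of touching every call site that currently assumes `bytes`.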
What I sent was not customized for elftools, which is why I suggested that this class should be a part of the library rather than left up to each user to figure out.
You'll have to look at the `decode` calls. A slice of a memoryview is a memoryview; if you want bytes from that, call `.tobytes()`, and yes, it will duplicate that portion of memory. Here's a find function:
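```python
import re

def find(self, pattern):
    # Treat `pattern` as literal bytes and return -1 on a miss, like bytes.find()
    m = re.search(re.escape(pattern), self.buffer)
    if m:
        return m.start()
    return -1
```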
Have you tried `ELFFileBuffer` from my snippet with an eye on memory usage? There is nothing in that fragment that will significantly benefit from first-party-level integration with pyelftools. Deriving your classes from library ones is a legitimate, if somewhat fragile, technique. What I'm saying is, I don't think your problem is pressing enough to justify modifying `pyelftools`, and @eliben seems to be on the same page. Especially since I suspect your concerns can be alleviated with a rather short subclass that can live in your project.
Except for accepting bytes/memoryview without the user having to figure out how to generate a wrapper class to allow for it. Ok, sounds good, I think we've gone in circles enough here.
Could you add a change for `elftools.elf.elffile.ELFFile` to accept `bytes`, `bytearray`, and `memoryview` types in addition to filename/file handle? Any ELF already in memory must either be wrapped in a [useless] class that makes bytes look like an IO object, be converted into a `BytesIO` object (which duplicates data in memory), or be unnecessarily written to disk. Thanks.