eliben / pyelftools

Parsing ELF and DWARF in Python

Acquiring address of a .elf file #352

Open yui-ishihara opened 3 years ago

yui-ishihara commented 3 years ago

Line 104 of examples/dwarf_decode_address.py is the following:

process_file(sys.argv[2], 0x400503)

which, to my understanding, passes the sample file sample_exe64.elf to the program, with the hard-coded number 0x400503 as its address.

Now I am wondering how I could obtain this address for my own .elf file.

sevaa commented 3 years ago

The address 0x400503 was chosen because it produces a valid source line. It has no special meaning.

What are you trying to do?

yui-ishihara commented 3 years ago

@sevaa Thanks for the reply.

I obtained a dataset that collected a number of .elf files. I tried to parse these files with this library. Concretely, I would like to map each of these binaries to their corresponding source code.

I do not know much about hardware and low-level details, but to my understanding, the dwarf_decode_address.py script parses the given .elf file and maps it to the source code it was compiled from. Specifically, for the sample file sample_exe64.elf, the output of python dwarf_decode_address.py --test sample_exe64.elf is the following:

Processing file: sample_exe64.elf
Function: main
File: z.c
Line: 3

which I think means that sample_exe64.elf was compiled from main() in the z.c source file.

Now my question is: how would I know the <address> parameter for my own .elf file, so that I could get output similar to the above?

Expected usage: dwarf_decode_address.py <address> <executable>

sevaa commented 3 years ago

An ELF file was compiled from source code which, I assume, consisted of more than one line. The example decodes just one address in the ELF file to its source line. One address, one line. It looks like you are trying to do something different.
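To get candidate addresses for an arbitrary ELF file, one option is to list the function symbols it defines. The following is a minimal sketch, assuming the binary is not stripped and pyelftools is installed; the helper names `function_addresses` and `read_symbols` are made up for illustration and are not part of the library.

```python
def function_addresses(symbols):
    """Return sorted (address, name) pairs for defined function symbols.

    `symbols` is any iterable of (name, type, value) tuples.
    """
    return sorted(
        (value, name)
        for name, sym_type, value in symbols
        # Assumption: a zero value marks an undefined (imported) function.
        if sym_type == "STT_FUNC" and value != 0
    )


def read_symbols(elf_path):
    """Yield (name, type, value) for every entry in the ELF symbol tables."""
    # Imported lazily so function_addresses() can be used without pyelftools.
    from elftools.elf.elffile import ELFFile
    from elftools.elf.sections import SymbolTableSection

    with open(elf_path, "rb") as f:
        elf = ELFFile(f)
        for section in elf.iter_sections():
            # Covers both .symtab and .dynsym, so names may repeat.
            if isinstance(section, SymbolTableSection):
                for sym in section.iter_symbols():
                    yield sym.name, sym["st_info"]["type"], sym["st_value"]


if __name__ == "__main__":
    import sys

    for address, name in function_addresses(read_symbols(sys.argv[1])):
        print(f"{address:#x} {name}")
```

Any printed address that falls inside code with line information can then be tried as the <address> argument of the example script.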

Concretely, I would like to map each of these binaries to their corresponding source code.

Do you have the source code for these binaries? Or are you trying to recover the source code from the binary itself, so that you can understand what exactly the binary does?

yui-ishihara commented 3 years ago

I do have access to the source code and I am trying to do the mapping rather than retrieval.

Actually, the dataset can be found here; it is a collection of .elf files compiled from 59 different Linux utilities with different compiler options.

sevaa commented 3 years ago

Can you please provide a more high-level description of the problem? At the end of the day, what are you trying to learn about these binary files? Because right now, it sounds a bit like an XY question.

yui-ishihara commented 3 years ago

Yes. I am trying to do something similar to this paper, which builds a neural-network-based system that can retrieve a binary given its source, or vice versa.

Their dataset gives the mapping between compiled binary and source code.

However, there are two issues with the dataset they provided

Therefore, I am trying to obtain a larger dataset. The best choice I could find, as I mentioned earlier, is BinKit.

So in sum, I already have the binaries (as provided in BinKit) and the sources (accessible, as they are all Linux utilities). The question is how to build the mapping between binary and source. That is where I need to turn to pyelftools for help.

sevaa commented 3 years ago

So you have a set of binaries with DWARF and a set of source trees, and you need to know which binary corresponds to which source? Has there been a deliberate effort to obfuscate (e.g. two binaries with two versions of the same utility, or same code, different compiler options), or are the binaries completely unrelated?

yui-ishihara commented 3 years ago

Sorry for the late reply.

So you have a set of binaries with DWARF and a set of source trees, and you need to know which binary corresponds to which source?

That is exactly what I am trying to do!

Has there been a deliberate effort to obfuscate (e.g. two binaries with two versions of the same utility, or same code, different compiler options), or are the binaries completely unrelated?

I guess the answer is yes, but these options are encoded in the filenames. The goal of the BinKit dataset is to compare binaries across different architectures and compiler options. So for the utility a2ps, it provides a long list of compiled .elf files, which look like

...
a2ps-4.14_gcc-4.9.4_arm_64_Os_a2ps.elf
a2ps-4.14_gcc-4.9.4_arm_64_Os_fixnt.elf
a2ps-4.14_gcc-4.9.4_mips_32_Os_a2ps.elf
a2ps-4.14_gcc-4.9.4_mips_32_Os_fixnt.elf
a2ps-4.14_gcc-4.9.4_mips_64_Os_a2ps.elf
a2ps-4.14_gcc-4.9.4_mips_64_Os_fixnt.elf
a2ps-4.14_gcc-4.9.4_mipseb_32_Os_a2ps.elf
a2ps-4.14_gcc-4.9.4_mipseb_32_Os_fixnt.elf
a2ps-4.14_gcc-4.9.4_mipseb_64_Os_a2ps.elf
...
a2ps-4.14_gcc-8.2.0_mips_64_Os_a2ps.elf
a2ps-4.14_gcc-8.2.0_mips_64_Os_fixnt.elf
a2ps-4.14_gcc-8.2.0_mipseb_32_Os_a2ps.elf
a2ps-4.14_gcc-8.2.0_mipseb_32_Os_fixnt.elf
a2ps-4.14_gcc-8.2.0_mipseb_64_Os_a2ps.elf
a2ps-4.14_gcc-8.2.0_mipseb_64_Os_fixnt.elf
a2ps-4.14_gcc-8.2.0_x86_32_Os_a2ps.elf
a2ps-4.14_gcc-8.2.0_x86_32_Os_fixnt.elf
a2ps-4.14_gcc-8.2.0_x86_64_Os_a2ps.elf
a2ps-4.14_gcc-8.2.0_x86_64_Os_fixnt.elf
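Since the build options are encoded in the filenames above, they can be recovered with a small parser. This is a minimal sketch, assuming the layout `<package>_<compiler>_<arch>_<bits>_<opt>_<binary>.elf` inferred from the names shown; the field names and the `parse_binkit_name` helper are hypothetical, and other BinKit files may follow a different pattern.

```python
import re
from typing import NamedTuple, Optional


class BinKitName(NamedTuple):
    package: str   # e.g. "a2ps-4.14"
    compiler: str  # e.g. "gcc-8.2.0"
    arch: str      # e.g. "x86", "arm", "mips", "mipseb"
    bits: int      # 32 or 64
    opt: str       # e.g. "Os"
    binary: str    # e.g. "a2ps", "fixnt"


# Assumed layout, inferred only from the filenames listed in this thread.
_PATTERN = re.compile(
    r"(?P<package>.+)"
    r"_(?P<compiler>[A-Za-z]+-\d+(?:\.\d+)*)"
    r"_(?P<arch>[A-Za-z0-9]+)"
    r"_(?P<bits>32|64)"
    r"_(?P<opt>O[0-9a-z]+)"
    r"_(?P<binary>.+)\.elf"
)


def parse_binkit_name(filename: str) -> Optional[BinKitName]:
    """Split a filename into its build fields, or return None if it does not fit."""
    m = _PATTERN.fullmatch(filename)
    if m is None:
        return None
    return BinKitName(
        m["package"], m["compiler"], m["arch"],
        int(m["bits"]), m["opt"], m["binary"],
    )


if __name__ == "__main__":
    print(parse_binkit_name("a2ps-4.14_gcc-8.2.0_x86_64_Os_a2ps.elf"))
```

Returning None for names that do not fit keeps unexpected files visible instead of silently mis-parsing them.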

As my goal is to create a dataset (as I mentioned in my previous comment) that maps the .elf binaries to source, it would be great if I could map any of these .elf files to the source code.

To make the discussion more concrete, a sample from the aforementioned list (i.e. a2ps-4.14_gcc-8.2.0_x86_64_Os_a2ps.elf) is downloadable here (966.8KB). The corresponding a2ps-4.14 source is available here (2.43MB). So here I would like to parse a2ps-4.14_gcc-8.2.0_x86_64_Os_a2ps.elf so that sections of this .elf binary can be mapped to the following unzipped a2ps-4.14 sources.

buffer.c    lexps.c         main.h           read.h        ssheet.h
buffer.h    lexps.h         Makefile.am      regex.c       sshread.c
delegate.c  lexps.l         Makefile.in      regex.h       sshread.h
delegate.h  lexssh.c        parsessh.c       select.c      version-etc.c
ffaces.c    lexssh.l        parsessh.h       select.h      version-etc.h
ffaces.h    long-options.c  parsessh.output  sheets-map.c  versions.c
generate.c  long-options.h  parsessh.y       sheets-map.l  versions.h
generate.h  main.c          read.c           ssheet.c      yy2ssh.h

sevaa commented 3 years ago

The CPU and the compiler options are trivially parseable from the DWARF info. Find the first CU (compile unit), take its top-level DIE (DW_TAG_compile_unit), and look at the DW_AT_producer attribute. The CPU/architecture is in the ELF header.
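A minimal sketch of reading both pieces with pyelftools, assuming DW_AT_producer sits on the top-level DIE of the first CU and that the producer string lists the compiler first and the command-line flags after it (typical of GCC, but not guaranteed). The helpers `split_producer` and `describe_binary` are hypothetical names.

```python
def split_producer(producer):
    """Split a producer string into (compiler description, list of flags).

    Assumption: everything from the first token starting with "-" onward
    is a command-line flag or a flag argument.
    """
    tokens = producer.split()
    first_flag = next(
        (i for i, tok in enumerate(tokens) if tok.startswith("-")),
        len(tokens),
    )
    return " ".join(tokens[:first_flag]), tokens[first_flag:]


def describe_binary(elf_path):
    """Return architecture facts from the ELF header and the first CU's producer."""
    # Imported lazily so split_producer() can be used without pyelftools.
    from elftools.elf.elffile import ELFFile

    with open(elf_path, "rb") as f:
        elf = ELFFile(f)
        info = {
            "machine": elf.header["e_machine"],  # e.g. "EM_X86_64"
            "bits": elf.elfclass,                # 32 or 64
            "little_endian": elf.little_endian,
            "producer": None,
        }
        if elf.has_dwarf_info():
            cu = next(elf.get_dwarf_info().iter_CUs(), None)
            if cu is not None:
                attr = cu.get_top_DIE().attributes.get("DW_AT_producer")
                if attr is not None:
                    info["producer"] = attr.value.decode("utf-8", errors="replace")
    return info


if __name__ == "__main__":
    import sys

    facts = describe_binary(sys.argv[1])
    print(facts)
    if facts["producer"]:
        print(split_producer(facts["producer"]))
```

Different CUs in one binary can carry different producer strings, so checking only the first CU is a shortcut rather than a complete answer.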

The DW_AT_stmt_list attribute on the same DIE points to the CU's line program, which gives you the mapping between addresses in the CU and source lines. At the very least, you can match the filenames and line counts in the binary against the sources. For this, you'd have to iterate over all CUs, though.
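The per-CU walk could be sketched as follows with pyelftools. This assumes file indices in the line program start at 1 before DWARF 5 and at 0 from DWARF 5 on; the helper names are hypothetical. The highest line number seen per file is only a lower bound on that file's length, since trailing lines without code never appear in the line table.

```python
def max_line_per_file(pairs):
    """Reduce (filename, line) pairs to {filename: highest line seen}."""
    result = {}
    for filename, line in pairs:
        if line > result.get(filename, 0):
            result[filename] = line
    return result


def iter_file_line_pairs(elf_path):
    """Yield (filename, line) for every line-table row in every CU."""
    # Imported lazily so max_line_per_file() can be used without pyelftools.
    from elftools.elf.elffile import ELFFile

    with open(elf_path, "rb") as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lineprog = dwarf.line_program_for_CU(cu)
            if lineprog is None:
                continue
            files = lineprog["file_entry"]
            # Assumption: 1-based file indices before DWARF 5, 0-based after.
            offset = 1 if lineprog["version"] < 5 else 0
            for entry in lineprog.get_entries():
                state = entry.state
                if state is None or state.end_sequence:
                    continue
                index = state.file - offset
                if 0 <= index < len(files):
                    name = files[index].name.decode("utf-8", errors="replace")
                    yield name, state.line


if __name__ == "__main__":
    import sys

    for name, line in sorted(max_line_per_file(iter_file_line_pairs(sys.argv[1])).items()):
        print(f"{name}: up to line {line}")
```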

I've seen cases where a binary contains a CU for a source that is not in the source tree; those were the RTL bits like crt0.c. So the condition "a source file name from the binary is not in the source tree" is not a negative indicator by itself.
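That rule can be expressed as a small comparison that treats only the matched names as evidence. A minimal sketch, assuming files are compared by base name only; `compare_file_sets` is a hypothetical helper, and the example file lists are illustrative.

```python
from pathlib import PurePosixPath


def compare_file_sets(binary_files, tree_files):
    """Return (matched, extra): names found in both, and names only in the binary.

    `extra` is reported but should not count against a match, because
    runtime pieces such as crt0.c may appear in the binary without
    being part of the source tree.
    """
    binary_names = {PurePosixPath(p).name for p in binary_files}
    tree_names = {PurePosixPath(p).name for p in tree_files}
    return binary_names & tree_names, binary_names - tree_names


if __name__ == "__main__":
    matched, extra = compare_file_sets(
        ["src/main.c", "src/buffer.c", "crt0.c"],  # illustrative names from a binary
        ["main.c", "buffer.c", "read.c"],          # illustrative names from a tree
    )
    print("matched:", sorted(matched))
    print("only in binary:", sorted(extra))
```

Comparing base names is a simplification: two source trees that both contain a main.c would need the directory part or the line counts to tell them apart.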

If I'm allowed to plug my own work, you can use DWARF Explorer to eyeball the whole DWARF tree - https://github.com/sevaa/dwex . See if it gives you any ideas. Good luck.

yui-ishihara commented 3 years ago

Thank you! I will take a look at your software and comment here if I find anything.

sevaa commented 4 months ago

@yui-ishihara is this still an issue?