eliben / pyelftools

Parsing ELF and DWARF in Python

Disagreement between objdump and pyelftools on DW_TAG_typedef::DW_AT_type #27

Open altendky opened 10 years ago

altendky commented 10 years ago

I am trying to use pyelftools to parse both TriCore and C166 ELF files (just TriCore for now) to get a list of variables and addresses, including structure members. The application will be a remote watch window for the embedded target. I have fiddled with the dwarf_die_tree.py example but ran into an issue where I am unable to connect typedefs to the unnamed structure definitions they reference. Or so it seems to my DWARF-ignorant brain (in case my previous comments did not in some way make that obvious).

To avoid any issues associated with my particular architecture ELF I also tried my script against test/testfiles_for_unittests/sample_exe64.elf and observed the same thing. When I run objdump -W as a reference I get, amongst other things:

<1><1d6>: Abbrev Number: 3 (DW_TAG_typedef)
    <1d7>   DW_AT_name        : (indirect string, offset: 0xcd): size_t
    <1db>   DW_AT_decl_file   : 2
    <1dc>   DW_AT_decl_line   : 214
    <1dd>   DW_AT_type        : <0x1e1>

My pyelftools script results in (again, just a snippet):

  DIE tag=DW_TAG_typedef
    Name: size_t
    Offset: 470
    File: 2
    Line: 214
    Type: 63
    Attributes: OrderedDict([('DW_AT_name', AttributeValue(name='DW_AT_name', form='DW_FORM_strp', value=b'size_t', raw_value=205, offset=471)), ('DW_AT_decl_file', AttributeValue(name='DW_AT_decl_file', form='DW_FORM_data1', value=2, raw_value=2, offset=475)), ('DW_AT_decl_line', AttributeValue(name='DW_AT_decl_line', form='DW_FORM_data1', value=214, raw_value=214, offset=476)), ('DW_AT_type', AttributeValue(name='DW_AT_type', form='DW_FORM_ref4', value=63, raw_value=63, offset=477))])
    DIE: ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_children', '_parent', '_parse_DIE', '_translate_attr_value', 'abbrev_code', 'add_child', 'attributes', 'cu', 'dwarfinfo', 'get_full_path', 'get_parent', 'has_children', 'is_null', 'iter_children', 'iter_siblings', 'offset', 'set_parent', 'size', 'stream', 'tag']

What seems to me to be an issue is that objdump shows DW_AT_type as 0x1e1 (481) as opposed to pyelftools which returns 63 (0x3f). The other values I have compared seem to correspond. Is this a simple lack of understanding of DWARF or a misuse of pyelftools, or is there an issue here? I started to dig in the code a bit but with my limited knowledge I didn’t find anything that looked glaringly wrong.

Here’s my system info (Python3 within Cygwin64 within Win7 64):

CYGWIN_NT-6.1 GUS-CZJCBS1 1.7.28(0.271/5/3) 2014-02-09 21:06 x86_64 Cygwin

Python 3.2.5 (default, Oct  2 2013, 22:58:11)
[GCC 4.8.1] on cygwin

Installed pyelftools from Git rev c9594acd0e1a1b87ab9a8b1de2b22c1411d617ff

Thank you for any time you choose to spend helping me. -kyle

eliben commented 10 years ago

I don't see your input file. Can you provide it and say explicitly where the mismatch is? Or can this be reproduced on one of my samples? If yes, can you specify exactly the steps?

altendky commented 10 years ago

Thanks for the reply and my apologies for the confusion. While I originally observed this on my own .ELF, the results I posted were from sample_exe64.elf.

cd pyelftools/test/testfiles_for_unittests/
curl <URL no longer valid > altendky.py
python3 altendky.py sample_exe64.elf | grep -B 1 -A 6 'Name: size_t'
objdump -W sample_exe64.elf | grep -B 1 -A 3 ': size_t'

(sorry, but that pasted code is lost and I no longer have the original...)

My results are:

$ python3 altendky.py sample_exe64.elf | grep -B 1 -A 6 'Name: size_t'
      DIE tag=DW_TAG_typedef
        Name: size_t
        Offset: 470
        File: 2
        Line: 214
        Type: 63
        Attributes: OrderedDict([('DW_AT_name', AttributeValue(name='DW_AT_name', form='DW_FORM_strp', value=b'size_t', raw_value=205, offset=471)), ('DW_AT_decl_file', AttributeValue(name='DW_AT_decl_file', form='DW_FORM_data1', value=2, raw_value=2, offset=475)), ('DW_AT_decl_line', AttributeValue(name='DW_AT_decl_line', form='DW_FORM_data1', value=214, raw_value=214, offset=476)), ('DW_AT_type', AttributeValue(name='DW_AT_type', form='DW_FORM_ref4', value=63, raw_value=63, offset=477))])
        DIE: ['__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_children', '_parent', '_parse_DIE', '_translate_attr_value', 'abbrev_code', 'add_child', 'attributes', 'cu', 'dwarfinfo', 'get_full_path', 'get_parent', 'has_children', 'is_null', 'iter_children', 'iter_siblings', 'offset', 'set_parent', 'size', 'stream', 'tag']
$ objdump -W sample_exe64.elf | grep -B 1 -A 3 ': size_t'
 <1><1d6>: Abbrev Number: 3 (DW_TAG_typedef)
    <1d7>   DW_AT_name        : (indirect string, offset: 0xcd): size_t
    <1db>   DW_AT_decl_file   : 2
    <1dc>   DW_AT_decl_line   : 214
    <1dd>   DW_AT_type        : <0x1e1>

I expect my script's 'Type:' value (and the 'DW_AT_type' value entry in the attributes) to match objdump's 'DW_AT_type' value (of course, accounting for the hex vs. decimal formatting difference).

Thanks again for your interest and hopefully this makes it reasonably straightforward for you to observe the difference I do.

Cheers, -kyle

eliben commented 10 years ago

@altendky thanks for the details. It may take some time for me to get to look at this issue, but when I do the extra details certainly help.

eliben commented 10 years ago

Unfortunately I don't have much time to fix these issues right now; if you could create a pull request with a fix, that would certainly make things easier.

altendky commented 10 years ago

I certainly understand that you would have other things to do but appreciate your interest. Honestly, I haven't even gotten back to the task where I was applying this at work. That said, I may be able to get into debugging today. We'll see if I can make it anywhere.

altendky commented 10 years ago

Wrong button :[ sorry.

altendky commented 10 years ago

0x1e1 (which objdump reports) is the stream offset (I think), while 0x3f (reported by pyelftools) is the offset from the beginning of the compilation unit, which starts at 0x1a2: 0x1a2 + 0x3f = 0x1e1. The form is being reported within pyelftools as DW_FORM_ref4. See the DWARF 2.0 standard page 69. Note that my commit covers DW_FORM_ref[1-4] but not DW_FORM_ref_udata or DW_FORM_ref_addr.

I will test this out in my application before submitting a pull request.
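
The offset arithmetic described in this comment can be checked with a few lines of Python. This is only a sketch: `absolute_die_offset` and `CU_RELATIVE_FORMS` are names made up here and are not part of pyelftools, and the set of CU-relative forms is an assumption based on the DWARF standard's description of reference forms, not on the commit mentioned above.

```python
# Hypothetical helper, not part of pyelftools: turn a reference attribute
# value into an offset from the start of .debug_info.

# Forms whose value is relative to the start of the containing CU
# (assumed from the DWARF standard's description of reference forms).
CU_RELATIVE_FORMS = {
    'DW_FORM_ref1', 'DW_FORM_ref2', 'DW_FORM_ref4',
    'DW_FORM_ref8', 'DW_FORM_ref_udata',
}


def absolute_die_offset(form, value, cu_offset):
    """Return the .debug_info offset that a reference attribute points at."""
    if form in CU_RELATIVE_FORMS:
        return cu_offset + value
    if form == 'DW_FORM_ref_addr':
        # Already an offset from the start of .debug_info.
        return value
    raise ValueError('not a reference form handled here: %s' % form)


# Numbers from this thread: the CU starts at 0x1a2, pyelftools reports 63.
resolved = absolute_die_offset('DW_FORM_ref4', 63, 0x1a2)
print(hex(resolved))  # 0x1e1
```

With the values from sample_exe64.elf this reproduces the 0x1e1 that objdump prints.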

JonathonReinhart commented 8 years ago

@altendky Does my issue #113 shed any light on your situation? I'm successfully getting the underlying type of a DW_TAG_typedef using the code shown there.

altendky commented 8 years ago

I have changed jobs since then, so I don't have the original file, and the site where I pasted the script has lost it, so I can't test it quickly myself. But at some point I will have a similar task in my new position. I'm not sure if we can get ELF files for our embedded code or not, but if we can I may come back to this.

Just looking over the code snippets and trying to refresh my memory, my commit 2a195dccd7c9e0458d82843780e8e71157763658 still seems relevant, since without it an incorrect value would apparently be returned for DW_FORM_ref[1-4]. Well, unless my patch was straight-up incorrect to begin with, though per the commit title it seemed to work.

Your code certainly may be good as well :] but it takes a bigger picture understanding than I presently have in my head to judge.

Regardless, thanks for the followup.

altendky commented 8 years ago

@JonathonReinhart Also, note that I was specifically having trouble with unnamed structures.

typedef struct {
    int myMember;
} MyStruct;

As opposed to named structures with a typedef.

struct MyStruct {
    int myMember;
};

typedef struct MyStruct MyStruct;

I didn't see any reference in your issue so I'm not sure which case you are working with.
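
A toy sketch of why the unnamed case is the awkward one: the structure DIE carries no DW_AT_name of its own, so the only usable name sits on the typedef that points at it. The dict layout and offsets below are invented for illustration (they are not pyelftools objects or real compiler output), and DW_AT_type is assumed to have already been resolved to an absolute offset.

```python
# Toy DIE table keyed by offset; the layout and offsets are hypothetical.
# DW_AT_type is assumed to be an already-resolved absolute offset.
dies = {
    0x2d: {'tag': 'DW_TAG_structure_type', 'attrs': {}},  # unnamed struct
    0x40: {'tag': 'DW_TAG_typedef',
           'attrs': {'DW_AT_name': 'MyStruct', 'DW_AT_type': 0x2d}},
}


def names_for_unnamed_structs(dies):
    """Map struct offset -> typedef name for structs lacking DW_AT_name."""
    names = {}
    for die in dies.values():
        if die['tag'] != 'DW_TAG_typedef':
            continue
        target_offset = die['attrs'].get('DW_AT_type')
        target = dies.get(target_offset)
        if (target is not None
                and target['tag'] == 'DW_TAG_structure_type'
                and 'DW_AT_name' not in target['attrs']):
            names[target_offset] = die['attrs']['DW_AT_name']
    return names


print(names_for_unnamed_structs(dies))  # {45: 'MyStruct'}
```

In the named-structure case the structure DIE would have its own DW_AT_name, so this lookup would not be needed.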

http://stackoverflow.com/a/1675446/228539

altendky commented 7 years ago

Well, here I am back on this task in my new job. First I will note that this is reproducible with the still active 'pastebin' link (http://tny.cz/c7174417). Also 'backed up' at https://gist.github.com/d92fb39a86bd278442f5933f04b540dd.

But!

It would seem that I was misusing the library. The value does need the offset applied, as in 2a195dccd7c9e0458d82843780e8e71157763658, to be useful, but this seems to be expected to be handled elsewhere in pyelftools. For textual output, describe_attr_value() is provided and does the translation.

DIE tag=DW_TAG_typedef
  Name: size_t
  Offset: 470
  Line: 214
  Type: 63
  describe_attr_value(Type): <0x1e1>
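
Output along these lines could be produced with something like the sketch below. It is not the lost script from this thread: `dump_typedef_types` and `as_text` are hypothetical names, and the way `section_offset` is obtained is copied from how pyelftools' own scripts/readelf.py appears to do it, so treat that part as an assumption to check against your installed version.

```python
def as_text(value):
    """Attribute strings may be bytes (as in the output above) or str."""
    if isinstance(value, bytes):
        return value.decode('utf-8', 'replace')
    return value


def dump_typedef_types(path):
    """Print DW_AT_type of every typedef, raw and via describe_attr_value."""
    # Imported here so as_text() stays usable where pyelftools is absent.
    from elftools.elf.elffile import ELFFile
    from elftools.dwarf.descriptions import describe_attr_value

    with open(path, 'rb') as f:
        elffile = ELFFile(f)
        if not elffile.has_dwarf_info():
            print('no DWARF info in', path)
            return
        dwarfinfo = elffile.get_dwarf_info()
        # Assumption: same section offset that scripts/readelf.py passes in.
        section_offset = dwarfinfo.debug_info_sec.global_offset
        for cu in dwarfinfo.iter_CUs():
            for die in cu.iter_DIEs():
                if die.tag != 'DW_TAG_typedef':
                    continue
                attrs = die.attributes
                if 'DW_AT_type' not in attrs or 'DW_AT_name' not in attrs:
                    continue
                type_attr = attrs['DW_AT_type']
                print('Name:', as_text(attrs['DW_AT_name'].value))
                print('  Type (.value):', type_attr.value)
                print('  describe_attr_value(Type):',
                      describe_attr_value(type_attr, die, section_offset))
```

Calling `dump_typedef_types('sample_exe64.elf')` from test/testfiles_for_unittests/ should list the size_t typedef among others.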

I hope my misunderstanding didn't waste too much of anyone's time over the past couple years...

altendky commented 7 years ago

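I'm going to leave that judgement to someone else, because it looks like .value vs. .raw_value may be relevant: going by the comments quoted below, .value is meant to be translated according to the form, so the reference probably should be resolved as soon as it is parsed.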

https://github.com/eliben/pyelftools/blob/2300c1ffbe0f12f8bcd8f68ff1b1c6bdd0258c73/elftools/dwarf/die.py#L26

# value:
#   The value parsed from the section and translated accordingly to the form
#   (e.g. for a DW_FORM_strp it's the actual string taken from the string table)
#
# raw_value:
#   Raw value as parsed from the section - used for debugging and presentation
#   (e.g. for a DW_FORM_strp it's the raw string offset into the table)
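
To make the distinction concrete, here is a small sketch that rebuilds two attributes of the size_t DIE printed earlier in this thread and compares .value with .raw_value. The namedtuple is a stand-in using the field names from that repr, not the library's own class.

```python
from collections import namedtuple

# Stand-in with the field names seen in the repr earlier in this thread.
AttributeValue = namedtuple(
    'AttributeValue', 'name form value raw_value offset')

name_attr = AttributeValue('DW_AT_name', 'DW_FORM_strp', b'size_t', 205, 471)
type_attr = AttributeValue('DW_AT_type', 'DW_FORM_ref4', 63, 63, 477)

for attr in (name_attr, type_attr):
    translated = attr.value != attr.raw_value
    print(attr.name, attr.form, 'translated' if translated else 'left raw')
# DW_AT_name DW_FORM_strp translated
# DW_AT_type DW_FORM_ref4 left raw
```

The string attribute has been looked up in the string table, while the reference attribute still holds the CU-relative number in both fields.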

@eliben, if you feel that this should be changed I can take a look at making the various other adjustments to 'fix the tests' (use .raw_value instead of .value or don't further translate it).