eliben / pycparser

:snake: Complete C99 parser in pure Python

offsetof parsing fails due to TYPEID as offsetof_member_designator #504

Open nxmaintainer opened 1 year ago

nxmaintainer commented 1 year ago

I'm parsing cpython/Objects/exceptions.c with pycparser==2.21 (from PyPI). The file was preprocessed into exceptions.i with `cpp -nostdinc -E -P -DPy_BUILD_CORE=1 -D_POSIX_THREADS=1`, plus the standard includes and fake_libc_include. Nothing special or tricky.

Fails in this block:

    static PyMemberDef UnicodeError_members[] = {
        {"encoding", 6, offsetof(PyUnicodeErrorObject, encoding), 0,       // <- parsed correctly
         "exception encoding"},
        {"object", 6, offsetof(PyUnicodeErrorObject, object), 0,           // <- fails
         "exception object"},
        {"start", 19, offsetof(PyUnicodeErrorObject, start), 0,
         "exception start"},
        {"end", 19, offsetof(PyUnicodeErrorObject, end), 0,
         "exception end"},
        {"reason", 6, offsetof(PyUnicodeErrorObject, reason), 0,
         "exception reason"},
        {0}
    };

Specifically, it fails on `offsetof(PyUnicodeErrorObject, object)` with `pycparser.plyparser.ParseError: :9792:46: before: object`.

It works perfectly fine if I rename the `object` field to anything else, or replace the `offsetof` call. According to the parser's debug output, the first `offsetof` in this block (the `encoding` field) and the next one (the `object` field) are parsed differently.

For `encoding`, take a look at `LexToken(ID,'encoding',9790,365338)` closer to the end:

```
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA . LexToken(OFFSETOF,'offsetof',9790,365307)
Action : Reduce rule [empty -> ] with [] and goto state 533
Result : (None)

State : 533
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA empty . LexToken(OFFSETOF,'offsetof',9790,365307)
Action : Reduce rule [designation_opt -> empty] with [None] and goto state 532
Result : (None)

State : 532
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt . LexToken(OFFSETOF,'offsetof',9790,365307)
Action : Shift and goto state 165

State : 165
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF . LexToken(LPAREN,'(',9790,365315)
Action : Shift and goto state 304

State : 304
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN . LexToken(TYPEID,'PyUnicodeErrorObject',9790,365316)
Action : Shift and goto state 35

State : 35
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN TYPEID . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [typedef_name -> TYPEID] with [] and goto state 31
Result : (IdentifierType(names=['PyUnicodeErrorObj ...)

State : 31
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN typedef_name . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [type_specifier -> typedef_name] with [] and goto state 212
Result : (IdentifierType(names=['PyUnicodeErrorObj ...)

State : 212
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_specifier . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [specifier_qualifier_list -> type_specifier] with [] and goto state 216
Result : ({'qual': [], 'storage': [], 'type': [Ide ...)

State : 216
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [empty -> ] with [] and goto state 320
Result : (None)

State : 320
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list empty . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [abstract_declarator_opt -> empty] with [None] and goto state 350
Result : (None)

State : 350
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list abstract_declarator_opt . LexToken(COMMA,',',9790,365336)
Action : Reduce rule [type_name -> specifier_qualifier_list abstract_declarator_opt] with [,None] and goto state 438
Result : (Typename(name=None,quals=[],align=None,t ...)

State : 438
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name . LexToken(COMMA,',',9790,365336)
Action : Shift and goto state 507

State : 507
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA . LexToken(ID,'encoding',9790,365338)
Action : Shift and goto state 159

State : 159
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA ID . LexToken(RPAREN,')',9790,365346)
Action : Reduce rule [identifier -> ID] with ['encoding'] and goto state 541
Result : (ID(name='encoding'))

State : 541
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA identifier . LexToken(RPAREN,')',9790,365346)
Action : Reduce rule [offsetof_member_designator -> identifier] with [] and goto state 540
Result : (ID(name='encoding'))

State : 540
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA offsetof_member_designator . LexToken(RPAREN,')',9790,365346)
Action : Shift and goto state 563

State : 563
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA offsetof_member_designator RPAREN . LexToken(COMMA,',',9790,365347)
Action : Reduce rule [primary_expression -> OFFSETOF LPAREN type_name COMMA offsetof_member_designator RPAREN] with ['offsetof','(',,',',,')'] and goto state 158
Result : (FuncCall(name=ID(name='offsetof'),args=E ...)

State : 158
```
For `object`, take a look at `LexToken(TYPEID,'object',9792,365420)` in the same position where `encoding` has `ID` instead:

```
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA . LexToken(OFFSETOF,'offsetof',9792,365389)
Action : Reduce rule [empty -> ] with [] and goto state 533
Result : (None)

State : 533
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA empty . LexToken(OFFSETOF,'offsetof',9792,365389)
Action : Reduce rule [designation_opt -> empty] with [None] and goto state 532
Result : (None)

State : 532
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt . LexToken(OFFSETOF,'offsetof',9792,365389)
Action : Shift and goto state 165

State : 165
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF . LexToken(LPAREN,'(',9792,365397)
Action : Shift and goto state 304

State : 304
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN . LexToken(TYPEID,'PyUnicodeErrorObject',9792,365398)
Action : Shift and goto state 35

State : 35
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN TYPEID . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [typedef_name -> TYPEID] with [] and goto state 31
Result : (IdentifierType(names=['PyUnicodeErrorObj ...)

State : 31
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN typedef_name . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [type_specifier -> typedef_name] with [] and goto state 212
Result : (IdentifierType(names=['PyUnicodeErrorObj ...)

State : 212
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_specifier . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [specifier_qualifier_list -> type_specifier] with [] and goto state 216
Result : ({'qual': [], 'storage': [], 'type': [Ide ...)

State : 216
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [empty -> ] with [] and goto state 320
Result : (None)

State : 320
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list empty . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [abstract_declarator_opt -> empty] with [None] and goto state 350
Result : (None)

State : 350
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN specifier_qualifier_list abstract_declarator_opt . LexToken(COMMA,',',9792,365418)
Action : Reduce rule [type_name -> specifier_qualifier_list abstract_declarator_opt] with [,None] and goto state 438
Result : (Typename(name=None,quals=[],align=None,t ...)

State : 438
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name . LexToken(COMMA,',',9792,365418)
Action : Shift and goto state 507

State : 507
Stack : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA . LexToken(TYPEID,'object',9792,365420)
ERROR: Error : translation_unit declaration_specifiers declarator EQUALS brace_open initializer_list COMMA designation_opt brace_open initializer_list COMMA designation_opt OFFSETOF LPAREN type_name COMMA .
```

I think I roughly understand the issue: `object` is being lexed as TYPEID for some reason (I checked and found no `object` type defined or declared in the preprocessed file). It therefore doesn't match the offsetof_member_designator rule, which requires identifier (that is, ID), and the primary OFFSETOF expression fails. I even have a dirty fix, like this:

    def p_offsetof_identifier(self, p):
        """ offsetof_identifier  : ID
                                   | TYPEID
        """
        p[0] = c_ast.ID(p[1], self._token_coord(p, 1))

    def p_offsetof_member_designator(self, p):
        """ offsetof_member_designator : offsetof_identifier
                                         | offsetof_member_designator PERIOD offsetof_identifier
                                         | offsetof_member_designator LBRACKET expression RBRACKET
        """
        if len(p) == 2:
            p[0] = p[1]
        ...

But I don't think this is the correct approach; the issue looks deeper (`object` shouldn't be a TYPEID in this context in the first place, should it?). @eliben / @Ksero, I'd really appreciate it if you could point me to a better solution. I'd be happy to contribute.
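
To make the suspected mechanism concrete, here is a toy model of a type-aware lexer. It is not pycparser's code: `ScopedTypeMap` and `classify` are hypothetical names, and the only assumption carried over from the discussion is that the lexer consults a scoped type map and ignores grammatical position when choosing between ID and TYPEID.

```python
class ScopedTypeMap:
    """Toy stand-in for a parser's typedef-tracking scope stack."""

    def __init__(self):
        self._scopes = [{}]  # innermost scope is the last element

    def push_scope(self):
        self._scopes.append({})

    def pop_scope(self):
        self._scopes.pop()

    def add_typedef(self, name):
        self._scopes[-1][name] = True

    def add_identifier(self, name):
        # An ordinary declaration shadows an outer typedef of the same name.
        self._scopes[-1][name] = False

    def is_type(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        return False


def classify(name, types):
    """Return the token kind a type-aware lexer would emit for `name`."""
    return "TYPEID" if types.is_type(name) else "ID"


types = ScopedTypeMap()
types.add_typedef("PyUnicodeErrorObject")
print(classify("encoding", types))  # ID: never registered as a type
types.add_typedef("object")         # hypothetical: something registered it
print(classify("object", types))    # TYPEID, even in member-name position
```

In this model the member name after the comma in `offsetof(T, object)` comes out as TYPEID as soon as anything in an enclosing scope has registered `object` as a type, which matches the symptom in the second trace.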

P.S. Please use exceptions.i for tests. I tried to make a smaller reproducible sample, but it parses just fine.

eliben commented 1 year ago

Thanks for the detailed report.

To help further narrow down the issue you can insert a printout (or a stack trace) where CParser adds object to the type map (from which point on it considers it a TYPEID) - this can tell us why it thinks it's a pre-declared type.
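
One way to get such a printout without editing the installed package is to monkey-patch the method that registers type names. The sketch below is generic: `trace_calls_for` is a hypothetical helper, and `FakeParser` is a stand-in so the example runs without pycparser installed.

```python
import functools
import traceback


def trace_calls_for(cls, method_name, watched_name):
    """Patch cls.<method_name> so it dumps a stack trace whenever its first
    positional argument equals watched_name. Returns the original method so
    the patch can be undone."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, name, *args, **kwargs):
        if name == watched_name:
            print(f"{method_name}({name!r}) called with {args} {kwargs}")
            traceback.print_stack()  # written to stderr
        return original(self, name, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original


# Stand-in class (not pycparser) with a method of the assumed shape.
class FakeParser:
    def __init__(self):
        self.types = {}

    def _add_typedef_name(self, name, coord):
        self.types[name] = coord


trace_calls_for(FakeParser, "_add_typedef_name", "object")
parser = FakeParser()
parser._add_typedef_name("size_t", "file.i:1:1")   # silent
parser._add_typedef_name("object", "file.i:42:7")  # prints a stack trace
```

With pycparser the equivalent call would presumably be `trace_calls_for(pycparser.c_parser.CParser, "_add_typedef_name", "object")` before parsing exceptions.i. That method name is a private detail assumed from the 2.21 source, so check it against the installed version. The coordinate argument in the printout should point at the declaration that introduced `object` as a type.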