eliben / pycparser

:snake: Complete C99 parser in pure Python

Constant string concatenation #546

Open Llewyllen opened 3 months ago

Llewyllen commented 3 months ago

The following valid C99 code

char test()
{
  char* tmp = "\07""7";
  return tmp[0];
}

is parsed incorrectly: pycparser returns a c_ast.Constant object with value '\077', which is a different string. The same happens with hexadecimal escapes.

The easy solution is to modify CParser.p_unified_string_literal, replacing

p[1].value = p[1].value[:-1] + p[2][1:]

with

p[1].value = p[1].value + p[2]

because simply removing the double quotes is not a good idea. The modified code would return the value '\07""7', which is better but still has to be parsed to recover each character.
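To make the difference concrete, here is a minimal sketch of the two joining strategies on plain Python strings. The function names are hypothetical and are not part of pycparser; they only mirror the two assignments quoted above, with `first` standing in for `p[1].value` and `second` for `p[2]`.

```python
def join_dropping_quotes(first, second):
    """Mirror of the current behaviour: drop the closing quote of the
    first literal and the opening quote of the second, then concatenate."""
    return first[:-1] + second[1:]


def join_keeping_quotes(first, second):
    """Mirror of the proposed behaviour: keep both literals intact so the
    boundary between them is still visible."""
    return first + second


if __name__ == "__main__":
    a, b = r'"\07"', r'"7"'
    # The escape \07 and the plain character 7 merge into the escape \077.
    print(join_dropping_quotes(a, b))
    # The boundary survives, so \07 and 7 can still be told apart.
    print(join_keeping_quotes(a, b))
```

With the first function the literal boundary is lost, so a later consumer cannot tell `\07` followed by `7` from the single escape `\077`. With the second, the consumer has to split on the inner quotes before decoding.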

Another solution would be to store a list of strings as the value, but that would have a much larger impact on other parts of the code (such as the generator).

eliben commented 3 months ago

I don't understand the issue. The following C program prints abcxyz, according to the standard:

#include <stdio.h>

int main() {
  char* str = "abc""xyz";
  printf("%s\n", str);
  return 0;
}

Can you clarify what pycparser is doing wrong, in your opinion?

Llewyllen commented 3 months ago

For octal:
"\07""7" is a 3-byte string composed of 0x07 (octal value 7), 0x37 (character '7') and 0x00 (string terminator).
"\077" is a 2-byte string composed of 0x3F (octal value 77) and 0x00.

For hexadecimal:
"\x7""7" is a 3-byte string composed of 0x07, 0x37 and 0x00.
"\x77" is a 2-byte string composed of 0x77 and 0x00.

So if you simply remove consecutive double quotes (which is what pycparser does), you get the wrong value:

char test1()
{
  char* tmp = "\07""7";
  return tmp[0];
}

char test2()
{
  char* tmp = "\077";
  return tmp[0];
}

These two functions do not return the same value: the first returns 0x07, the second returns 0x3F.
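The byte counts above can be checked with a small decoder. This is only a sketch, assuming narrow (non-wide) literals and the usual C escape rules (octal escapes take at most three digits, hex escapes take as many hex digits as follow). The names `decode_literal` and `decode_adjacent` are hypothetical and not pycparser functions. The key point is that each literal is decoded on its own before the results are joined.

```python
import re

# One token inside a literal body: an octal escape (1 to 3 digits), a hex
# escape (any number of hex digits), a simple escape, or an ordinary character.
_TOKEN = re.compile(r'\\([0-7]{1,3})|\\x([0-9a-fA-F]+)|\\(.)|([^\\])', re.DOTALL)

_SIMPLE = {'a': 7, 'b': 8, 'f': 12, 'n': 10, 'r': 13, 't': 9, 'v': 11,
           '\\': 92, "'": 39, '"': 34, '?': 63}


def decode_literal(literal):
    """Decode one quoted literal such as r'"\07"' into bytes, without the
    terminating NUL. Escape values above 0xFF raise ValueError."""
    body = literal[1:-1]  # strip the surrounding double quotes
    out = bytearray()
    for octal, hexa, simple, plain in _TOKEN.findall(body):
        if octal:
            out.append(int(octal, 8))
        elif hexa:
            out.append(int(hexa, 16))
        elif simple:
            out.append(_SIMPLE[simple])
        else:
            out.extend(plain.encode('utf-8'))
    return bytes(out)


def decode_adjacent(literals):
    """Decode each adjacent literal separately, join the results, and
    append the terminating NUL."""
    return b''.join(decode_literal(lit) for lit in literals) + b'\x00'


if __name__ == "__main__":
    print(decode_adjacent([r'"\07"', r'"7"']).hex())  # two literals
    print(decode_adjacent([r'"\077"']).hex())         # one literal
```

Decoding per literal is what keeps `\07` from absorbing the `7` that follows it in the next literal.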

eliben commented 3 months ago

Ah, so it's specific to octal and hex, then... PR to fix welcome, though it has to handle all cases of string literal concatenation properly

Llewyllen commented 3 months ago

As I said, there are not that many solutions

so I won't do a PR, as there is no ideal solution

Well, I did create a PR, not sure it will pass the tests (but it works for my needs)

From what I saw, it will not pass the test_unified_string_literals test, but then, this test is rather wrong because string concatenation is not as simple as removing consecutive double quotes.

I could add the test

d7 = self.get_decl_init(r'char* s = "\07" "7";')
self.assertNotEqual(d7, ['Constant', 'string', r'"\077"'])

and the current version would fail

I just saw that p_unified_wstring_literal has the same problem, but I won't put my hand in the widechar trap