erocarrera / pefile

pefile is a Python module to read and work with PE (Portable Executable) files
MIT License

Enhancement / Question: How 4-byte utf-16 characters are handled in VersionInfo #435

Open maxzhenzhera opened 1 week ago

maxzhenzhera commented 1 week ago

Given

  1. Strings in VersionInfo are encoded as UTF-16-LE.
  2. To parse a string in VersionInfo, get_string_u_at_rva is used: https://github.com/erocarrera/pefile/blob/4b3b1e2e568a88d4f1897d694d684f23d9e270c4/pefile.py#L6476-L6517
  3. In the part that does the "decoding", the data is handled in 2-byte chunks: https://github.com/erocarrera/pefile/blob/4b3b1e2e568a88d4f1897d694d684f23d9e270c4/pefile.py#L6510-L6512

Problem

Therefore, if a VersionInfo string contains a 4-byte UTF-16 character (one encoded as a surrogate pair), it will not be handled properly: it will end up as two separate, forcibly cast Unicode characters.
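To illustrate the concern, here is a minimal sketch in plain Python. It does not call pefile; `decode_per_unit` is a hypothetical helper that only mimics the general idea of turning each 2-byte chunk into its own character, and the sample string is made up for the example.

```python
import struct

# U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair:
# two 16-bit code units, i.e. 4 bytes.
raw = "A\U0001F600".encode("utf-16-le")


def decode_per_unit(data: bytes) -> str:
    """Hypothetical per-chunk decoding: every 16-bit unit becomes one character."""
    units = struct.unpack("<%dH" % (len(data) // 2), data)
    return "".join(chr(u) for u in units)


naive = decode_per_unit(raw)       # surrogate halves stay as separate characters
proper = raw.decode("utf-16-le")   # the codec joins the surrogate pair

print(len(naive), [hex(ord(c)) for c in naive])
print(len(proper), [hex(ord(c)) for c in proper])

# A string holding lone surrogates can be repaired after the fact by
# round-tripping through UTF-16 with the "surrogatepass" error handler.
repaired = naive.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
```

In this sketch the per-chunk result has three characters (`A` plus two lone surrogates), while decoding the whole buffer with the `utf-16-le` codec gives the expected two. Decoding the full byte range at once would be one possible direction for a fix, assuming the string length in bytes is known up front.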

Question

Am I wrong, or am I missing something? Or should this be fixed in pefile?

I understand that characters taking 4 bytes are probably not encountered often. But at the end of the day, they are not handled.