erocarrera / pefile

pefile is a Python module to read and work with PE (Portable Executable) files
MIT License
1.83k stars 515 forks source link

Wrong assumption while parsing rich header #394

Open plusvic opened 4 months ago

plusvic commented 4 months ago

pefile assumes that the 12 bytes following the DanS tag are actually three copies of the 32-bits XOR key used for encrypting the rich header, and it does some validation based in this assumption:

https://github.com/erocarrera/pefile/blob/d86b96e4b33c41ebd1db7ca5b4517e94d567a260/pefile.py#L3429-L3430

The origin of this assumption seems to be this article: http://www.ntcore.com/files/richsign.htm, and YARA had this same issue. However, while rewriting the parsing logic in YARA-X I came across this other article that suggests that those 12-bytes are actually padding: https://bytepointer.com/articles/the_microsoft_rich_header.htm.

In YARA-X I treated those bytes as padding, and now, while comparing YARA and YARA-X results side-by-side, I found that file 043066108b68b30fc2c475eae8edfafc080be7d451600eaa283d2c750bddbceb proves that the padding theory is correct. The padding in that file was not initially filled with zeroes, therefore the 12 bytes after the DanS tag don't contain copies of the XOR key.

I had to fix this same issue in YARA: https://github.com/VirusTotal/yara/commit/4793b49dd79387aa4f2de98bef6cae8d43128d1c

j-t-1 commented 1 month ago
def parse_rich_header(self):
    """Parses the rich header
    see http://www.ntcore.com/files/richsign.htm for more information

    Structure:
    00 DanS ^ checksum, checksum, checksum, checksum
    10 Symbol RVA ^ checksum, Symbol size ^ checksum...
    ...
    XX Rich, checksum, 0, 0,...
    """

All bytes between "DanS" (inclusive) and "Rich" (exclusive) are XORed with the checksum, but nothing else is, in particular the padding after the checksum is not.

The checksum should probably be obtained as the four bytes after "Rich"; then the following better check could be done:

if data[0] ^ checksum != DANS or data[1] != checksum or data[2] != checksum or data[3] != checksum:
    return None

The Rich header is not documented, but Microsoft being the originator does use the above format, so it is useful. I have done a PR #402 to make it a warning.

Also, is the file you reference WIMA_SFX.EXE?

plusvic commented 1 month ago

@j-t-1 yes, the file 043066108b68b30fc2c475eae8edfafc080be7d451600eaa283d2c750bddbceb is WIMA_SFX.EXE.

j-t-1 commented 1 month ago

Thanks. I think #402 will fix this, as it takes the key directly from the key portion of the rich header, thus not relying on the padding being zeroes. Does #402 make sense in general and also meet your specific needs?

plusvic commented 1 month ago

I think #402 is ok.