coldfusion39 / excel-press

Python script to compress VBA macro files
MIT License

compression bug #4

Open Beakerboy opened 1 year ago

Beakerboy commented 1 year ago

Page 110 of the MS-OVBA spec gives three example test cases. Two of them pass and one fails:

def test_normal_compression():
    # One of the three example cases from the MS-OVBA spec
    data = b'#aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa'
    comp = CompressedVBA(data)
    expected = bytearray(b'\x01\x2F\xB0\x00\x23\x61\x61\x61\x62\x63\x64\x65\x82\x66\x00\x70\x61\x67\x68\x69\x6A\x01\x38\x08\x61\x6B\x6C\x00\x30\x6D\x6E\x6F\x70\x06\x71\x02\x70\x04\x10\x72\x73\x74\x75\x76\x10\x77\x78\x79\x7A\x00\x3C')
    assert comp.compress() == expected

The failure message is:

>       assert comp.compress() == expected
E       AssertionError: assert bytearray(b'\...x10wxyz\x00,') == bytearray(b'\...x10wxyz\x00<')
E         At index 28 diff: 32 != 48
E         Full diff:
E         - bytearray(b'\x01/\xb0\x00#aaabcde\x82f\x00paghij\x018\x08akl\x000mnop\x06q\x02'
E         ?                                                                 ^
E         + bytearray(b'\x01/\xb0\x00#aaabcde\x82f\x00paghij\x018\x08akl\x00 mnop\x06q\x02'
E         ?                                                                 ^
E         -           b'p\x04\x10rstuv\x10wxyz\x00<',
E         ?                    ^                  ^
E         +           b'p\x04\x00rstuv\x10wxyz\x00,',
E         ?                    ^                  ^
E           )

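It looks like three copy tokens differ: at indexes 27-28 the bytes are \x00\x20 instead of \x00\x30 (the 32 != 48 above is decimal), in the middle \x04\x00 instead of \x04\x10, and at the end \x00\x2C instead of \x00\x3C.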
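
To see what the differing bytes mean, the tokens can be decoded by hand. This is a minimal sketch based on my reading of the copy-token unpacking rules in MS-OVBA. The helper name `unpack_copy_token` is hypothetical, and the positions 24, 37 and 53 (bytes already decompressed when each token is read) were worked out by walking the stream manually, so treat them as assumptions.

```python
def unpack_copy_token(token: int, position: int) -> tuple:
    """Return (offset, length) for a 16-bit copy token.

    `position` is the number of bytes already decompressed in the current
    chunk (assumed to be >= 1, since a chunk cannot start with a copy token).
    """
    # ceil(log2(position)) via integer arithmetic, with a minimum of 4 bits
    bit_count = max((position - 1).bit_length(), 4)
    length_mask = 0xFFFF >> bit_count
    offset_mask = ~length_mask & 0xFFFF
    length = (token & length_mask) + 3
    offset = ((token & offset_mask) >> (16 - bit_count)) + 1
    return offset, length


# (assumed position, bytes in the spec example, bytes from the failing run)
cases = [
    (24, b"\x00\x30", b"\x00\x20"),
    (37, b"\x04\x10", b"\x04\x00"),
    (53, b"\x00\x3C", b"\x00\x2C"),
]
for position, spec_bytes, actual_bytes in cases:
    spec = unpack_copy_token(int.from_bytes(spec_bytes, "little"), position)
    actual = unpack_copy_token(int.from_bytes(actual_bytes, "little"), position)
    print(f"pos {position}: spec example {spec}, failing run {actual}")
```

Under these assumptions each pair decodes to the same length with a different offset (7 vs 5, 5 vs 1, and 16 vs 12), so only the choice of match position differs.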

Beakerboy commented 1 year ago

Welp... I implemented the algorithm completely independently of yours and got exactly the same results. It looks like Microsoft might not be following their own docs. The matching function in an actual Office application does not always choose the first match; it sometimes picks a different offset with the same length. Maybe the first match isn't "optimal" for some reason.
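
One way to check the "different offset, same length" observation is to list every candidate that reaches the longest match length at a given position. The sketch below assumes the spec's matching loop walks backwards from the nearest candidate and only replaces the best match on a strictly longer one. The function name `longest_matches` is hypothetical, the data is treated as a single chunk, and the cap on maximum token length is ignored.

```python
def longest_matches(data: bytes, position: int):
    """Return (best_length, offsets) for all candidates reaching best_length.

    Offsets are listed nearest first, i.e. in the order a backwards scan
    from `position - 1` would visit them.
    """
    best_length = 0
    offsets = []
    for candidate in range(position - 1, -1, -1):
        c, d, length = candidate, position, 0
        # Extend the match; the source may overlap the bytes being encoded.
        while d < len(data) and data[d] == data[c]:
            c, d, length = c + 1, d + 1, length + 1
        if length > best_length:
            best_length, offsets = length, [position - candidate]
        elif length == best_length and length > 0:
            offsets.append(position - candidate)
    return best_length, offsets


data = b"#aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa"
for position in (24, 37, 53):
    length, offsets = longest_matches(data, position)
    print(f"pos {position}: length {length}, offsets {offsets}")
```

With this scan, the nearest candidate comes first in each list, and the offsets used in the spec example (7, 5 and 16) appear further along the same lists. They are not simply the farthest candidates either, so no tie-breaking rule is claimed here.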

Regardless, despite compressing differently than MS, it decompresses back to the original bytes, and the MS output also decompresses back to the same bytes, so this may not be a bug.
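
The round-trip claim can be checked with a small decompressor. This is a sketch under my reading of the MS-OVBA container layout (a 0x01 signature byte, then chunks with a 2-byte little-endian header and flag bytes read least-significant bit first). The function name `decompress` is hypothetical, validation is minimal, and padding in raw chunks is not stripped. The second stream is rebuilt from the pytest diff by patching the three differing bytes.

```python
def decompress(container: bytes) -> bytes:
    """Decompress an MS-OVBA compressed container (sketch, little validation)."""
    if container[0] != 0x01:
        raise ValueError("missing 0x01 signature byte")
    out = bytearray()
    pos = 1
    while pos < len(container):
        header = int.from_bytes(container[pos:pos + 2], "little")
        # The 12-bit size field stores the chunk size (header included) minus 3.
        chunk_end = min(pos + (header & 0x0FFF) + 3, len(container))
        is_compressed = header >> 15
        pos += 2
        chunk_start = len(out)
        if not is_compressed:
            out += container[pos:chunk_end]  # raw chunk, copied as-is
            pos = chunk_end
            continue
        while pos < chunk_end:
            flags = container[pos]
            pos += 1
            for bit in range(8):
                if pos >= chunk_end:
                    break
                if not (flags >> bit) & 1:
                    out.append(container[pos])  # literal byte
                    pos += 1
                    continue
                token = int.from_bytes(container[pos:pos + 2], "little")
                pos += 2
                difference = len(out) - chunk_start
                bit_count = max((difference - 1).bit_length(), 4)
                length = (token & (0xFFFF >> bit_count)) + 3
                offset = (token >> (16 - bit_count)) + 1
                for _ in range(length):  # byte-wise so overlapping copies work
                    out.append(out[-offset])
    return bytes(out)


data = b"#aaabcdefaaaaghijaaaaaklaaamnopqaaaaaaaaaaaarstuvwxyzaaa"
spec_stream = (
    b"\x01\x2F\xB0\x00\x23\x61\x61\x61\x62\x63\x64\x65\x82\x66\x00\x70"
    b"\x61\x67\x68\x69\x6A\x01\x38\x08\x61\x6B\x6C\x00\x30\x6D\x6E\x6F"
    b"\x70\x06\x71\x02\x70\x04\x10\x72\x73\x74\x75\x76\x10\x77\x78\x79"
    b"\x7A\x00\x3C"
)
# Stream from the failing run: same bytes except at indexes 28, 38 and 50.
actual_stream = bytearray(spec_stream)
actual_stream[28], actual_stream[38], actual_stream[50] = 0x20, 0x00, 0x2C

print(decompress(spec_stream) == data)
print(decompress(bytes(actual_stream)) == data)
```

Both streams should decode to the original input, which supports treating the difference as an encoder choice rather than a correctness bug.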