Matmaus / LnkParse3

Windows Shortcut file (LNK) parser
MIT License
63 stars 13 forks source link

Extraction of data following the Extra Data/Terminal Block #27

Closed gdesmar closed 4 months ago

gdesmar commented 5 months ago

Hi, I was looking at some new problems that I have with parsing Lnk files, and realized that I should have tried to get one of those branches merged before. I came across Lnk files that are tacking additional data at the very end of the file, after the Terminal Block. I created my own fork of LnkParse3 and had my way of handling it, and I saw that user wmetcalf also did. We took slightly different approaches, and I was wondering if you had a preference. I would be happy to clean my repository, or make a totally new branch to be able to open a proper pull request if you are interested. For reference, this is an example of a malicious file that would benefit from being able to get the content at the end of the normal structure. Just making sure again, please be aware that this file is malicious. :) https://bazaar.abuse.ch/sample/082d5935271abf58419fb5e9de83996bd2f840152de595afa7d08e4b98b1d203

If you are looking at the git diff from my repository, you will see that I also added handling for UnknownExtra. I see that you since added handling and a warning in extra_factory.py, but I think it would be useful to be able to access that data. For reference, this LNK file has a zip file in such undefined Extra data.

Thank you for your time and the awesome library!

Matmaus commented 5 months ago

Hi, firstly, I am very happy to see that people use the library, secondly, I am definitelly interested. Any improvement is welcome :slightly_smiling_face:. Let me check this issue more this Sunday.

Matmaus commented 4 months ago

Thanks for a nicely formed issue's description. I have checked both forks and here are mi thoughts.

Unknown extra block

I like your handling of unknown blocks. If we would be able to process them, it would be great.

Regarding the future PR, I would suggest to name the class just Unknown and set its name to UNKNOWN_BLOCK to be consistent with other blocks. Also, there should be some comment that this block is not one of blocks defined in the documentation [MS-SHLLINK] rather a special handler for a really unknown blocks. Also, note that your current implementation does not print data from unknown block due to missing code inside as_dict method.

Data after Trailing block

I am not really sure here, but I prefer wmetcalf's implementation a little more. I like that there is no need to handle size in ExtraData (even it could be useful probably...).

Regarding the future PR, I would suggest to name the class TerminalBlock and set its name to TERMINAL_BLOCK (i.e. as wmetcalf did). But instead of handling this block in a special way, I would suggest to add four new signatures to EXTRA_SIGS based on the offical documentation

TerminalBlock (4 bytes): A 32-bit, unsigned integer that indicates the end of the extra data section. This value MUST be less than 0x00000004

i.e. signatures would be 0x00000000, 0x00000001, 0x00000002, and 0x00000003. Then there would not be need for a spacial handling in ExtraFactory.extra_class I think. Also, handling in ExtraData._iter_ could be simplied by refactoring. Even though TerminalBlock is metioned in the documentation [MS-SHLLINK], it is not expected it will have any data, so there is also space for an explaining comment.

gdesmar commented 4 months ago

I was able to get back into the code. It had been a while that I looked at it. Here are a few things that I wasn't sure: 1) I did not add the full data to the as_dict method because the data can be a binary. It would create a very cluttered output if I added the full binary zip file to the json. For the TerminalBlock, I plan to overwrite the size key of the as_dict from the parent, so that it reflects the length of the appended_data instead of zero.

2) I took the liberty to rename the content of TerminalBlock as appended_data instead of extra_garbage. 😅

3) Would you prefer TerminalBlock or Terminal? None of the other class name or file name contain the word "block", it only shows up in their name() function. I was going to create unknown.py -> Unknown(LnkExtraBase) and terminal.py -> Terminal(LnkExtraBase).

4) Based on the current implementation of the as_dict() function, if we encounter multiple Unknown Blocks, the dict will only contain the last one. On the plus side, the iter will return all of them. I'll add a note of it in the code, but that can be an improvement for later.

5) I tried to add the signatures for 0x00, 0x01, 0x02 and 0x03 in EXTRA_SIGS, but the issue is that the signature for all other blocks is found following the size (because of the endianness) so in the case of the terminal block, those bytes can be anything, as the first four bytes (what is usually known as the size) are the ones that are 0x00, 0x01, 0x02 or 0x03. We would still need special handling in ExtraFactory.extra_class. That also open the door to having a falsely (?) detected Terminal block in the middle of other blocks. I am not sure if we could have an Extra block with a size of 0x03, which would be ambiguously a Terminal Block. I think it's safer to just check once in ExtraData._iter at the end, if we need to use a Terminal Block structure.

6) In terminal.py, my documentation says BlockSignature == 0x00000000 - 0x00000003, and I was wondering, based on endianness, if it was wrong and should be BlockSignature == 0x00000000 - 0x03000000? Same question would apply, but with more side effects in ExtraData._iter() with if len(rest) > 4 and rest[:4] < b"\x00\x00\x00\x04":, if that check is wrong based on the endianness. I modified it so that it would cover more than only 0x00, but it may cover WAY more now. 🤨

I believe I reduced the amount of changes to the bare minimum, and have neither the size property from the ExtraFactory (wmetcalf) or the ExtraData (mine). I'm certain that this message is now longer than the code change so I just uploaded a new branch and opened a PR, so that you can see what it looks like.

Matmaus commented 4 months ago

I'm certain that this message is now longer than the code change so I just uploaded a new branch and opened a https://github.com/Matmaus/LnkParse3/pull/28, so that you can see what it looks like.

It clearly is :smile: but it really helped by to see all the difficulties I did not see before. Let's continue with discussion in the PR, I'll answer all your questions there.