extract_unencoded_url is too greedy when parsing Windows command lines

0x4d4c commented 1 year ago

I'm parsing input that contains PowerShell and cmd.exe command lines. When a command flag that starts with a slash follows a URL, the flag is included in the extracted URL.

Here is an example:

list(iocextract.extract_unencoded_urls("command.exe https://pypi.org/project/iocextract/ /f"))
  # => ['https://pypi.org/project/iocextract/ /f']

The trailing /f should not be included in the extracted URL.

DragonistYJ commented 1 year ago

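Hello, I have received your weekly report. The deadline for weekly reports is 8 PM every Tuesday, and reports will no longer be accepted after that time. Please send your weekly report on time. Thank you!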

battleoverflow commented 1 year ago

Hi, @0x4d4c!

I think I was able to fix the issue in a way that shouldn't disrupt normal extraction. I added a new regular expression that is applied when the strip parameter is set. You can see an example of my solution below. Since most URLs do not contain whitespace, the new code cuts off anything that matches the pattern whitespace + /\ + character, so a URL like https://example.com/f, which contains no whitespace, should still be extracted in full.

If you run into any issues, feel free to let me know. I'll ping you when a new version is available from PyPI so you can test out this new addition.

Example:

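import iocextract

def locate_url():
    # Raw string so the backslashes in the sample flags are kept literally
    # (in a plain string, "\a" would become the BEL control character).
    data = r"command.exe https://pypi.org/project/iocextract/ /f /n /a \s ///xhh /no \\\\f /d \a"
    # strip=True turns on the stripping of trailing command flags.
    return list(iocextract.extract_unencoded_urls(data, strip=True))

print(locate_url()) # => ['https://pypi.org/project/iocextract/']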
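
For illustration, the stripping idea described above can be sketched with only the standard re module. This is a minimal sketch and not the actual iocextract code: the pattern TRAILING_FLAG and the helper strip_trailing_flags are hypothetical names, and the exact regular expression used in the release may differ.

```python
import re

# Hypothetical pattern (an assumption, not taken from iocextract):
# whitespace, then one or more slashes or backslashes, then a word character.
TRAILING_FLAG = re.compile(r"\s+[/\\]+\w")

def strip_trailing_flags(candidate: str) -> str:
    """Cut a greedy URL match at the first thing that looks like a flag."""
    # Everything from the first match onward is discarded.
    return TRAILING_FLAG.split(candidate, maxsplit=1)[0]

# A greedy match that swallowed trailing flags is cut back to the URL.
print(strip_trailing_flags("https://pypi.org/project/iocextract/ /f /n"))
# => https://pypi.org/project/iocextract/

# A URL without whitespace is left untouched.
print(strip_trailing_flags("https://example.com/f"))
# => https://example.com/f
```

The trade-off in this sketch is that a URL which really does contain whitespace followed by a slash would also be truncated, which is why it only makes sense as an opt-in step.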

I'll close this issue as soon as the new release is out.

battleoverflow commented 1 year ago

You can download the new version from PyPI now.

New release: https://pypi.org/project/iocextract/1.13.2/

0x4d4c commented 1 year ago

Wow, that was blazing fast! I tested the new release from PyPI and my sample files are processed correctly now. Thank you very much!

InQuest / iocextract

extract_unencoded_url is too greedy when parsing Windows command lines #53