InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Add backtick to `END_PUNCTUATION` regex #81

Closed (Synse closed 1 month ago)

Synse commented 1 month ago

This PR adds the backtick (`) to the END_PUNCTUATION regex so that URLs (and emails) surrounded by backticks are extracted properly. This improves extraction from Markdown documents, where backticks delimit inline code.

It also adds a basic test to ensure this works as expected for URLs; I did not add a test for emails.

Before

>>> from iocextract import extract_urls
>>> list(extract_urls('foo `https://example.com` bar'))
['https://example.com`']  # BAD: has a trailing backtick
>>>
>>> from iocextract import extract_emails
>>> list(extract_emails('foo `user@example.com` bar'))
[]  # BAD: email address not found
>>> 

After

>>> from iocextract import extract_urls
>>> list(extract_urls('foo `https://example.com` bar'))
['https://example.com']  # GOOD: no trailing backtick
>>>
>>> from iocextract import extract_emails
>>> list(extract_emails('foo `user@example.com` bar'))
['user@example.com']  # GOOD: email extracted with no trailing backtick
>>>
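
To illustrate why the backtick matters, here is a minimal, self-contained sketch of the idea. The character sets, the `build_url_regex` helper, and the URL pattern are simplified assumptions made for this example; they are not the actual iocextract regexes, and the sketch only mirrors the URL case from the Before/After output above.

```python
import re

# Hypothetical, simplified sets of "end punctuation" characters.
# These are assumptions for illustration, NOT iocextract's real definitions.
OLD_END_CHARS = ".?>\"')!,}:;]"
NEW_END_CHARS = OLD_END_CHARS + "`"  # the change: also treat a backtick as end punctuation


def build_url_regex(end_chars):
    """Build a toy URL regex that stops before trailing punctuation.

    The URL body is matched lazily, and a lookahead requires that any run of
    end-punctuation characters is followed by whitespace or end of string.
    """
    end_punctuation = "[" + re.escape(end_chars) + "]*"
    return re.compile(r"https?://\S+?(?=" + end_punctuation + r"(?:\s|$))")


text = "foo `https://example.com` bar"

old_matches = build_url_regex(OLD_END_CHARS).findall(text)
new_matches = build_url_regex(NEW_END_CHARS).findall(text)

print(old_matches)  # ['https://example.com`']  (backtick kept as part of the URL)
print(new_matches)  # ['https://example.com']   (backtick treated as end punctuation)
```

In this sketch the only difference between the two patterns is the extra backtick in the character class, which is enough to let the lookahead succeed one character earlier.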

[!NOTE] While https://example.com/` is a valid URL, the backtick should be percent-encoded (https://example.com/%60). Other valid URLs such as https://example.com/" and https://example.com/} are already extracted without their trailing punctuation, so I don't think this deviates substantially from what is or isn't extracted today.
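
The percent-encoding mentioned in the note can be checked with Python's standard library. This is a small sketch using `urllib.parse`; the `encoded` and `decoded` variable names are just for illustration.

```python
from urllib.parse import quote, unquote

# quote() percent-encodes characters outside its safe set; the backtick is one of them.
encoded = "https://example.com/" + quote("`")
decoded = unquote(encoded)

print(encoded)  # https://example.com/%60
print(decoded)  # https://example.com/`
```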


pedramamini commented 1 month ago

Solid PR comment, thank you 👏 Also a good catch, as backticks in URLs would be encoded as %60.