InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Fails to parse this url correctly #40

Closed: Ben-Steele closed this issue 1 year ago

Ben-Steele commented 4 years ago

The url is: https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>

The trailing > is always stripped off the URL even though it is part of it. When I call extract_iocs I get: https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip

I can share the real URL that I discovered this issue with, but it is malicious, so I didn't want to include it here.

Ben-Steele commented 4 years ago

^ This is not a valid URL, but some applications will URL-encode it and follow the link.
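To illustrate the URL-encoding step mentioned above, here is a minimal sketch using only Python's standard library. It is not what any particular application does; the choice of `safe` characters is an assumption made for this example, and the URL is the placeholder from the original report.

```python
from urllib.parse import quote

raw = "https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>"

# Leave the common URL delimiters untouched (an assumption for this sketch);
# characters outside that set, including "<" and ">", are percent-encoded.
encoded = quote(raw, safe=":/?=&~")

print(encoded)
```

After encoding, the angle brackets become `%3C` and `%3E`, so the trailing `>` is part of the address a client would actually request.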

battleoverflow commented 1 year ago

Hi, @Ben-Steele!

The ability to control the end punctuation should now be finished.

If you are using iocextract as a library, you can remove the punctuation restriction like this:

import iocextract

def rm_punctuation():
    # open_punc=True lifts the end-punctuation restriction, so the trailing ">" is kept
    for url in iocextract.extract_urls("https://www.mysite.com/endpoint?param=abc--~C<http://anothersite.com/myfile.zip>", refang=True, open_punc=True):
        print(url)

rm_punctuation()

If you're using it as a CLI, this command will do the same thing:

iocextract --input urls.txt --extract-urls --open

A new version is not available yet on PyPI. I will post another comment here once a new version is available for download.

battleoverflow commented 1 year ago

The new PyPI package is now available!

PyPI: https://pypi.org/project/iocextract/1.13.8/
GitHub Releases: https://github.com/InQuest/python-iocextract/releases/tag/v1.13.8
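If you are unsure whether your environment already has this release, a quick check along these lines may help. It is only a sketch: it assumes 1.13.8 is the first release that ships the `open_punc` option (as this thread suggests), and `parse_version` and `has_open_punc_release` are hypothetical helper names, not part of iocextract.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Release announced in this thread (assumed to be the first with open_punc).
MIN_VERSION = (1, 13, 8)

def parse_version(text):
    """Turn a string like '1.13.8' into a tuple of ints, keeping leading digits only."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def has_open_punc_release(installed):
    """Return True if the given version string is at least MIN_VERSION."""
    return parse_version(installed) >= MIN_VERSION

try:
    installed = version("iocextract")
    print(installed, has_open_punc_release(installed))
except PackageNotFoundError:
    print("iocextract is not installed")
```

If the check reports an older version, upgrading with pip should pull in the release linked above.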