InQuest / iocextract

Defanged Indicator of Compromise (IOC) Extractor.
https://inquest.readthedocs.io/projects/iocextract/
GNU General Public License v2.0

Improve extraction for non-defanged URLs #61

Closed by battleoverflow 1 year ago

battleoverflow commented 1 year ago

"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls while their non-defanged counterparts don't"

Issue comment: https://github.com/InQuest/python-iocextract/issues/34#issuecomment-1381856822

luis261 commented 1 year ago

Thanks for taking my comment into account! Hopefully this can be fixed (:

battleoverflow commented 1 year ago

Hi, @luis261!

I finally got a second to look over the issue. Your comment was absolutely valuable, but time is unfortunately limited, so I wasn't able to really look into it until now. A solution is currently in testing and will be available in the next release. I've included a few examples with comments below.

You may notice a new parameter: defang_data. With it, if you extract a URL or IP address that isn't defanged, you can defang it right away during extraction. I still have some things to prepare before this release is ready, but I'm planning for this week. I'll make another comment on this thread once it's available for download!

import iocextract

data = [
    "1.1.1.1",
    "1[.]1[.]1[.]1",
    "domain.com",
    "domain[.]com"
]

for d in data:
    # Everything should be refanged
    print(list(iocextract.extract_urls(d, refang=True, no_scheme=True)))

    # Half should be defanged, half should be left as-is (defang_data defaults to False)
    print(list(iocextract.extract_urls(d, refang=False, no_scheme=True)))

    # Everything should be defanged
    print(list(iocextract.extract_urls(d, refang=False, no_scheme=True, defang_data=True)))

luis261 commented 1 year ago

@azazelm3dj3d Alright, thanks for keeping me updated! Once the new release is out, I will check out the new behavior of extract_urls.

battleoverflow commented 1 year ago

The new version is now available: https://pypi.org/project/iocextract/1.14.1/

luis261 commented 1 year ago

Alright, I verified the behavior you wrote about in your comment. However, the fundamental issue of extract_urls pulling in IPs still exists; it now even seems to be the universal behavior (as opposed to occurring only in certain edge cases). That is not what I'd expect after reading the documentation, considering that extract_ips exists as well and that extract_urls is described in the documentation as extracting URLs (IPs are not mentioned).
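
One possible workaround for the behavior described above is to filter the results after extraction. The following is only a minimal sketch using the Python standard library, assuming the extracted strings have already been refanged (for example via refang=True). The helper names host_of, is_ip_host, and drop_ip_results are hypothetical and not part of iocextract.

```python
import ipaddress
from urllib.parse import urlsplit


def host_of(candidate):
    """Return the host part of a refanged URL-like string, or None."""
    # urlsplit only fills in the host when a "//" authority marker is present,
    # so prepend one for scheme-less input such as "1.1.1.1/path".
    target = candidate if "://" in candidate else "//" + candidate
    try:
        return urlsplit(target).hostname
    except ValueError:
        # Malformed input (e.g. unbalanced brackets) has no usable host.
        return None


def is_ip_host(candidate):
    """True when the candidate's host is a literal IPv4 or IPv6 address."""
    host = host_of(candidate)
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def drop_ip_results(candidates):
    """Keep only the candidates whose host is not a bare IP address."""
    return [c for c in candidates if not is_ip_host(c)]
```

Under the assumption above, something like drop_ip_results(iocextract.extract_urls(text, refang=True, no_scheme=True)) would keep only results with a host name, leaving IP extraction to extract_ips.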

battleoverflow commented 1 year ago

Definitely a good note for the future. Because the repository doesn't have many outstanding issues compared with other open-source projects, I haven't taken much time to review the actual documentation and how thorough (or accurate) it is. It was on my backlog but had no issue assigned, so I just took care of that. Thank you for bringing that to my attention.

Issue: https://github.com/InQuest/python-iocextract/issues/65