armbues / ioc_parser

Tool to extract indicators of compromise from security reports in PDF format
MIT License

Are there any thoughts on removing the pdf and html handling from this project? #39

Closed: fhightower closed this issue 7 years ago

fhightower commented 7 years ago

Are there any thoughts, concerns, and/or objections to simplifying this project by removing its PDF and HTML handling? That would turn this script from a command-line tool into a package that can be imported by other Python scripts.

Doing so would reduce the package's dependencies (which would allow Python 3 support), streamline the codebase, and let us focus more time and energy on making the indicator parsing more robust.

Looking at the issues on this project, many of them are related to the PDF parsing libraries this package depends on. While the PDF parsing is nice, it also limits this project to Python 2 and prevents it from being a robust, modular solution that solves one problem very well. Just curious to hear any thoughts on this, as I'm willing to get involved in the development process.

fhightower commented 7 years ago

Nvm, I'm working on a project here that provides more modular functionality: https://github.com/fhightower/ioc-finder

packet-rat commented 7 years ago

Floyd -- HTML and PDF are huge elements of our data mining and of our use of this tool. What's required to move these functions forward? Note: We're currently using Adobe's PDF=>DOCX converters to get stuff into MediaWiki. Would an intermediate conversion like this solve near-term issues?

fhightower commented 6 years ago

Sorry I've been so long in getting back to you.

I was just suggesting that this project be made more modular, so that the HTML and PDF parsing are handled in one project and the IOC parsing in another. This way it would be possible to parse IOCs from an arbitrary string of text without having to worry about reading PDF or HTML.
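
A minimal sketch of that split, using only the Python 3 standard library. This is not the API of ioc_parser or ioc-finder: `extract_iocs`, `html_to_text`, and the `IOC_PATTERNS` table are hypothetical names made up for illustration, and the regular expressions are deliberately simple. The point is only that the indicator parser takes a plain string and knows nothing about where the text came from.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately simple patterns for illustration only.
# Real reports need more indicator types, defanged forms, and validation.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_iocs(text):
    """Indicator parsing: takes any string, returns {type: sorted unique matches}."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Document handling: lives apart from extract_iocs and could be its own package."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # Plain text goes straight to the parser.
    print(extract_iocs("C2 at 10.1.2.3, contact admin@example.com"))
    # HTML is reduced to text first, then handed to the same parser.
    print(extract_iocs(html_to_text("<p>Beacon to <b>192.168.0.1</b></p>")))
```

With this shape, a PDF reader (or an intermediate PDF-to-DOCX conversion step) would be just another function that returns a string, and the indicator-parsing side would not need to depend on any PDF library.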

I'm working on another project for indicator parsing that is a standalone package, and I'll drop a link in here soon.

fhightower commented 6 years ago

I'm working on a standalone IOC parser here: https://github.com/fhightower/ioc-finder.