Closed fhightower closed 7 years ago
Nvm, I'm working on a project here that provides more modular functionality: https://github.com/fhightower/ioc-finder
Floyd -- HTML and PDF are huge elements of our data mining and of how we use this tool. What's required to move these functions forward? Note: We're currently using Adobe's PDF=>DOCX converters to get content into MediaWiki. Would an intermediate conversion like this solve near-term issues?
Sorry I've been so long in getting back to you.
I was just suggesting that this project be made more modular so that the html and pdf parsing are handled in one project and the IOC parsing in another. This way it would be possible to parse IOCs from an arbitrary string of text without having to worry about reading pdf or html.
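To make the idea concrete, here is a minimal sketch of what the IOC-parsing half could look like once it only has to deal with a plain string. The names `find_indicators`, `IPV4_RE`, and `MD5_RE` are hypothetical and are not taken from this project or from any existing package, and the two regexes are deliberately simplistic stand-ins for real indicator grammars.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
# Real indicator parsing (defanged values, IPv6, domains, etc.) needs far more care.
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")


def find_indicators(text: str) -> dict:
    """Return indicators found in an arbitrary string (hypothetical helper).

    The function knows nothing about PDF or HTML; callers are expected to
    extract the text with whatever tool they prefer and pass it in.
    """
    return {
        "ipv4": sorted(set(IPV4_RE.findall(text))),
        "md5": sorted(set(MD5_RE.findall(text))),
    }


if __name__ == "__main__":
    sample = (
        "Beacon to 8.8.8.8 and 192.168.0.1, "
        "payload md5 d41d8cd98f00b204e9800998ecf8427e."
    )
    print(find_indicators(sample))
```

With this split, the text passed in could come from a PDF extractor, an HTML scraper, or a DOCX conversion, and the parsing code would not need to change.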
I'm working on another project that handles indicator parsing as a stand-alone package, and I'll drop a link in here soon.
I'm working on a stand-alone ioc parser here: https://github.com/fhightower/ioc-finder.
Are there any thoughts, concerns, and/or objections to simplifying this project by removing the pdf and html handling portions of the project? This would convert this script from more of a command-line tool to a package that can be used in diverse python scripts.
Doing so would simplify the requirements needed for this package (thus allowing python3 support), streamline the codebase of the package itself, and let us focus more time and energy on making the indicator parsing more robust.
Looking at the issues on this project, many of them are related to the pdf parsing libraries associated with this package. While the PDF parsing is nice, it also limits this project to python2 and prevents it from being a robust, modular solution that solves one problem very well. Just curious to hear any thoughts on this as I'm willing to get involved in the development process.