
armbues / ioc_parser

Tool to extract indicators of compromise from security reports in PDF format

MIT License


Added parsing via PDF2TXT instead of PdfFileReader + patterns #5

Closed by cudeso 9 years ago

cudeso commented 9 years ago

PdfFileReader fails to extract all the IPs and other information from the Kaspersky Equation Group document. Parsing with PDF2TXT works, but then you lose the page numbers (acceptable IMHO).

armbues commented 9 years ago

I'm working on a version of ioc-parser that can switch between different PDF parsing libraries. PDF2TXT is part of pdfminer, which can in fact be used in a way that preserves page numbers. Dropping a feature like page numbers might be acceptable, but there's no way to know how many users rely on it.
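
For illustration only, here is a minimal sketch of what a switchable backend that keeps page numbers could look like. It is not the code from this pull request or from ioc-parser. It assumes the pdfminer.six fork and its `extract_pages` helper, and the names `pages_pdfminer`, `BACKENDS`, `extract_ips` and the simplified `IPV4_RE` pattern are all hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Page = Tuple[int, str]  # (1-based page number, text of that page)

# Simplified IPv4 pattern for illustration; not ioc-parser's own pattern set.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b")


def pages_pdfminer(path: str) -> Iterator[Page]:
    """Yield (page_number, text) pairs using pdfminer.six (assumed installed)."""
    # Imported lazily so the other backends work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, layout in enumerate(extract_pages(path), start=1):
        text = "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )
        yield page_no, text


# Hypothetical registry: any callable mapping a path to (page, text) pairs
# can be plugged in, which is what makes the parsing library switchable.
BACKENDS: Dict[str, Callable[[str], Iterable[Page]]] = {
    "pdfminer": pages_pdfminer,
}


def extract_ips(path: str, backend: str = "pdfminer") -> List[Tuple[int, str]]:
    """Return (page_number, ip) pairs found by the chosen backend."""
    results = []
    for page_no, text in BACKENDS[backend](path):
        for match in IPV4_RE.finditer(text):
            results.append((page_no, match.group(0)))
    return results
```

Because each backend yields text one page at a time, the page number stays attached to every match regardless of which library produced the text.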