Open scotmatson opened 8 years ago
I was reading the Issue response regarding PyPDF2 vs. pdfminer. I understand the reason behind sticking with pdfminer, but feel it would be worthwhile to implement a solution that also addresses the Python3 problems out of the box.
Made changes that fixed an error caused by mixing tabs and spaces. The biggest change, though, was adding support for both Python2 and Python3. This included dropping the unicode encoding in html_parser, adding an updated form of the StringIO module, and adding the parentheses that were missing from various print statements.
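For reference, a minimal sketch of the kind of Python2/Python3 shim described above. The `render` helper is hypothetical and only there to exercise the imports; it is not code from this pull request.

```python
# Make print a function on Python 2 as well, so print(...) with
# parentheses and keyword arguments behaves the same on both versions.
from __future__ import print_function

try:
    # Python 2: StringIO lives in its own module.
    from StringIO import StringIO
except ImportError:
    # Python 3: the StringIO module was folded into io.
    from io import StringIO


def render(lines):
    """Hypothetical helper: write each line into an in-memory buffer."""
    buf = StringIO()
    for line in lines:
        print(line, file=buf)
    return buf.getvalue()
```

The try/except import keeps a single code path for callers, so the rest of the module does not need to check the interpreter version.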
I would like to note that I've only been testing the HTML parsing with this pull request. Nothing else should be affected, but I am still learning how this utility is built. I've been testing HTML pages from malware-traffic-analysis.net, which pull in many false positives (FPs) - something I plan on looking into in the future.
Finally, I made PyPDF2 the default PDF parser, as it supports both Python2 and Python3 where pdfminer does not.
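One way to express "prefer PyPDF2, fall back to pdfminer" is sketched below. The `pick_pdf_backend` function and its `preferred` parameter are hypothetical names used for illustration, and this is an assumption about how such a default could be chosen, not the change made in this pull request.

```python
import importlib


def pick_pdf_backend(preferred=("PyPDF2", "pdfminer")):
    """Return the name of the first importable module in `preferred`.

    Hypothetical helper: tries each module name in order and returns
    None when none of them is installed.
    """
    for name in preferred:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        return name
    return None
```

Calling `pick_pdf_backend()` with the default order would select PyPDF2 whenever it is installed, which matches the new default on either Python version.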