deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

UnicodeDecodeErrors #337

Open 0x4A42 opened 4 years ago

0x4A42 commented 4 years ago

Describe the bug

I recently took over the software for a university project. It uses textract to parse PDFs and then processes the text, eventually storing it as JSON. One such function is:

    # Extracts all the text from the PDF while removing superfluous/unmatched space characters
    def __get_pdf_text__(self):
        text = textract.process(self.filePath).decode("utf-8")
        text = text.replace('\n', '').replace('\r', '  ')
        return text
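For reference, the chained `replace` calls above delete every `\n` outright and turn each `\r` into two spaces, so words on adjacent lines can end up joined. A minimal sketch of one alternative, assuming the intent is to collapse each run of whitespace into a single space; `normalize_whitespace` is a hypothetical helper and not part of textract:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # Hypothetical helper for illustration; it does not call textract at all.
    return re.sub(r"\s+", " ", text).strip()


sample = "first line\r\nsecond   line\n\nthird"
print(normalize_whitespace(sample))
```

It would be applied to the decoded string, in place of the two `replace` calls.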

The extracted text then gets processed by the following script:

class CitationLoaderTxt(CitationLoaderBase.CitationLoaderBase):
    # Path: path to the text file you are looking to extract citations from
    # AnalyzedFiles: Array containing the info of files that have already been analyzed
    def __init__(self, path):
        # Raw string so that \d is passed to the regex engine unchanged
        self.regex = r'\d*.(.*)"(.*)"(.*)|\d*.(.*)'
        self.path = path
        self.analyzedFiles = []

    # Loads the file from the instance variable and runs through it, returning all
    # matches to the regex supplied as an instance variable, as an array
    def return_citation_array(self):
        if self.__has_file_been_read__():
            print("Finding matches to " + self.regex + " in file at " + self.path + " to return as array")
            list_of_citations = []
            with open(self.path, 'r', encoding='utf8') as file:
                for string in self.__nonblank_lines__(file):
                    match = re.search(self.regex, string)
                    if match.group(1) is None:
                        list_of_citations.append(CitationObj(match.group(4), [], "", self.path))
                    else:
                        list_of_authors = [match.group(1)]
                        list_of_citations.append(CitationObj(match.group(2), list_of_authors, match.group(3), self.path))
            self.analyzedFiles.append(self.path)
            return list_of_citations
        else:
            print("File already analyzed")

    def return_citation_dictionary(self):
        if self.__has_file_been_read__():
            print("Finding matches to " + self.regex + " in file at " + self.path + " to return as dictionary")
            list_of_citations = []
            with open(self.path, 'r', encoding='utf8') as file:
                for string in self.__nonblank_lines__(file):
                    match = re.search(self.regex, string)
                    if match.group(1) is None:
                        list_of_citations.append(CitationObj(match.group(4), [], "", self.path))
                        self.analyzedFiles.append(self.path)
                    else:
                        list_of_authors = [match.group(1)]
                        list_of_citations.append(CitationObj(match.group(2), list_of_authors, match.group(3), self.path))
                        self.analyzedFiles.append(self.path)
            citation_dict = {"Citations" : list_of_citations}
            self.analyzedFiles.append(self.path)
            return citation_dict
        else:
            print("File already analyzed")

    def change_file(self, new_file_path):
        self.path = new_file_path

    # Clears the analyzed files from the analyzedFiles list
    def clear_analyzed_files(self):
        self.analyzedFiles = []

    # removes all blank lines from the input file to help preserve ordering
    def __nonblank_lines__(self, file):
        for l in file:
            line = l.rstrip()
            if line:
                yield line

    # Determines if the file has already been analyzed and returns a boolean to that effect
    def __has_file_been_read__(self):
        if len(self.analyzedFiles) > 0:
            for file in self.analyzedFiles:
                if self.path == file:
                    return False
        return True
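To make the group numbering in the loader easier to follow, here is a small sketch of how that pattern splits a line. The two sample lines are invented for illustration, and `split_citation` is a hypothetical stand-in for the branch inside `return_citation_array`:

```python
import re

# Same pattern as self.regex in the class above, written as a raw string.
CITATION_RE = r'\d*.(.*)"(.*)"(.*)|\d*.(.*)'


def split_citation(line):
    """Return (authors, title, journal), reading the groups the way the loader does."""
    match = re.search(CITATION_RE, line)
    if match is None:
        # Only happens for an empty string; the loader skips blank lines beforehand.
        return None
    if match.group(1) is None:
        # No quoted title found: the whole remainder is treated as the title.
        return ("", match.group(4), "")
    return (match.group(1), match.group(2), match.group(3))


quoted = split_citation('1. A. Author, "An Example Title" Example Journal')
plain = split_citation('2. An unquoted reference')
print(quoted)
print(plain)
```

Because the second alternative matches any non-empty line, binary garbage also "matches" and ends up in the title field.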

When I run it with the utf-8 codec, which the original developer had in the existing code, I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 72: invalid start byte

If I run without encoding='utf8' in the with open() calls, I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>
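Both messages can be reproduced in isolation by decoding single bytes. This is only a sketch: it assumes the default encoding that `open()` falls back to on this Windows machine is cp1252 (Python reports it as the `charmap` codec), which is not stated in the report. `try_decode` is a hypothetical helper.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Return 'ok' if the bytes decode cleanly, otherwise the error message."""
    try:
        data.decode(encoding)
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)


# 0x9c is a continuation byte, so it cannot start a UTF-8 sequence.
utf8_result = try_decode(b"\x9c", "utf-8")
# 0x8f is one of the byte values cp1252 leaves unassigned.
cp1252_result = try_decode(b"\x8f", "cp1252")

print(utf8_result)
print(cp1252_result)
```

Errors of this kind usually mean the bytes being read are not text in the chosen encoding at all, rather than that a different text codec is needed.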

When I try other codecs, I get undecipherable results such as the one below, which leads me to believe those are not the right codecs either. Similarly, if I stick with UTF-8 and pass errors='ignore' to the open() calls that read the extracted text, I get results like this:

[{"__class__": "CitationObj", "__module__": "Citations.CitationObj", "title": "\u2020\u2021L\u017e|}\u00fa\u0002Ii8\u203arJUb\u00a3\u001b7\u00fa\u0006\u00a2\u203a\u2021\u203a\u00e8\u00e6\u00db7\u00ef\u00ee\u00de|\u00f1\u00be\u00cco\u0160\u00b0\u00cc\u00b2\u00e4\u00e6\u00ee\u00fe&\u00ceu\u00a8\u00b3\u00fc&/\u00f30/\u00d2\u203a\u00bb\u00e3\u00cd?\u0192\u00f7\u00b7E\u0014\u201e\u00b7{\u2022F\u00c1\u00f7\u00d5`;\u0006\u00ab\u00ee\u00c8\u00c0\u00d7B\u00fb0\u00ce\u00b4\u00b7\u00ff\u00ba\u00fb\u00e3M\u2018\u2021I\u2019\u00dcDa\u201dH'?\u00ab,!\u00d2>", "author": [], "journal": "", "id": "C:/Users/name/Downloads/paper-8641.pdf", "classification": "Academic"}
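The `id` in the output above ends in `.pdf`, which suggests (an assumption, not something confirmed in this report) that `CitationLoaderTxt` may be opening the PDF itself as text rather than a text file holding the extracted output. A minimal sketch of a check for that, relying only on the fact that PDF files start with the bytes `%PDF-`; `looks_like_pdf` and the file names are hypothetical:

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demonstration on two throwaway files: a fake PDF and a plain UTF-8 text file.
with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "sample.pdf")
    txt_path = os.path.join(tmp, "sample.txt")
    with open(pdf_path, "wb") as handle:
        handle.write(b"%PDF-1.4\n\x9c\x8f binary stream")
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write('1. A. Author, "An Example Title" Example Journal\n')
    pdf_check = looks_like_pdf(pdf_path)
    txt_check = looks_like_pdf(txt_path)

print(pdf_check, txt_check)
```

If the check returns True for the path given to the loader, the text returned by `__get_pdf_text__` would need to be written to a text file first (or passed in directly) before the regex is applied.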

Desktop (please complete the following information):

Windows 10
Python 3.8
No VE

Additional context

I have tried a fresh install of Python, my IDE, and all libraries in case I made an error when installing and setting up the Python environment, but this did not solve the issue.

I have tried multiple PDFs, both ones the original developer had in his test data and ones of my own, and all of them produce these errors.

Any and all help is much appreciated.

traverseda commented 3 years ago

@deanmalmgren Do you still maintain this? This pull request solves a pretty big bug for me, and I'm unclear if this project still has any maintainers.