Describe the bug
I recently took over the software for a university project. It uses textract to parse PDFs, processes the extracted text, and eventually stores it in a JSON file. One such function is:
# Extracts all the text from the pdf while removing superfluous/unmatched space characters
def __get_pdf_text__(self):
    text = textract.process(self.filePath).decode("utf-8")
    text = text.replace('\n', '').replace('\r', ' ')
    return text
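The code that writes this text to disk is not shown here. As a minimal sketch, assuming the text is saved with a plain open() call, passing the same encoding on both the write and the read keeps the two sides consistent. The names save_text, load_text and round_trip are hypothetical and not part of the project.

```python
import os
import tempfile

# Assumption: the reader opens the file with encoding='utf8', as in the loader class.
ENCODING = "utf-8"


def save_text(text: str, path: str) -> None:
    # newline="" stops Windows from translating "\n" to "\r\n" on write.
    with open(path, "w", encoding=ENCODING, newline="") as handle:
        handle.write(text)


def load_text(path: str) -> str:
    # Read with the same encoding that save_text used.
    with open(path, "r", encoding=ENCODING, newline="") as handle:
        return handle.read()


def round_trip(text: str) -> str:
    # Write the text to a temporary file and read it straight back.
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "extracted.txt")
        save_text(text, path)
        return load_text(path)
```

If the write side omits the encoding argument, Python uses the platform default, which on Windows is usually not UTF-8, so a later read with encoding='utf8' can fail on any non-ASCII character.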
The extracted text then gets processed by the following script:
class CitationLoaderTxt(CitationLoaderBase.CitationLoaderBase):
    # Path: path to the text file you are looking to extract citations from
    # AnalyzedFiles: Array containing the info of files that have already been analyzed
    def __init__(self, path):
        self.regex = "\d*.(.*)\"(.*)\"(.*)|\d*.(.*)"
        self.path = path
        self.analyzedFiles = []

    # Loads file from instance variable and runs through the file returning all
    # matches to the regex supplied as a instance variable as a array
    def return_citation_array(self):
        if self.__has_file_been_read__():
            print("Finding matches to " + self.regex + " in file at " + self.path + " to return as array")
            list_of_citations = []
            with open(self.path, 'r', encoding='utf8') as file:
                for string in self.__nonblank_lines__(file):
                    match = re.search(self.regex, string)
                    if match.group(1) is None:
                        list_of_citations.append(CitationObj(match.group(4), [], "", self.path))
                    else:
                        list_of_authors = [match.group(1)]
                        list_of_citations.append(CitationObj(match.group(2), list_of_authors, match.group(3), self.path))
            self.analyzedFiles.append(self.path)
            return list_of_citations
        else:
            print("File already analyzed")

    def return_citation_dictionary(self):
        if self.__has_file_been_read__():
            print("Finding matches to " + self.regex + " in file at " + self.path + " to return as dictionary")
            list_of_citations = []
            with open(self.path, 'r', encoding='utf8') as file:
                for string in self.__nonblank_lines__(file):
                    match = re.search(self.regex, string)
                    if match.group(1) is None:
                        list_of_citations.append(CitationObj(match.group(4), [], "", self.path))
                        self.analyzedFiles.append(self.path)
                    else:
                        list_of_authors = [match.group(1)]
                        list_of_citations.append(CitationObj(match.group(2), list_of_authors, match.group(3), self.path))
                        self.analyzedFiles.append(self.path)
            citation_dict = {"Citations" : list_of_citations}
            self.analyzedFiles.append(self.path)
            return citation_dict
        else:
            print("File already analyzed")

    def change_file(self, new_file_path):
        self.path = new_file_path

    # Clears the analyzed files from the analyzedFiles list
    def clear_analyzed_files(self):
        self.analyzedFiles = []

    # removes all blank lines from the input file to help preserve ordering
    def __nonblank_lines__(self, file):
        for l in file:
            line = l.rstrip()
            if line:
                yield line

    # Determines if the file has already been analyzed and returns a boolean to that effect
    def __has_file_been_read__(self):
        if len(self.analyzedFiles) > 0:
            for file in self.analyzedFiles:
                if self.path == file:
                    return False
        return True
When I run it with the utf-8 codec, which the original developer used in the existing code, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 72: invalid start byte
If I remove encoding='utf8' from the open() calls, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 10: character maps to <undefined>
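Both messages can be reproduced from a single byte, which may help with triage. This is a minimal sketch, assuming the default encoding that open() falls back to on this machine is cp1252 (typical for Western-locale Windows installs, but not confirmed here). The helper describe_decode_failure is hypothetical.

```python
def describe_decode_failure(raw: bytes, encoding: str) -> str:
    """Return the decoded text, or a one-line description of why decoding failed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the codec could not handle.
        bad_byte = raw[exc.start]
        return (
            f"{exc.encoding!r} failed on byte 0x{bad_byte:02x} "
            f"at position {exc.start}: {exc.reason}"
        )


if __name__ == "__main__":
    # 0x9c is a UTF-8 continuation byte, so it cannot start a character.
    print(describe_decode_failure(b"\x9c", "utf-8"))
    # 0x8f has no character assigned in cp1252.
    print(describe_decode_failure(b"\x8f", "cp1252"))
```

That neither codec accepts the bytes suggests the file may not be plain text in either encoding, though that is only a guess from the two messages.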
When I try other codecs, I get undecipherable output, which leads me to believe none of them is the right codec. Similarly, if I stay with UTF-8 and pass errors='ignore' to the open() calls that read the extracted text, the output is just as garbled.
Desktop (please complete the following information):
Windows 10
Python 3.8
No virtual environment
Additional context
I have tried a fresh install of Python, my IDE, and all libraries in case I made a mistake when installing and setting up the Python environment, but this did not solve the issue.
I have tried multiple PDFs, both from the original developer's test data and my own, and all of them produce these errors.
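One way to narrow this down is to read the failing file as raw bytes and look at what surrounds the reported position. The sketch below is only a diagnostic aid under that assumption. The helpers hex_context and decodable_with are hypothetical, and the list of candidate encodings is a guess.

```python
from typing import Dict, Iterable


def hex_context(raw: bytes, position: int, width: int = 8) -> str:
    """Hex dump of the bytes within `width` of `position`."""
    start = max(0, position - width)
    end = min(len(raw), position + width + 1)
    return " ".join(f"{byte:02x}" for byte in raw[start:end])


def decodable_with(raw: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Map each encoding that decodes `raw` without error to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes; try the next one
    return results


def inspect_file(path: str, position: int) -> None:
    # Open in binary mode so no codec is applied while reading.
    with open(path, "rb") as handle:
        raw = handle.read()
    print(hex_context(raw, position))
    for name, text in decodable_with(raw, ["utf-8", "cp1252", "utf-16"]).items():
        print(name, repr(text[:80]))
```

For example, inspect_file(loader.path, 72) would show the bytes around the position from the first traceback, where loader stands for a CitationLoaderTxt instance. A codec that decodes without error is not necessarily the right one, so the printed text still has to be checked by eye.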
Any and all help is much appreciated.