Closed tpmccallum closed 9 years ago
I have managed to resolve this by changing the way Python makes the filename string in the jsoncatalog.txt file.
I also noticed another issue. It is caused by my own code elsewhere, but may still be useful to note: if there are pdf files in the texts/raw directory, the build fails. I was temporarily placing pdf files in the texts/raw directory (for no good reason) whilst I scraped their text, and then removing them afterwards; I guess a pdf file was left in there when I stopped my script during tests. All good now!
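For anyone hitting the same KeyError, here is a rough sketch of a check that can be run before the build. It is not part of bookworm; the function name `find_catalog_mismatches` and its two path arguments are my own placeholders. It assumes jsoncatalog.txt holds one JSON object per line with a "filename" key, and that each text file in texts/raw is named after that key plus a .txt extension, as in the excerpts in this issue.

```python
import json
import os


def find_catalog_mismatches(catalog_path, raw_dir):
    """Return (missing, stray) for a catalog file and a raw text directory.

    missing: .txt files whose name (minus .txt) has no "filename" entry
             in the catalog.
    stray:   files in raw_dir that do not end in .txt (e.g. leftover pdfs).
    """
    # Collect every "filename" value; assumes one JSON object per line.
    catalogued = set()
    with open(catalog_path) as catalog:
        for line in catalog:
            line = line.strip()
            if line:
                catalogued.add(json.loads(line)["filename"])

    missing, stray = [], []
    for name in sorted(os.listdir(raw_dir)):
        stem, ext = os.path.splitext(name)
        if ext != ".txt":
            stray.append(name)
        elif stem not in catalogued:
            missing.append(name)
    return missing, stray
```

Calling it with the path to jsoncatalog.txt and the path to the texts/raw directory returns two lists: text files with no matching catalog entry, and files such as stray pdfs. Both lists should be empty before running make.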
Hi, I am getting the following error when running the makefile. To help narrow it down, I have identified the last line of code that succeeds before the error: the assignment of the tokens variable, which returns an object, e.g. <__main__.tokenizer object at 0x2ac2422d3490>, at around line 85 of the tokenizer.py file.
The actual error that then occurs is as follows:
'''
ERROR:root:Warning: file httpeprintsusqeduau114001Ogunmokun_McPhail_Chin_ANZMAC2003_PVpdf not found in jsoncatalog.txt, not encoding
Traceback (most recent call last):
  File "bookworm/tokenizer.py", line 103, in encodeRow
    textid = IDfile[filename]
  File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in __getitem__
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
  File "/usr/lib/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
  File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in <lambda>
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
KeyError: 'httpeprintsusqeduau114001Ogunmokun_McPhail_Chin_ANZMAC2003_PVpdf'
'''
Please also find an excerpt from my jsoncatalog.txt:
'''
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/16/1/DanielPinkham-2004.pdf\" target=\"blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau161DanielPinkham2004pdf"}
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/43/1/Dissertation-_CameronMacGregor-Q12216129-_RiskMapping.pdf\" target=\"_blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau431Dissertation__Cameron_MacGregorQ12216129Risk_Mapping_pdf"}
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/44/1/DebraBARNEY_2004.pdf\" target=\"_blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau441DebraBARNEY_2004pdf"}
'''
And an excerpt from the directory listing of the files/texts/raw directory:
'''
httpeprintsusqeduau441DebraBARNEY_2004pdf.txt
httpeprintsusqeduau85673McCarthy_Hancock_Raine_ISR_2010_AVpdf.txt
httpeprintsusqeduau85762FYHE252020102520Electronic2520Details2520v2a_0pdf.txt
httpeprintsusqeduau8721Gururajan_Awaiting_file_to_upload_28Conference_paper29pdf.txt
httpeprintsusqeduau87403WCC2010pdf.txt
httpeprintsusqeduau8741Julian_Holtedahlpdf.txt
'''
Any assistance would be greatly appreciated guys. Thank you so much. Tim