Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Error relating to text file names #73

Closed tpmccallum closed 9 years ago

tpmccallum commented 9 years ago

Hi, I am getting the following error when running the Makefile. To help narrow it down, I have identified the last line of code that runs successfully before the error: the assignment of the tokens variable, which returns an object, e.g. <__main__.tokenizer object at 0x2ac2422d3490>. This is around line 85 of the tokenizer.py file.

The actual error which then occurs is as follows:

'''
ERROR:root:Warning: file httpeprintsusqeduau114001Ogunmokun_McPhail_Chin_ANZMAC2003_PVpdf not found in jsoncatalog.txt, not encoding
Traceback (most recent call last):
  File "bookworm/tokenizer.py", line 103, in encodeRow
    textid = IDfile[filename]
  File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in __getitem__
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
  File "/usr/lib/python2.7/bsddb/dbutils.py", line 68, in DeadlockWrap
    return function(*_args, **_kwargs)
  File "/usr/lib/python2.7/bsddb/__init__.py", line 270, in <lambda>
    return _DeadlockWrap(lambda: self.db[key])  # self.db[key]
KeyError: 'httpeprintsusqeduau114001Ogunmokun_McPhail_Chin_ANZMAC2003_PVpdf'
'''

Please also find an excerpt from my jsoncatalog.txt:

'''
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/16/1/DanielPinkham-2004.pdf\" target=\"blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau161DanielPinkham2004pdf"}
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/43/1/Dissertation-_CameronMacGregor-Q12216129-_RiskMapping.pdf\" target=\"_blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau431Dissertation__Cameron_MacGregorQ12216129Risk_Mapping_pdf"}
{"date": 2004, "uni": "USQ ePrints", "searchstring": "<a href=\"http://eprints.usq.edu.au/44/1/DebraBARNEY_2004.pdf\" target=\"_blank\">2004 document from USQ ePrints", "filename": "httpeprintsusqeduau441DebraBARNEY_2004pdf"}
'''

And an excerpt from the directory listing of the files/texts/raw directory:

'''
httpeprintsusqeduau441DebraBARNEY_2004pdf.txt
httpeprintsusqeduau85673McCarthy_Hancock_Raine_ISR_2010_AVpdf.txt
httpeprintsusqeduau85762FYHE252020102520Electronic2520Details2520v2a_0pdf.txt
httpeprintsusqeduau8721Gururajan_Awaiting_file_to_upload_28Conference_paper29pdf.txt
httpeprintsusqeduau87403WCC2010pdf.txt
httpeprintsusqeduau8741Julian_Holtedahlpdf.txt
'''

Any assistance would be greatly appreciated guys. Thank you so much. Tim
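
One way to narrow down a mismatch like the one reported above is to compare the "filename" values in jsoncatalog.txt with the .txt files actually present in the raw texts directory. The following is only a rough sketch, not part of BookwormDB: the function name find_unmatched is made up for illustration, it is written for Python 3 rather than the Python 2.7 shown in the traceback, and it assumes the catalog holds one JSON object per line, as in the excerpt.

```python
import json
import os


def find_unmatched(catalog_path, raw_dir):
    """Return (texts_without_catalog_entry, catalog_entries_without_text).

    Hypothetical helper: assumes one JSON object per line in the catalog
    and that each text is stored as <filename>.txt in raw_dir.
    """
    catalog_names = set()
    with open(catalog_path, encoding="utf-8") as catalog:
        for line in catalog:
            line = line.strip()
            if line:  # skip blank lines
                catalog_names.add(json.loads(line)["filename"])

    # Text identifiers are the .txt file names with the extension removed.
    text_names = {
        os.path.splitext(name)[0]
        for name in os.listdir(raw_dir)
        if name.endswith(".txt")
    }
    return (
        sorted(text_names - catalog_names),
        sorted(catalog_names - text_names),
    )
```

Calling it with the path to jsoncatalog.txt and to files/texts/raw (both depend on the local layout) would list any text, such as the one named in the KeyError, that has no catalog entry, before the tokenizer is run.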

tpmccallum commented 9 years ago

I have managed to resolve this by changing the way my Python code builds the filename string written to the jsoncatalog.txt file.
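
The thread does not say what the change actually was. As a guess at the general idea, a single helper could derive the identifier from the URL so that the catalog entry and the text file name cannot drift apart. The sketch below assumes the rule suggested by the catalog excerpt (drop every character that is not a letter, digit, or underscore); text_id_from_url is a hypothetical name, not something from BookwormDB.

```python
import re


def text_id_from_url(url):
    """Drop every character that is not an ASCII letter, digit, or underscore.

    Assumed rule, inferred from the catalog excerpt in this thread.
    """
    return re.sub(r"[^A-Za-z0-9_]", "", url)


url = "http://eprints.usq.edu.au/44/1/DebraBARNEY_2004.pdf"
text_id = text_id_from_url(url)

catalog_filename = text_id        # value for the "filename" field
raw_text_name = text_id + ".txt"  # name of the file under texts/raw

print(catalog_filename)  # httpeprintsusqeduau441DebraBARNEY_2004pdf
print(raw_text_name)     # httpeprintsusqeduau441DebraBARNEY_2004pdf.txt
```

Using the same function in both places means the key looked up by the tokenizer is built exactly the same way as the key stored in the catalog.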

tpmccallum commented 9 years ago

I noticed another issue, which was caused by my own code elsewhere but may still be useful to note: if there are pdf files in the texts/raw directory, this causes a problem. I was temporarily placing pdf files in the texts/raw directory (for no good reason) whilst I scraped their text, and then removing them afterwards. I guess a pdf file was left in there when I stopped my script during tests. All good now!
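
For anyone hitting the same thing, a quick check for leftover non-text files before running the build might look like the sketch below. The name stray_files is made up for illustration, and the check simply assumes that everything in the raw directory should end in .txt.

```python
import os


def stray_files(raw_dir):
    """List entries in raw_dir whose names do not end in .txt (hypothetical check)."""
    return sorted(
        name for name in os.listdir(raw_dir)
        if not name.lower().endswith(".txt")
    )
```

An empty list from stray_files on the texts/raw directory would mean no pdf (or other) files were left behind by an interrupted scraping run.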