JonathanReeve / sanger

Margaret Sanger Papers Project Search Engine

Issues with documents, rebuild, search #83

Open CathyHajo opened 9 years ago

CathyHajo commented 9 years ago

I'm trying to make sure that all the documents are there, and am coming across instances where multiple copies appear to be in the database. I'm using "List all documents" and then looking at them in alphabetical order. Sometimes we have a document in under 2 different numbers, and that is easy to fix. But I have come across some like this one: search in title for "dismissal" and you will get 2 results. One returns a blank page and references 421997.xml.b in the URL (I'm guessing that is the .bak file). The other returns the correct document and references 421997.xml.
When I search for the .bak file, it is not in the xml_added or xml_queue directories.

I tried doing a new git pull and rebuilding the database. There are others like this. I'm not sure how to handle them.

JonathanReeve commented 9 years ago

My guess is that a .bak file or some other non-.xml file got committed to xml_queue, and that the parser is trying to parse those files, and/or the rebuild script doesn't see them or isn't prepared to handle non-XML files in the XML directory.
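
If that guess is right, one way to guard against it would be for the rebuild step to pick up only files whose final extension is .xml. Here is a minimal sketch in Python, purely as an illustration: the function name `xml_documents` is hypothetical, and the actual rebuild script may be written differently (or in another language).

```python
from pathlib import Path


def xml_documents(directory):
    """Return the files in `directory` whose final extension is .xml, sorted by path.

    Hypothetical helper: a name like 421997.xml.bak is skipped because its
    final extension is .bak, not .xml.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".xml"
    )
```

With a filter like this in place, a stray backup file in xml_queue would simply be ignored instead of being indexed as a second copy of the document.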

The .bak files I was able to find in xml_added were these:

236191.xml.bak 236938.xml.bak 421940.xml.bak 421961.xml.bak 236386.xml.bak 421066.xml.bak 421953.xml.bak 421997.xml.bak

and those I was able to find in xml_queue were these:

143743.xml.bak 421906.xml.bak mepTemplate.xml.bak
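
To check whether any other stray files are left, something like the following could be run from the repository root. This is a rough sketch, not part of the project: the function name `stray_files` is made up, and it assumes the two directories are named xml_added and xml_queue and sit directly under the directory it is run in.

```python
from pathlib import Path


def stray_files(root, directories=("xml_added", "xml_queue")):
    """Map each directory name to the sorted names of files not ending in .xml.

    Assumption: `directories` exist directly under `root`; missing ones are skipped.
    """
    found = {}
    for name in directories:
        folder = Path(root) / name
        if not folder.is_dir():
            continue
        found[name] = sorted(
            p.name for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() != ".xml"
        )
    return found


if __name__ == "__main__":
    # Print one "directory/filename" line per stray file.
    for directory, names in stray_files(".").items():
        for name in names:
            print(f"{directory}/{name}")
```

Any file this lists is one the parser should probably not be reading.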

I'll see if I can add *.bak to the .gitignore file so that there's less of a chance they get committed in the future.