Open CathyHajo opened 11 years ago
I'll look into this.
It looks like the list of journals on search.php is populated by parse2.php, which grabs <title>s only within <sourceDesc> and <bibl> and does not grab titles from elsewhere in the XML. So it's mysterious why the example you give, document 206332, has "Birth Control Review" (no "The") in <sourceDesc> -> <bibl> -> <title> and yet is associated with the journal title "The Birth Control Review". My guess is that this document's XML header originally had "The Birth Control Review" and that the XML document was changed after it was parsed, so the database retained the old journal title. I tried reparsing this document, and it is no longer associated with "The Birth Control Review" after the reparse. However, its "journal" field in the database is now blank, which is a whole new problem. Maybe there's something malformed elsewhere in the XML document?
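To illustrate the header-only behavior described above, here is a minimal Python sketch. It is not the actual parse2.php code; the function name source_titles and the sample XML are made up for illustration, and the sample assumes no XML namespaces. It shows that a <title> in the body text is not picked up when only <sourceDesc> -> <bibl> -> <title> is queried.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified document: one title in the header,
# one (different) title mentioned in the body text.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <sourceDesc>
        <bibl><title>Birth Control Review</title></bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body><p>An article in <title>The Birth Control Review</title>.</p></body>
  </text>
</TEI>"""


def source_titles(xml_text):
    """Return the text of every <title> directly inside <sourceDesc>/<bibl>."""
    root = ET.fromstring(xml_text)
    return [
        "".join(title.itertext()).strip()
        for title in root.findall(".//sourceDesc/bibl/title")
    ]


print(source_titles(SAMPLE))
```

If the parser really works this way, a journal name that differs from the header title most likely comes from stale database rows rather than from the body text.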
I did notice that the list of journals contains books and other non-journals, like the Britannica Book of the Year, but since they're all marked up with <sourceDesc> -> <bibl> -> <title>, the way to remove these would be to add extra markup to differentiate the ones that shouldn't appear in the journals list. If you let me know which ones shouldn't appear in the list, we can figure out a way to mark them up so that they don't appear there.
Microfilm bibliography entries are already ignored by the parsing engine. If a <title> in <sourceDesc> contains "LCM," "MCM," or "margaret sanger microfilm", the parser doesn't add that entry to the list of journals when it parses the XML, so there's no need to remove the <title> tags there. Are there microfilm citations that appear in the journals drop-down list?
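As a rough picture of that microfilm check, here is a small Python sketch. The three marker strings come from the comment above. Everything else is an assumption: the names MICROFILM_MARKERS, is_microfilm, and drop_microfilm are hypothetical, the matching is assumed to be a case-insensitive substring test, and the sample citations are invented. The real PHP parser may match differently.

```python
# Marker strings taken from the comment above; lowercased here because this
# sketch ASSUMES case-insensitive matching (the real parser may differ).
MICROFILM_MARKERS = ("lcm", "mcm", "margaret sanger microfilm")


def is_microfilm(title):
    """Return True if the title text contains any microfilm marker."""
    lowered = title.lower()
    return any(marker in lowered for marker in MICROFILM_MARKERS)


def drop_microfilm(titles):
    """Keep only the titles that do not look like microfilm citations."""
    return [title for title in titles if not is_microfilm(title)]


# Invented sample titles, for illustration only.
titles = [
    "Birth Control Review",
    "Margaret Sanger Microfilm, reel 1",
    "LCM 1:1",
]
print(drop_microfilm(titles))
```

A plain substring test like this would also drop any journal whose name happens to contain "lcm" or "mcm", so a stricter pattern may be worth considering if such titles exist.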
This problem and the problem of redundant entries in #18 can probably be solved by completely rebuilding the database. (Which will have to be done at some point anyway to parse mentions.) This could probably be done by:
- journals table in phpMyAdmin (sanger -> "check all" -> "with selected" -> empty)
- xml_queue
- parse.php on all the XML files
It'd be best to back up the database first, though, in case anything goes wrong.
We started tagging titles with the attribute type="journal" so that should solve it.
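Here is a sketch of how the type="journal" attribute could be used to build the journals list. It is written in Python for illustration and is not the project's PHP parser; the function name journal_titles and the sample XML are hypothetical, and the sample assumes no XML namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment: one title tagged as a journal, one untagged.
SAMPLE = """<teiHeader>
  <sourceDesc>
    <bibl><title type="journal">Birth Control Review</title></bibl>
    <bibl><title>Britannica Book of the Year</title></bibl>
  </sourceDesc>
</teiHeader>"""


def journal_titles(xml_text):
    """Return header titles that carry the attribute type="journal"."""
    root = ET.fromstring(xml_text)
    matches = root.findall(".//sourceDesc/bibl/title[@type='journal']")
    return ["".join(title.itertext()).strip() for title in matches]


print(journal_titles(SAMPLE))
```

With a filter like this, untagged titles such as books stay out of the drop-down without removing their <title> markup.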
On the search page, the dropdown list of titles is supposed to "find documents from a particular journal," but it is doing something different. I think that everything tagged <title> is going into that list, even if it is a book referred to in the text of the speech.
What I think should go in this search is every title that is in the <sourceDesc>. Maybe we should take the <title> tags off of the microfilm citation, because we don't want that turning up in this search either. Or we should add an attribute to those that we can use to eliminate them from the drop-down.
An example is http://www.nyu.edu/projects/sanger/webedition/app/documents/show.php?sangerDoc=206332.xml I got to this by following the journal title "The Birth Control Review" but then noticed that the title in the header is "Birth Control Review" and the version with the "The" is in the text. Hope that makes sense.