JonathanReeve / sanger

Margaret Sanger Papers Project Search Engine

Search by Journal title is broken #84

CathyHajo closed this issue 9 years ago

CathyHajo commented 9 years ago

The search by journal drop down box is malfunctioning. It still works on the old site.

JonathanReeve commented 9 years ago

The journals dropdown has these items:

A few entries in the parse logs look unusual, and I think they might be causing this. In 236494, the log says "Processing title: The Journal of the >". I'm guessing this comes from this line in the XML:

<title type="journal">The Journal of the <org>American Medical Association</org></title>

So I think the issue may be that the parser for mentioned entities can't handle one mentioned entity nested inside another, such as an <org> inside a <title>. I'll look into this.
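To illustrate the suspected failure mode, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not the project's actual parse script; the helper name title_text is hypothetical. It shows how reading only an element's own text truncates a title at the first nested tag, while collecting text from all descendants recovers the whole title.

```python
import xml.etree.ElementTree as ET


def title_text(title_elem):
    """Return the full text of an element, including text inside nested tags.

    Hypothetical helper for illustration; not taken from the sanger codebase.
    """
    # itertext() yields the element's text and the text of all descendants
    # in document order; split/join collapses any stray whitespace.
    return " ".join("".join(title_elem.itertext()).split())


snippet = (
    '<title type="journal">The Journal of the '
    "<org>American Medical Association</org></title>"
)
elem = ET.fromstring(snippet)

# .text alone stops at the first child element.
truncated = elem.text
# The helper keeps the nested <org> text as part of the title.
full = title_text(elem)

print(repr(truncated))
print(repr(full))
```

If the parser reads titles the first way, a title with a nested tag would be indexed in truncated form, which would match the odd log entry above.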

CathyHajo commented 9 years ago

I have been cleaning up the titles where I can, removing the nested portions from titles. But the drop down list doesn't change after I update them.

JonathanReeve commented 9 years ago

I think I fixed this, actually, so it should be able to handle nested tags in titles now. Just make sure you pull in my recent commits to your local copy. Let me know if everything looks OK.

CathyHajo commented 9 years ago

Hi Jon,

I can't tell if it fixed it-- the drop downs look the same.

Cathy

Cathy Moran Hajo, Ph.D. Associate Editor/Assistant Director The Margaret Sanger Papers Project New York University, Division of Libraries 838 Broadway, Suite 504 New York, NY 10003-4218 (212) 998-8666 cathy.hajo@nyu.edu

Visit our website at: http://www.nyu.edu/projects/sanger

JonathanReeve commented 9 years ago

That's no good. Could you paste a screenshot of the problem on this issue's GitHub page (https://github.com/JonathanReeve/sanger/issues/84), along with the URL you're looking at?

CathyHajo commented 9 years ago

Here's a screenshot-- I fixed the titles with the curly brackets and the one with the < as a title a long time ago.

[Screenshot: capture]

CathyHajo commented 9 years ago

Oh, here's one where the title drop down shows.

[Screenshot: screen]

JonathanReeve commented 9 years ago

Hm, I noticed that in some cases there are two copies of an XML document, one in xml_added and one in xml_queue. (See, for instance, https://github.com/JonathanReeve/sanger/blob/master/xml_added/008951.xml and https://github.com/JonathanReeve/sanger/blob/master/xml_queue/008951.xml .) Maybe the parse script is parsing the one from xml_queue, but your corrections were to a file in xml_added?
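One way to check this would be to list the files that exist in both directories but differ in content. The following is a rough sketch, assuming both directories sit side by side in a local checkout and hold flat lists of .xml files; the function name differing_duplicates is hypothetical and not part of the repository.

```python
import filecmp
from pathlib import Path


def differing_duplicates(added_dir, queue_dir):
    """Return names of .xml files present in both directories whose contents differ.

    Hypothetical helper for illustration; not taken from the sanger codebase.
    """
    added = {p.name for p in Path(added_dir).glob("*.xml")}
    queue = {p.name for p in Path(queue_dir).glob("*.xml")}
    differing = []
    # Only names found in both directories are candidates.
    for name in sorted(added & queue):
        # shallow=False forces a byte-by-byte comparison of the two copies.
        same = filecmp.cmp(
            Path(added_dir) / name, Path(queue_dir) / name, shallow=False
        )
        if not same:
            differing.append(name)
    return differing
```

Calling differing_duplicates("xml_added", "xml_queue") from the repository root would return the file names where a correction reached only one of the two copies.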

CathyHajo commented 9 years ago

Hi Jon,

I have been adding them to both directories each time. Is that not right?

Cathy