minor: poem titles as search results are coming up under previous poem title

ba001 commented 7 years ago

search 'marriage heaven hell'

notice that the first result (which is for the title of the Marriage of Heaven and Hell) is under For Children the Gates of Paradise, which is the poem before it

ba001 commented 7 years ago

@queryluke could this just be a matter of mislabeling, or is it deeper?

queryluke commented 7 years ago

It's deeper. When the Erdman XML is parsed, it creates Page JSON objects. These objects have a "headings" attribute, which is a nested array of heading ids. For page 33, the "headings" attribute is:

"headings":["[[\"b1\", [[\"b1.6\", []], [\"b1.7\", [[\"b1.7.1\", []]]]]]]"],

b1.6 is For Children, because there is a little bit of this poem on page 33; b1.7 is Marriage.

So this is a bit of a conundrum. Right now the script always selects the first 2nd-level header in the list (in this case b1.6). I can switch it to always accept the LAST 2nd-level header (b1.7), but then a search for "mother sister" (the last line of Children) would show the result under Marriage.
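To make the first-versus-last trade-off concrete, here is a minimal sketch that decodes the page 33 value quoted above and lists its 2nd-level heading ids. The helper name `headingIdsAtDepth` is made up for illustration and is not taken from the actual script.

```javascript
// Page 33's "headings" value as quoted above: an array holding one JSON-encoded string.
const page33 = {
  headings: ['[["b1", [["b1.6", []], ["b1.7", [["b1.7.1", []]]]]]]'],
};

// Hypothetical helper (not from the real codebase): collect heading ids at a
// given depth, where each node is a pair [id, children] and depth 1 is the top level.
function headingIdsAtDepth(nodes, depth) {
  if (depth === 1) return nodes.map(([id]) => id);
  return nodes.flatMap(([, children]) => headingIdsAtDepth(children, depth - 1));
}

const tree = JSON.parse(page33.headings[0]);
const secondLevel = headingIdsAtDepth(tree, 2);

// The two page-level policies discussed above. Each one is wrong for some line on this page.
const firstChoice = secondLevel[0];                     // current behavior: For Children
const lastChoice = secondLevel[secondLevel.length - 1]; // proposed switch: Marriage

console.log(secondLevel, firstChoice, lastChoice);
```

Either policy assigns one heading to the whole page, so any page that spans two poems will mislabel the lines of one of them.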

Fixing this on the javascript side is nearly impossible and ugly. So it's something you'll want to discuss with Nathan. I'm sure he'll have his own ideas on how to fix it.

ba001 commented 7 years ago

ok, gonna assign to nathan

nathan-rice commented 7 years ago

It isn't clear exactly what the desired behavior is here. As Luke mentioned, the page title is set to the first header. I can set it to the last header, or the second header (if there is more than one).

ba001 commented 7 years ago

The title in the results should be the poem title of the poem/work that contains the line

nathan-rice commented 7 years ago

The information isn't stored that way. Page titles are mapped to headers in a one-to-one relation. If you want, I can stuff all the headers into the mapping, and you can write JavaScript to pick the one that should actually be displayed.

ba001 commented 7 years ago

The number of titles per page is inconsistent, so choosing first or second would be arbitrary.

Basically the result should correspond to the actual poem it's in

ba001 commented 7 years ago

Ok, I'll confer with joe

ba001 commented 7 years ago

had a look more closely. i'm not sure what to say. if someone searches "marriage heaven hell" and the title of the poem "The Marriage of Heaven and Hell" is a result, then that is what should show as the header of the result, not the previous poem's title. i understand the issue in the code, but we do need to fix it. i'm not sure what you mean by stuffing all the headers in the mapping--i haven't looked at the code closely--but if you did that, how would we select the right one in javascript? we'd have to do another mini search in the javascript? is there a way to detect a result coming from a poem/work title and then use that title as the result header?

nathan-rice commented 7 years ago

The problem here stems from the fact that your unit of data is a page, but your desired unit of search results is not.

In my opinion the best option is not to use the page heading in the search results, but instead use the page number. That is technically correct and avoids confusion.

Probably the most direct way to get the behavior you want is to do a JavaScript search on the page for the relevant text, then work backwards in the DOM from that text node to the previous heading, which you then use for the title. Any other option would require completely redoing how data is stored in Solr, which would basically involve rewriting the entire application.
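A rough sketch of that backwards walk, using only standard DOM node properties (`previousSibling`, `lastChild`, `parentNode`, `nodeType`, `tagName`). It assumes headings are rendered as `h1` to `h6` elements, which may not match the real page markup; the function names are invented for illustration.

```javascript
// Assumption: headings in the rendered page are h1-h6 elements.
// Pass a different predicate if the real markup uses classes or other tags.
function isHeadingElement(node) {
  return node.nodeType === 1 && /^H[1-6]$/i.test(node.tagName);
}

// Step to the node that comes immediately before `node` in document order.
function previousInDocumentOrder(node) {
  if (node.previousSibling) {
    let prev = node.previousSibling;
    // Descend to the deepest last descendant of the previous sibling.
    while (prev.lastChild) prev = prev.lastChild;
    return prev;
  }
  return node.parentNode; // null once we walk off the top of the tree
}

// Walk backwards from the matched text node until a heading is found.
// The start node and its ancestors are checked too, so a hit inside a
// heading (a title match) resolves to that heading itself.
function findPrecedingHeading(startNode, isHeading = isHeadingElement) {
  for (let node = startNode; node; node = previousInDocumentOrder(node)) {
    if (isHeading(node)) return node;
  }
  return null; // no heading precedes the match on this page
}
```

A `null` result means the matched line continues a poem whose heading sits on an earlier page, so the caller still needs a fallback there, possibly the first heading listed for the page.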

ba001 commented 7 years ago

we can't use the page number because then the results wouldn't amount to a proper concordance and the information conveyed would be a lot less useful.

we'll have to go the javascript way. @queryluke, is this the solution in the javascript that you were thinking of?

ba001 commented 7 years ago

@nathan-rice i wanted to remind you of this issue. joe v. just pointed out another instance of it. search "sin". the second result under THE [ FIRST ]BOOK OF URIZEN is actually a line in THE BOOK of AHANIA, which comes after THE [ FIRST ]BOOK OF URIZEN

nathan-rice commented 7 years ago

Page 84 occurs under both the Urizen and Ahania headings due to the structure of the XML. Currently the javascript groups query result text by heading, using the first heading on the page. As a result, though it is in Ahania, the heading for the result is Urizen.
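As an illustration only, the grouping described above behaves roughly like the sketch below. The shape of the hit objects and the name `groupByFirstPageHeading` are assumptions, not the actual code, and the `text` values are placeholders rather than real lines.

```javascript
// Hypothetical hit shape: each search hit carries the headings of the page it came from.
const hits = [
  { pageHeadings: ['THE [ FIRST ]BOOK OF URIZEN'], text: '(placeholder: a line in Urizen)' },
  {
    page: 84,
    pageHeadings: ['THE [ FIRST ]BOOK OF URIZEN', 'THE BOOK of AHANIA'],
    text: '(placeholder: a line in Ahania)',
  },
];

// Group hits under the FIRST heading of their page, as the thread describes.
function groupByFirstPageHeading(items) {
  const groups = new Map();
  for (const item of items) {
    const key = item.pageHeadings[0];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return groups;
}

const grouped = groupByFirstPageHeading(hits);
// Both hits land under Urizen, even though the page 84 line belongs to Ahania.
console.log([...grouped.keys()]);
```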

Changing this behavior (for example, by taking the last heading) will just break other cases. The best available fix for this case is to move the <pb page="#"> outside the <div2> containing Urizen. Beyond that, there isn't really a good general solution to this problem given the current data model, and I doubt the problem is a big enough deal to warrant overhauling that.

blakearchive / erdman

minor: poem titles as search results are coming up under previous poem title #54