IUBLibTech / newton_chymistry

New version of 'The Chymistry of Isaac Newton', using XProc pipelines to generate a website based on TEI XML encodings of Newton's alchemical manuscripts, and Apache Solr as a search engine.

#90 Adds global text search field that includes TEI Header nodes #100

Open randalldfloyd opened 3 years ago

randalldfloyd commented 3 years ago

Adding a new field for text searching that includes text nodes from the TEI header. An additional field allows for creating separate behaviors between the advanced search text field and the global site search.

Conal-Tuohy commented 3 years ago

I don't think the change to the XSLT is needed. The additional field element in the search-fields.xml file should do the trick by itself.

The update-schema-from-field-definitions.xsl stylesheet already includes code to automatically add all the fields which are defined in the search-fields.xml file.

NB that stylesheet also explicitly adds the three "full text" fields diplomatic, normalized, and introduction independently, because they aren't defined in the search-fields.xml file. The reason they aren't defined there is a bit complicated, but in brief: we want to populate those fields with text which exactly matches the web pages, so that the web pages can have hits highlighted in them, based on matches returned by Solr's hit-highlighting. Those web pages are the output of a chain of quite complex XSLT transformations (which have to suppress orig or reg elements, etc). In theory you could extract equivalent text from the TEI using an XPath expression that also suppressed the appropriate elements' content, but in practice that seemed to me unwise, since any discrepancy between the field value and the web page's content would break the hit highlighting.

Conal-Tuohy commented 3 years ago

NB if you want the new tei-header field to also appear in the search form, you would need to give it a label attribute, e.g.

    <field name="tei-header" label="Metadata" xpath="/TEI/teiHeader//text()"/>

Without a label attribute, any new field will still get added to the Solr schema, and the Solr field will get populated by the indexer (evaluating its xpath attribute), but it will not appear on the main search form. Fields without a label are invisible to the search UI, though they can have their uses: the id field, for example, is purely there to provide a unique ID for the record in Solr.
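To make the labelled/unlabelled distinction concrete, here is a minimal Python sketch. It is not the project's code: the `<fields>` wrapper element, the xpath given for the id field, and the `partition_fields` helper are all assumptions made for illustration. Only the `name`, `label`, and `xpath` attributes come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for search-fields.xml. The <fields> wrapper and the
# xpath of the "id" field are assumptions, not taken from the real file.
SEARCH_FIELDS = """
<fields>
    <field name="id" xpath="/TEI/@xml:id"/>
    <field name="tei-header" label="Metadata" xpath="/TEI/teiHeader//text()"/>
</fields>
""".strip()

def partition_fields(xml_source):
    """Return (names of all fields, names of fields that carry a label)."""
    root = ET.fromstring(xml_source)
    fields = list(root.iter("field"))
    # Every defined field would be added to the Solr schema and indexed.
    indexed = [f.get("name") for f in fields]
    # Only fields with a label attribute would be shown on the search form.
    on_form = [f.get("name") for f in fields if f.get("label")]
    return indexed, on_form

indexed, on_form = partition_fields(SEARCH_FIELDS)
print("indexed:", indexed)
print("on search form:", on_form)
```

Under these assumptions both fields are indexed, but only tei-header would be offered on the form.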

Conal-Tuohy commented 3 years ago

The only other thing I'd be wary of is that this XPath expression might merge the content of adjacent elements into a single word if there is no white space between the elements, e.g.

    <p>Blah blah ... blah</p><p>Blah blah blah.</p>

Would produce Blah blah ... blahBlah blah blah.

Maybe it would be safer to use the string-join() function to explicitly add white space between each text node? e.g. string-join(/TEI/teiHeader//text(), ' ')
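The difference can be sketched in Python, with ElementTree's `itertext()` standing in for the `//text()` step and `" ".join(...)` standing in for `string-join(..., ' ')`. This only illustrates the effect. It is not how the indexer evaluates the field, and the sample fragment and function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up header fragment with no white space between the two <p> elements.
SAMPLE = "<teiHeader><p>Blah blah ... blah</p><p>Blah blah blah.</p></teiHeader>"

def text_nodes(xml_source):
    """Collect the descendant text nodes in document order."""
    return list(ET.fromstring(xml_source).itertext())

def concatenated(xml_source):
    """Text nodes run together, as when the node sequence is simply concatenated."""
    return "".join(text_nodes(xml_source))

def space_joined(xml_source):
    """Text nodes separated by a single space, like string-join(..., ' ')."""
    return " ".join(text_nodes(xml_source))

print(concatenated(SAMPLE))
print(space_joined(SAMPLE))
```

In the first result the last word of one paragraph fuses with the first word of the next ("blahBlah"); in the second the two remain separate tokens.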

randalldfloyd commented 3 years ago

@Conal-Tuohy Thanks for the guidance on this. Also thanks for the additional comments you left over in the issue conversation. That helped solve a major mystery in my mind, which was how the actual document text was being put into the Solr fields after their definition. Going by their names only, I thought the XProc steps and stylesheets you pointed out were just for transforming P5 to HTML in the request, so I never looked at them, but now I see how they are used to transform to the Solr doc in the index pipeline.

mdalmau commented 2 years ago

@randalldfloyd : I am not really sure where we left off with this .... maybe when you get a breather (ha!) later in April, we can revisit?

randalldfloyd commented 2 years ago

@mdalmau I'll tell you honestly what I remember from this, and then you can tell me if it was just wishful thinking or not. After putting in a fair amount of work to demonstrate the ability to alter the keyword search behavior, you put out a message to the group asking for xpaths that could be included in the indexing of the text search field. To that, someone (Bill maybe?) responded that they couldn't see what the real need for this was, or what the problem was as it currently works, and nobody else responded that I was copied on. I had a test branch deployed somewhere, but it was probably lost in the moving around of services.