Open eporter23 opened 9 months ago
Current status: we are looking at using a Solr plugin to assist with the highlighting behavior. Indexing work is also underway, and the Solr fields are in place. If a Solr field contains the text of the XML, the plugin automatically converts it into hOCR, which the UV needs for the highlighting behavior.
As noted in the epic and in planning discussions, we will check each work's FileSets to see whether they contain an ALTO XML file (which has the file use of "Extracted"). That page-level XML file contains the text content as well as page coordinates.
If the "Extracted" file is a .pos file, we do not want to use it.
If there is no "Extracted" file attached to the FileSet, we can instead look for a .txt file ("Transcript File"). These contain text data but no word coordinates. This should still provide some IIIF search-within capability.
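The selection rules above could be sketched roughly as follows. This is only an illustration in Python, not the application's actual code: the function name `choose_text_source`, the dict keys `use` and `filename`, and the `"alto"` / `"transcript"` labels are all hypothetical. It also assumes that a FileSet whose only "Extracted" file is a .pos file falls back to the transcript, which the rules above do not state explicitly.

```python
from typing import Optional


def choose_text_source(files: list[dict]) -> Optional[tuple[str, dict]]:
    """Pick the file to index for full text from one FileSet's files.

    Each entry in `files` is a hypothetical dict such as
    {"use": "Extracted", "filename": "page_0001.xml"}.
    """
    # Prefer an "Extracted" XML file (assumed to be ALTO): text plus coordinates.
    # "Extracted" .pos files never match this check, so they are skipped.
    for f in files:
        if f["use"] == "Extracted" and f["filename"].lower().endswith(".xml"):
            return ("alto", f)

    # Otherwise fall back to a plain-text transcript: text only, no coordinates.
    for f in files:
        if f["use"] == "Transcript File" and f["filename"].lower().endswith(".txt"):
            return ("transcript", f)

    # Nothing usable for full-text indexing on this FileSet.
    return None
```

A caller would index the returned file into the highlighting field when the label is `"alto"`, and into a plain text field when it is `"transcript"`.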
Examples of works with page-level ALTO files:
This work contains page-level ALTO and has already been indexed for full text search for the entire work.
https://curate-test.library.emory.edu/concern/curate_generic_works/453wstqk05-cor?locale=en&page=2
This work also contains ALTO, but has not been indexed for full text yet.
https://curate-test.library.emory.edu/concern/parent/7203xsj44s-cor/file_sets/501pg4f51w-cor
Examples of works without ALTO files:
https://curate-test.library.emory.edu/concern/parent/28380gb5xh-cor/curate_generic_works/4300zpc8g9-cor
https://curate-test.library.emory.edu/concern/curate_generic_works/846d2547r6-cor?locale=en