emory-libraries / dlp-curate

Digital curation and preservation workbench for the Emory Preservation Repository.

Index full page text into a new SOLR field using existing ALTO files #2256

Open eporter23 opened 6 months ago

eporter23 commented 6 months ago

As noted in the epic and in planning discussions, we will check each work's FileSets to see whether they contain an ALTO XML file (which has the file use "Extracted"). That page-level XML file contains the text content as well as page coordinates.
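For illustration, here is a minimal sketch of pulling text and coordinates out of a page-level ALTO file. It is written in Python for brevity and is not the Curate implementation. It relies on the standard ALTO layout, in which each word is a `String` element carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def alto_words(xml_text):
    """Return (content, hpos, vpos, width, height) for each ALTO String element."""
    root = ET.fromstring(xml_text)
    words = []
    for el in root.iter():
        # Compare the local name only, so any ALTO namespace version matches.
        if el.tag.rsplit("}", 1)[-1] == "String":
            words.append((
                el.get("CONTENT", ""),
                el.get("HPOS"),
                el.get("VPOS"),
                el.get("WIDTH"),
                el.get("HEIGHT"),
            ))
    return words


def alto_page_text(xml_text):
    """Join the word contents into one plain-text string for full-text indexing."""
    return " ".join(w[0] for w in alto_words(xml_text) if w[0])


# Hypothetical sample page, not taken from the repository.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Emory" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
    <SP/>
    <String CONTENT="University" HPOS="65" VPOS="20" WIDTH="90" HEIGHT="12"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""

page_text = alto_page_text(SAMPLE)
first_word = alto_words(SAMPLE)[0]
```

The plain-text string is what a full-text field would need, while the per-word tuples carry the coordinates that highlighting depends on.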

If the "Extracted" file is a .pos file, we do not want to use it.

If there is no "Extracted" file attached to the FileSet, we can instead look for a .txt file ("Transcript File"). These contain text data but no word coordinates. This should still provide some IIIF search-within capability.
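The selection rules above could be sketched roughly as follows. This is Python pseudologic rather than the actual Curate code: the dictionary shape (`file_use`, `filename`) and the function name are hypothetical. It also makes one assumption the issue does not state, namely that a FileSet whose only "Extracted" file is a .pos file falls back to the transcript.

```python
def pick_text_source(files):
    """Choose which attached file to index for full text.

    `files` is a list of dicts with hypothetical keys "file_use" and "filename".
    Returns ("alto", file), ("transcript", file), or None.
    """
    for f in files:
        # Prefer the page-level ALTO XML stored under the "Extracted" file use.
        if f["file_use"] == "Extracted" and f["filename"].lower().endswith(".xml"):
            return ("alto", f)

    # "Extracted" .pos files are never used. ASSUMPTION: in that case we
    # fall back to the transcript, the same as when no "Extracted" file exists.
    for f in files:
        if f["file_use"] == "Transcript File" and f["filename"].lower().endswith(".txt"):
            return ("transcript", f)

    return None


# Hypothetical FileSets for illustration.
with_alto = [
    {"file_use": "Extracted", "filename": "page_0001.xml"},
    {"file_use": "Transcript File", "filename": "page_0001.txt"},
]
with_pos = [
    {"file_use": "Extracted", "filename": "page_0001.pos"},
    {"file_use": "Transcript File", "filename": "page_0001.txt"},
]
no_text = [{"file_use": "Extracted", "filename": "page_0001.pos"}]
```

A caller can use the returned kind to decide whether word coordinates are available ("alto") or only plain text ("transcript").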

Examples of works with page-level ALTO files:

This work contains page-level ALTO and has already been indexed for full text search for the entire work: https://curate-test.library.emory.edu/concern/curate_generic_works/453wstqk05-cor?locale=en&page=2

This work also contains ALTO, but has not been indexed for full text yet: https://curate-test.library.emory.edu/concern/parent/7203xsj44s-cor/file_sets/501pg4f51w-cor

Examples of works without ALTO files:

https://curate-test.library.emory.edu/concern/parent/28380gb5xh-cor/curate_generic_works/4300zpc8g9-cor

https://curate-test.library.emory.edu/concern/curate_generic_works/846d2547r6-cor?locale=en

eporter23 commented 1 month ago

Current status: we are looking at using a Solr plugin to assist with highlighting behavior. The indexing work is underway and the Solr fields are in place. If the Solr field contains the text of the XML, the plugin automatically converts it into hOCR, which the UV needs for the highlighting behavior.