Index full page text into a new Solr field using existing ALTO files

As noted in the epic and in planning discussions, we will check each work's FileSets to see whether they contain an ALTO XML file (which has the file use "Extracted"). That page-level XML file contains the text content as well as its coordinates on the page.
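As a rough illustration of what can be read from a page-level ALTO file, here is a minimal Python sketch. It assumes the standard ALTO `String` element with its `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, and `HEIGHT` attributes; the function names and the sample document are hypothetical and are not taken from the Curate codebase.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real ALTO files carry far more structure.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Emory" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
    <SP/>
    <String CONTENT="University" HPOS="65" VPOS="20" WIDTH="90" HEIGHT="12"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_alto_words(alto_xml):
    """Return one dict per ALTO String element: its text and bounding box."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
            "width": float(element.get("WIDTH", 0)),
            "height": float(element.get("HEIGHT", 0)),
        })
    return words


def page_text(words):
    """Join the word texts into a single string suitable for indexing."""
    return " ".join(word["text"] for word in words if word["text"])


words = extract_alto_words(SAMPLE_ALTO)
print(page_text(words))
```

The joined string is what would go into the full-text field, while the per-word boxes are what a search-within feature could use to highlight hits.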

If the "Extracted" file is a .pos file, we do not want to use it.

If there is no "Extracted" file attached to the FileSet, we can instead look for a .txt file ("Transcript File"). It will contain text data but no word coordinates. This should still provide some search-within capability in IIIF.
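The selection rules above can be sketched as follows. This is only an illustration in Python under stated assumptions: a FileSet is modelled as a plain dict mapping a file use to a filename, `choose_text_source` is a hypothetical name, and falling back to the transcript when the "Extracted" file is a .pos file is an assumption that the description does not state explicitly.

```python
from pathlib import PurePosixPath


def extension(filename):
    """Lower-cased file extension, e.g. '.xml'."""
    return PurePosixPath(filename).suffix.lower()


def choose_text_source(file_set):
    """Pick the file to index for one FileSet.

    `file_set` is a hypothetical dict of file use -> filename.
    Returns a (kind, filename) tuple, or None if nothing is usable.
    """
    extracted = file_set.get("Extracted")
    if extracted and extension(extracted) == ".xml":
        # ALTO XML: text plus word coordinates.
        return ("alto", extracted)

    # Either no "Extracted" file, or one we do not want (such as .pos).
    # Assumption: in both cases we try the transcript next.
    transcript = file_set.get("Transcript File")
    if transcript and extension(transcript) == ".txt":
        # Plain text: searchable, but without word coordinates.
        return ("transcript", transcript)

    return None


print(choose_text_source({"Extracted": "page1.xml", "Transcript File": "page1.txt"}))
print(choose_text_source({"Extracted": "page1.pos", "Transcript File": "page1.txt"}))
print(choose_text_source({"Extracted": "page1.pos"}))
```

Returning the kind alongside the filename lets the caller decide whether coordinates are available for that page.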

Examples of works with page-level ALTO files:

This work contains page-level ALTO and has already been indexed for full-text search for the entire work: https://curate-test.library.emory.edu/concern/curate_generic_works/453wstqk05-cor?locale=en&page=2

This work also contains ALTO, but has not been indexed for full text yet: https://curate-test.library.emory.edu/concern/parent/7203xsj44s-cor/file_sets/501pg4f51w-cor

Examples of works without ALTO files:

https://curate-test.library.emory.edu/concern/parent/28380gb5xh-cor/curate_generic_works/4300zpc8g9-cor

https://curate-test.library.emory.edu/concern/curate_generic_works/846d2547r6-cor?locale=en

emory-libraries / dlp-curate, issue #2256