Open katfish2 opened 2 years ago
@katfish2 thank you for submitting this request; I appreciate the use cases you've supplied. FYI, our software engineers have recently been researching this, and if we move forward on any prototyping I will be in touch.
@eporter23, that's great news! I look forward to hearing what they find out and whether this enhancement can move forward at some point.
As of March 2023, we have a preliminary prototype of full-text search (across the repository) within Curate. Additional work is needed to reindex applicable text materials and to add this functionality to the Emory Digital Collections front-end. A second round of development is also needed to enable searching within a specific work in the Universal Viewer (page-level results). We will resume work on this as the development team has time, but they are heavily allocated this spring.
We are prioritizing the remainder of this work to be completed in an enhancement work cycle in FY24.
Is your feature request related to new functionality not yet included in the product? Please describe.
Is your feature request related to a problem or a change to existing functionality? Please describe.

Rose Library staff would like OCR files and other textual content in the repository to be indexed and searchable in Lux. This would dramatically improve access by allowing discovery and retrieval of items based on terms found within the objects but not specified in their metadata. Currently it is difficult for users to locate all relevant content, especially if their research topics are unusual or use archival materials in ways we didn't anticipate when creating metadata.

Full-text search is currently possible in Luna and is one reason we have not yet moved toward migrating the Emory Wheel: end users of that collection rely heavily on the ability to search for specific names, dates, events, etc. within the newspaper, and it would not be possible to capture all of the necessary information in metadata. Also, if we are able in the future to implement more folder-level digitization and batch metadata assignment for archival collections, having the option to OCR and search within the content could help balance out the reduced granularity of the descriptive metadata.
Describe the solution you'd like

Ideally we would like a user to be able to enter keywords in the EDC search box and retrieve any objects in which those terms appear. This could be the default behavior, or the user could be required to select whether to search metadata only or full text. It would also be helpful, but not essential, for the location(s) of the terms to be highlighted when the user views an item from the list of results (CONTENTdm is one example of a system that does this).
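To make the requested behavior concrete, here is a minimal sketch in plain Python. It is not how Curate or Lux is implemented, and every name in it (`Item`, `search`, `highlight`, the `scope` values, the sample records) is hypothetical. It only illustrates the two options described above, searching metadata only versus metadata plus full text, along with simple highlighting of matched terms.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record: descriptive metadata plus OCR text (empty if no OCR exists).
    item_id: str
    metadata: str
    full_text: str = ""


def _tokens(text):
    # Lowercase alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def search(items, query, scope="all"):
    """Return ids of items that contain every query term.

    scope="metadata" searches descriptive metadata only;
    scope="all" searches metadata plus OCR full text.
    """
    terms = set(_tokens(query))
    hits = []
    for item in items:
        text = item.metadata
        if scope == "all":
            text += " " + item.full_text
        if terms and terms <= set(_tokens(text)):
            hits.append(item.item_id)
    return hits


def highlight(text, query):
    """Wrap each whole-word occurrence of a query term in [[...]]."""
    terms = _tokens(query)
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", text)


# Made-up sample records, for illustration only.
items = [
    Item("wheel-001", "Emory Wheel, campus newspaper issue",
         "Commencement speaker announced for spring ceremony"),
    Item("photo-002", "Photograph of commencement procession"),
]

metadata_hits = search(items, "commencement", scope="metadata")
all_hits = search(items, "commencement", scope="all")
marked = highlight(items[0].full_text, "commencement")
```

In this toy data, the metadata-only search misses the newspaper issue because the term appears only in its OCR text, which is the gap this request describes. Items without OCR (like the photograph) remain findable only through their metadata.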
Describe alternatives you've considered

Current workarounds:
1) Directing researchers to download PDFs (when available) and search within them. This requires starting with a known item, or downloading and searching within multiple files, instead of using full-text search to locate promising items in the first place.
2) Enhancing metadata when possible to highlight notable content and increase access points.
3) Leaving content for which full-text search is particularly important (e.g., the Wheel) in other systems, keeping our discovery and access landscape fragmented.
How will this impact users?

Benefits: We expect all users, whether internal/staff users, Emory students, or outside researchers, would benefit from this functionality. It would align the system more closely with users' expectations based on other common database and repository tools, enhance content discoverability, and increase the efficiency of many search and retrieval behaviors.

Drawbacks: The change would increase the complexity of searching in EDC, so it might require retraining or additional help features for researchers. It might make relevance rankings more complicated and/or less predictable. Because current digitization workflows don't include OCR for all text-based content, we would end up with some items that are fully searchable and some that aren't; this could be confusing or frustrating to users.