benwbrum / fromthepage

FromThePage is a wiki-like application for crowdsourcing transcription of handwritten documents.
http://fromthepage.com
GNU Affero General Public License v3.0

Investigate Search UIs #4379

Open anabasftp opened 1 month ago

anabasftp commented 1 month ago

Please investigate search user interfaces on the sites listed below (plus any others you think are worth a look), with a particular focus on how diverse types of items are returned and displayed. Let's document examples with screenshots and URLs, with commentary on what is unique or seems particularly good or useful. Let's put this in a Google Doc, rather than an issue, for now.

Smithsonian's Archives of American Art

Digital Florentine Codex at the Getty

Transkribus Sites

Google Book Search

Google Arts & Culture

Google itself

The BSB (Bayerische Staatsbibliothek, the Bavarian State Library; the site is in German)

Europeana

There's some background/implementation thinking about search that might be useful to have in mind (or not!). These are my notes on how we'll probably implement search for FromThePage. Looking for parallels in the above sites would be useful.

Types of objects returned by search: organization, collection, work, snippet/page, tags

Scope each search: Find a Project, Org Page, Collection Page, Work Page

For each scoped search, “boost” based on what people want to find. Boosting is how you prioritize the obvious thing they are looking for. We could facet on the type of returned object, but keep it very simple (copy Google, not a library): put the types at the top as taggy buttons, not a facet.

You can boost different things for any given scope, e.g. “org name” x 10 and “collection name” x 5.

Find a Project: boost organization names, then collections

Org Page: boost collection names

Collection: hmm?? Pages or works?

Work: page text/snippets may be the only type
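The per-scope boosting above could be sketched roughly as follows. This is a minimal illustration in plain Ruby, not FromThePage code: the `SCOPE_BOOSTS` table, the `rank` helper, the hit structure, and all weights and titles are assumptions made up for the example. The collection and work scopes are left without boosts because the notes have not decided them yet.

```ruby
# Hypothetical boost weights per search scope. The numbers are placeholders,
# not values taken from FromThePage.
SCOPE_BOOSTS = {
  find_a_project:  { organization: 10.0, collection: 5.0 },
  org_page:        { collection: 5.0 },
  collection_page: {}, # undecided in the notes: pages or works?
  work_page:       {}  # page text/snippets may be the only type returned
}.freeze

# A hit is a hash with :type, :title and a base relevance :score.
# Types without a boost in the given scope keep their base score.
def rank(hits, scope)
  boosts = SCOPE_BOOSTS.fetch(scope)
  hits
    .map { |hit| hit.merge(score: hit[:score] * boosts.fetch(hit[:type], 1.0)) }
    .sort_by { |hit| -hit[:score] }
end

hits = [
  { type: :page,         title: "Letter, page 3",      score: 2.0 },
  { type: :collection,   title: "Smith Family Papers", score: 1.0 },
  { type: :organization, title: "Smith College",       score: 0.6 }
]

ranked = rank(hits, :find_a_project)
puts ranked.map { |hit| hit[:title] }
```

In Elasticsearch or Solr the same idea is usually expressed as per-field boosts at query time rather than by re-scoring hits afterwards, so treat this only as a way to reason about the weights.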

Search DBs basically “flatten” the info you want to search on. So we’d define the info for an org (name, description, url), which is then flattened (I think).

Data per object type:

Org: name, description, url

Collection: name, description, tag (metadata??)

Work: name, description, metadata

Page: name, transcription/other text

Snippet: phase 2? tied to pages if we have bounding boxes

Unlike in a database, where empty fields take up space, empty fields cost nothing in these search DBs, which makes sparse data cheaper to store.
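As a rough sketch of the flattening described above, each object could be turned into one flat document that simply omits blank fields. The `INDEXED_FIELDS` table follows the per-type list in these notes, but the attribute names, the `to_search_document` helper, and the `type-id` identifier format are assumptions for illustration, not FromThePage's actual model columns.

```ruby
# Fields per object type, following the list in the notes. Attribute names
# are hypothetical.
INDEXED_FIELDS = {
  organization: %i[name description url],
  collection:   %i[name description tags],
  work:         %i[name description metadata],
  page:         %i[name text]
}.freeze

# Build a flat search document: one hash per object, blank fields left out.
def to_search_document(type, id, attributes)
  doc = { id: "#{type}-#{id}", type: type.to_s }
  INDEXED_FIELDS.fetch(type).each do |field|
    value = attributes[field]
    next if value.nil? || (value.respond_to?(:empty?) && value.empty?)

    doc[field] = value
  end
  doc
end

org_doc  = to_search_document(:organization, 7, { name: "Example Archive", description: "", url: nil })
page_doc = to_search_document(:page, 42, { name: "Page 3", text: "Dear Sister, ..." })

p org_doc
p page_doc
```

Because every document carries a `type` field, the same index could serve all scopes and still let results be grouped or filtered by object type.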

The biggest challenge we’ll run into may be keeping it in sync.

TODO: research Ruby libraries for sending object data to Elasticsearch, Solr, etc. and keeping it in sync.
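To make the sync problem concrete, here is a toy sketch where every write to the primary store also updates the index. `FakeIndex` and `SyncedStore` are invented for this example: the first is an in-memory stand-in for a real Elasticsearch or Solr client, and the second plays the role that model callbacks would likely play in a Rails app. No real library API is shown.

```ruby
# In-memory stand-in for a search index. A real client for Elasticsearch or
# Solr would replace this class.
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  def upsert(id, doc)
    @docs[id] = doc
  end

  def delete(id)
    @docs.delete(id)
  end
end

# Hypothetical record store that notifies the index on every change.
class SyncedStore
  def initialize(index)
    @index = index
    @records = {}
  end

  def save(id, attributes)
    @records[id] = attributes
    @index.upsert(id, attributes) # keep the index current on create/update
  end

  def destroy(id)
    @records.delete(id)
    @index.delete(id) # drop stale documents on delete
  end
end

index = FakeIndex.new
store = SyncedStore.new(index)
store.save("work-1", { name: "Diary, 1862" })
store.save("work-2", { name: "Account book" })
store.save("work-1", { name: "Diary, 1862-1863" }) # update replaces the old document
store.destroy("work-2")

p index.docs
```

The hard cases this sketch ignores are the ones that make sync difficult in practice: writes that bypass the store, index updates that fail after the database write succeeds, and full reindexing after a schema change.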

Analyze the searches from the last month. Where were they run from? What do we think people were trying to find?

Come up with some test searches, e.g. the Getty always had to return the parking info from the website on a “Parking” search, even though they also had images of parking lots and historical institutional docs about the parking lot.

Pay attention to scope.
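Test searches like the “Parking” one could be kept as a small list of query, scope, and expected top result, then checked against whatever engine we end up with. The sketch below is only an illustration: `GOLDEN_CASES`, `failed_golden_cases`, the scope name, the result titles, and the toy search lambda are all made up.

```ruby
# Each case pairs a query and scope with the title that must come back first.
# The "Parking" case mirrors the Getty example; the titles are invented.
GOLDEN_CASES = [
  { query: "Parking", scope: :site, expected_top: "Visit: Parking information" }
].freeze

# `search` is any callable taking (query, scope) and returning ranked titles.
# Returns the cases whose top result is not the expected one.
def failed_golden_cases(search, cases)
  cases.reject { |c| search.call(c[:query], c[:scope]).first == c[:expected_top] }
end

# Toy search function standing in for the real engine.
toy_search = lambda do |query, _scope|
  if query.casecmp?("parking")
    ["Visit: Parking information", "Photo: parking lot, 1974"]
  else
    []
  end
end

failures = failed_golden_cases(toy_search, GOLDEN_CASES)
puts failures.empty? ? "all golden searches pass" : failures.inspect
```

Running such a list after every change to the boost weights would show quickly whether a tweak that helps one scope breaks an obvious search in another.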

TODO: this would be a great intern task