bodleian / medieval-mss

Medieval Manuscripts in Oxford Libraries: TEI catalogue descriptions
https://medieval.bodleian.ox.ac.uk

indexing of dates, etc., in multi part manuscripts #180

Open holfordm opened 5 years ago

holfordm commented 5 years ago

@ahankinson and I had an email conversation about this in March 2017. The question relates to the indexing of multi-part manuscripts where each part may have a different material, century of origin, county of origin, etc. An example is MS. Ashmole 210 https://medieval.bodleian.ox.ac.uk/catalog/manuscript_315

Currently if you filter for manuscripts with decoration produced in Italy in the fourteenth century, Ashmole 210 is one of the results. https://medieval.bodleian.ox.ac.uk/?f%5Bms_date_sm%5D%5B%5D=14th+Century&f%5Bms_deconote_b%5D%5B%5D=true&f%5Bms_origin_sm%5D%5B%5D=Italy&f%5Btype%5D%5B%5D=manuscript

This is because one of its parts is from Italy [but is seventeenth century and does not have decoration], another has decoration and is fourteenth century. The question is how far this is a wrong / misleading result. Andrew wrote at the time: "Determining whether a MSS meets their needs should be left to the user to decide, which means we should favour recall over precision (i.e., showing more results, even if they are potentially irrelevant to their specific needs). If they perform the type of search you are anticipating then they will still find it; they may just have to sift through a few more MSS than they might otherwise.

This is preferable to the alternative, which is that we do not show potentially relevant results because the user did not know how to express their intentions to our system. This would be favouring precision over recall. This is a much more difficult thing to get right, especially with data that's as multifaceted as this catalogue"
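The cross-part match described above can be reproduced in a few lines. This is a minimal sketch, not the catalogue's actual indexing code: the two parts are simplified from the description of MS. Ashmole 210 (the English origin of the decorated part is an assumption for illustration), and the `flatten`, `matches` and `part_matches` helpers are hypothetical.

```python
# Simplified, hypothetical part data loosely based on MS. Ashmole 210.
parts = [
    {"origin": "Italy", "century": "17th Century", "decorated": False},
    {"origin": "England", "century": "14th Century", "decorated": True},
]

def flatten(parts):
    """Merge per-part values into one manuscript-level record with multi-valued fields."""
    return {
        "origin": {p["origin"] for p in parts},
        "century": {p["century"] for p in parts},
        "decorated": any(p["decorated"] for p in parts),
    }

def matches(record, origin, century):
    """Apply the three facet filters to a flattened manuscript record."""
    return (
        origin in record["origin"]
        and century in record["century"]
        and record["decorated"]
    )

def part_matches(part, origin, century):
    """Apply the same filters to a single part."""
    return (
        part["origin"] == origin
        and part["century"] == century
        and part["decorated"]
    )

manuscript = flatten(parts)
# The flattened manuscript record satisfies all three filters...
manuscript_hit = matches(manuscript, "Italy", "14th Century")
# ...although no individual part does.
any_part_hit = any(part_matches(p, "Italy", "14th Century") for p in parts)
print(manuscript_hit, any_part_hit)
```

The manuscript-level record matches because each filter is satisfied by a different part, which is the recall-over-precision behaviour under discussion.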

I'd like to reopen this as a question to get the opinion of @andrew-morrison, @eifionjones and other cataloguers about what the desirable behaviour would be in such cases.

andrew-morrison commented 5 years ago

This would require indexing and displaying parts as separate records, in the same index as the non-composite manuscripts. So that instead of "MS. Ashmole 210" being one hit returned for any given search query, there would be potentially up to five hits, one for each of the five parts. That way, none would appear in the example you gave, because they'd each have their own presence in the filters. But each would then have to have its own web page, which the XSL could be modified to build, repeating the common bits in each. That would be better for people looking for Italian decorated 14th century material, but maybe not as good in other scenarios. Also, msPart has been used for other things, such as endleaves.

It would be a lot of development work. Pretty much every indexing script would need modification, because anything that potentially links to parts would need changing.

After Medieval, which has almost 1000 msPart elements, Genizah is next with 175, but as that is collections of fragments there probably wouldn't be much to gain, as even the ones not divided into parts are made up of works with lots of different provenances.

About 10% of Fihrist records are probably composite manuscripts, but except for 100 mostly Wellcome Collection ones, they haven't been marked up using msPart. This is presumably because the old web site, as far as I can tell, ignored parts. Instead, a workaround of internal cross-references was adopted by some, and then erroneously copied by others. We've been considering whether/how to convert these to use msPart, but the trouble is there are some that are genuinely multi-part with multiple distinct origins and others containing works with origins that differ only slightly (e.g. the first ten dated 1172 and the next five 1173).

So this would be a Medieval-only enhancement for the foreseeable future.

An alternative might be to display multiple lines under composite manuscripts in the search results, where currently there is just the one "Contents:". Possibly something like this:

[image: multi-part-mockup]

It's still appearing in the search results that arguably it shouldn't, but at least users could probably figure out why, without having to click the link and read the entire manuscript.

holfordm commented 5 years ago

The idea of listing each part separately looks like it could be a good temporary solution.

I don't know enough about SOLR / Blacklight, but would we really have to index and display parts as separate records (which would be undesirable)? The ideal solution would be to still have the 'parent' or 'master' record for (say) MS. Ashmole 210, but for some of the information in that record to be in separate 'part' records linked to the parent - something like the following. But maybe this isn't technically possible?

<doc>
      <field name="type">manuscript_part</field>
      <field name="parent_ms">manuscript_315</field>
      <field name="id">manuscript_315-Part 1</field>
      <field name="ms_materials_sm">Parchment</field>
      <field name="ms_deconote_b">true</field>
      <field name="ms_digitized_s">No</field>
      <field name="lang_sm">Latin</field>
      <field name="lang_sm">English</field>
      <field name="ms_origin_sm">England</field>
      <field name="ms_date_sm">14th Century</field>
</doc>

ahankinson commented 5 years ago

Perhaps it would be helpful for you to say what you would expect to see in a search for, e.g. 17th C and Italy. What would the list of results look like?

At the MS level it is certainly true that Ashmole 210 meets those criteria. It may not at the individual part level, but because we don’t index those as separate retrievable records we can’t return them as a result.

Even the parent/child separation would have the same result. We could index individual parts (the children) but retrieve and show the MS record (the parent). But the end result would be the same. If one or more children meet one or more criteria, we would have to show Ashmole 210. It’s just a more complicated way of arriving at the same result.

ahankinson commented 5 years ago

Sorry didn’t mean to close!

andrew-morrison commented 5 years ago

We could index parts alongside manuscripts, moving the relevant index fields for facets such as Origin and Century from the latter to the former. And we could probably find the bit of the Blacklight code that builds the links in search results and modify it so that, while there would be pages such as /catalog/part_999, nobody would find them because the links would be changed to /catalog/manuscript_123#part_2, for example. It is buried more deeply than other things we've tweaked, and might be broken by future upgrades.

Then the MS. Ashmole 210 problem wouldn't occur, because the Solr record for that manuscript wouldn't have anything in those facet fields. It would disappear from results as soon as either Italy or 14th Century were selected, and instead the parts which match would be listed, then all of those would be gone when the second filter is applied. Only parts from other manuscripts, plus entire non-composite manuscripts, which are specifically from Italy in the 14th Century, would be returned.
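A rough sketch of how that filtering would play out, assuming a toy in-memory index rather than Solr itself. The record for `manuscript_900` is a made-up non-composite manuscript, the `part_no` field is hypothetical, and `result_link` only imitates the link rewriting suggested above; none of this is existing Blacklight code.

```python
# Hypothetical index: the composite manuscript carries no origin/century
# facet values; its parts and non-composite manuscripts do.
index = [
    {"id": "manuscript_315", "type": "manuscript", "origin": [], "century": []},
    {"id": "part_1", "type": "manuscript_part", "parent_ms": "manuscript_315",
     "part_no": 1, "origin": ["England"], "century": ["14th Century"]},
    {"id": "part_2", "type": "manuscript_part", "parent_ms": "manuscript_315",
     "part_no": 2, "origin": ["Italy"], "century": ["17th Century"]},
    {"id": "manuscript_900", "type": "manuscript",
     "origin": ["Italy"], "century": ["14th Century"]},
]

def apply_filter(records, field, value):
    """Keep only records whose multi-valued field contains the value."""
    return [r for r in records if value in r[field]]

def result_link(record):
    """Point part hits at an anchor on the parent page, not their own page."""
    if record["type"] == "manuscript_part":
        return f"/catalog/{record['parent_ms']}#part_{record['part_no']}"
    return f"/catalog/{record['id']}"

after_origin = apply_filter(index, "origin", "Italy")
after_both = apply_filter(after_origin, "century", "14th Century")

print([result_link(r) for r in after_origin])
print([result_link(r) for r in after_both])
```

After the first filter the Italian part is listed in place of the whole manuscript; after the second, only the non-composite manuscript remains.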

The downside is that multiple parts from the same composite manuscripts whose origins are all the same or similar might flood the search results. MS. Ashmole 210 would disappear from the results, but MS. Canon. Ital. 157 would appear 7 times, once for each of its 7 parts, all of which are from 14th Century Italy. You could change the TEI to only have one history section in the msDesc for the entire thing when all parts are the same, but there are other examples which are more complicated. For example, MS. Lat. misc. b. 18 contains one part from 14th Century Italy but also another 54 parts, 3 from Italy but not the 14th Century, 4 French, 25 English, and 22 without an origPlace. 10 are English from the 15th Century so browsing for that would mean 10 records where there's currently only one for that manuscript alone. Overall, it could add hundreds more results. They're all relevant, and it might be preferable for some users, but for others it would be more work to find what they want.

There would also have to be some overlap in fulltext indexing between parts and the parent manuscripts, so that people could still search for, say, a manuscript they remember containing works X and Y, but in different parts. So keyword searching would also return more hits (in some cases many more) than currently.

Potentially these extra hits in search results could be minimized with field collapsing. That uses the same underlying Lucene/Solr feature that allows SOLO to group multiple editions of a book into one, with a link to "See all versions". But that can be tricky to set up. The search engine chooses which record to display as representative of the others based on relevance ranking, so sometimes it would be a part and sometimes the whole manuscript.
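To illustrate the representative-record behaviour, here is a small simulation of collapsing in plain Python. It does not call Solr; the `group` field, the scores and the record ids are invented for the example, and the `collapse` helper is hypothetical.

```python
def collapse(hits, key="group"):
    """Keep the highest-scoring hit per group and count the hits it hides."""
    best = {}
    counts = {}
    for hit in hits:
        group = hit[key]
        counts[group] = counts.get(group, 0) + 1
        if group not in best or hit["score"] > best[group]["score"]:
            best[group] = hit
    # Order the surviving representatives by relevance, highest first.
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return [
        {"representative": h["id"], "others": counts[h[key]] - 1}
        for h in ranked
    ]

# Invented hits: one composite manuscript with two parts, one plain manuscript.
hits = [
    {"id": "manuscript_A", "group": "manuscript_A", "score": 1.2},
    {"id": "manuscript_A-part1", "group": "manuscript_A", "score": 2.5},
    {"id": "manuscript_A-part2", "group": "manuscript_A", "score": 2.1},
    {"id": "manuscript_B", "group": "manuscript_B", "score": 1.9},
]

collapsed = collapse(hits)
print(collapsed)
```

Here the first group is represented by a part and the second by a whole manuscript, purely because of the scores, which is the inconsistency mentioned above.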

All the above would be quite a lot of development work, and I need to concentrate on writing up documentation, so I'll park this for a while. Meanwhile, when I get a chance, I'll set up the QA server to list parts in the search results, so you can see if that does enough to avoid confusion.

andrew-morrison commented 5 years ago

I've implemented listing of the origins of parts under each manuscript in search results on QA. So filtering on 14th Century and Italy still returns MS. Ashmole 210 but you can see that its only 14th Century part is English and its only Italian part is 17th Century.

Listing of parts is only done if the manuscript has no overall origin, head, or summary. The line for each part is built using the same logic as for whole manuscripts requested in #123. If multiple parts contain precisely the same work(s) and origin they'll only be listed once. There is a cut-off of 15 parts, and anything with more than that displays the first 10 then the number of further parts. But that is currently only the case for two manuscripts.
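The rules in the previous paragraph can be sketched roughly as follows. This is an illustration, not the code running on QA: the record structure and the `part_lines` helper are hypothetical, the line format is invented, and it assumes the cut-off is applied after identical parts have been merged.

```python
def part_lines(ms, cutoff=15, shown_when_over=10):
    """Build the per-part lines shown under a manuscript in search results."""
    # Parts are only listed when there is no overall origin, head or summary.
    if ms.get("origin") or ms.get("head") or ms.get("summary"):
        return []
    seen = set()
    lines = []
    for part in ms["parts"]:
        # Parts with exactly the same works and origin are listed once.
        key = (tuple(part["works"]), part["origin"])
        if key in seen:
            continue
        seen.add(key)
        lines.append(f"{', '.join(part['works'])}; {part['origin']}")
    # Above the cut-off, show the first ten and summarise the rest.
    if len(lines) > cutoff:
        extra = len(lines) - shown_when_over
        return lines[:shown_when_over] + [f"and {extra} further parts"]
    return lines

small = {"parts": [
    {"works": ["Work X"], "origin": "14th Century, English"},
    {"works": ["Work X"], "origin": "14th Century, English"},
    {"works": ["Work Y"], "origin": "17th Century, Italian"},
]}
large = {"parts": [
    {"works": [f"Work {n}"], "origin": "15th Century, English"}
    for n in range(20)
]}

print(part_lines(small))
print(part_lines(large)[-1])
```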

You can no longer search for "composite manuscript" as you can on production, but that was never a foolproof way to find multi-part manuscripts, because anything with a head or summary doesn't match.

@holfordm: Let me know if this looks good, and is a reasonable mitigation of this issue.

andrew-morrison commented 5 years ago

I've added "Multiple" options to the Language, Century and Origin facets, plus extended "Mixed" in Materials to include manuscripts with multiple distinct supportDesc/@material values in different parts (not just if there is an explicit "mixed" value), to make up for the loss of being able to search for "composite manuscript". It's on QA currently, but if you don't think it is useful I can remove it.
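As a sketch of how such facet values might be derived from the parts: the helper names are hypothetical, the material codes are only illustrative, and it is an assumption that the individual values stay in the facet alongside "Multiple" and "Mixed".

```python
def facet_values(part_values, multiple_label="Multiple"):
    """Return the distinct values, plus "Multiple" when parts disagree."""
    distinct = sorted({v for v in part_values if v})
    if len(distinct) > 1:
        return distinct + [multiple_label]
    return distinct

def materials_facet(part_materials):
    """Treat an explicit "mixed" value or differing part materials as Mixed."""
    distinct = {m for m in part_materials if m}
    if "mixed" in distinct or len(distinct) > 1:
        return sorted(distinct - {"mixed"}) + ["Mixed"]
    return sorted(distinct)

print(facet_values(["14th Century", "17th Century"]))
print(facet_values(["Latin", "Latin"]))
print(materials_facet(["perg", "chart"]))
```

Selecting "Multiple" in any of these facets would then give a rough replacement for the old "composite manuscript" search.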

holfordm commented 5 years ago

I've been looking at how other institutions have dealt with this. Biblissima's index entry for a Pseudo-Cicero text has some entries for "manuscripts" and some for "parts of manuscripts". http://beta.biblissima.fr/fr/ark:/43093/oedata6faf100c5a7ac93a73a7cd50662ef5e358ba368f

andrew-morrison commented 5 years ago

So that is their equivalent of this:

https://medieval.bodleian.ox.ac.uk/catalog/work_3977

We could do similar, something like this mockup:

[image: mockup_parts_in_work_pages]

That would be relatively easy, without requiring the major work to set up a separate index for parts needed to list them separately in search results, as discussed previously.