Works index/button - Githubissues

andrew-morrison commented 6 years ago

@eifionjones: You said in an email last week that, "We might implement indexes for works and places in time, but currently the data is not in a state to support them."

For works in particular, what needs to be done?

eifionjones commented 6 years ago

As far as I understood it we need identifiers for elements to be indexed? And we haven't done the work of marking up places or works with identifiers as we have with names and subjects. But if there's a way round this that would be great!

andrew-morrison commented 6 years ago

All msItem elements in the Fihrist TEI files have been given a unique @xml:id by a previous batch conversion (basically the classmark with -itemN appended.) So I could set it up to index every msItem as a work, and each would get a page that links to a single manuscript.

What that wouldn't do is aggregate copies/versions/interpretations of the same work in different manuscripts, such as this example in Medieval: http://medieval-qa.bodleian.ox.ac.uk/catalog/work_15496

To enable that, each title has been given a @key attribute (in the above example seven have been given the same value of 'work_15496') and the title displayed on that page is set by yet another authority file which Matthew Holford maintains.
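
As a rough illustration of the aggregation described above, the sketch below groups msItem IDs by the @key value on their titles. It assumes the titles are direct children of msItem in the TEI namespace; the function name, the sample documents and their IDs are hypothetical and are not taken from the Medieval or Fihrist data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def group_by_key(tei_documents):
    """Map each title @key to the xml:ids of the msItems that carry it."""
    works = defaultdict(list)
    for xml_text in tei_documents:
        root = ET.fromstring(xml_text)
        for item in root.iter(TEI + "msItem"):
            # Only direct child titles are considered (an assumption).
            for title in item.findall(TEI + "title"):
                key = title.get("key")
                if key:
                    works[key].append(item.get(XML_ID))
    return dict(works)


# Hypothetical sample records: two manuscripts holding the same work.
SAMPLE_A = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<msItem xml:id="MS_A-item1"><title key="work_15496">One form</title></msItem>'
    "</TEI>"
)
SAMPLE_B = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<msItem xml:id="MS_B-item2"><title key="work_15496">Another form</title></msItem>'
    "</TEI>"
)

grouped = group_by_key([SAMPLE_A, SAMPLE_B])
print(grouped)
```

Each key in the result would correspond to one page in the index, listing every manuscript item that shares it.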

But maybe works in Fihrist differ in nature?

I could create a spreadsheet of all your msItem elements and their titles, if it would help you review?

eifionjones commented 6 years ago

Again, Yasmin says YES - please go ahead and create the works index on that basis. It will give us the ability to see the data and verify what the next steps would be in creating a deduplicated index like Matthew's.

andrew-morrison commented 6 years ago

OK, I'll do that.

The one thing to note is that if Fihrist launches with a works index based on those auto-generated IDs, and then later switches to a controlled version with manually-assigned IDs, the URLs of the pages for each work will change. Probably not a big deal - they aren't the sort of pages that people are likely to bookmark or cite.

andrew-morrison commented 6 years ago

I have applied a script I've developed for our Hebrew and Genizah collections to build a works authority file and, from that, an index which is now on the Fihrist QA site. What I learned from that (much smaller) catalogue is that, for works, some of what I said above isn't the case:

Some work titles are common, and those can be merged into one entry in the index, which links to multiple manuscripts. I've tried various methods to increase the amount of deduplication that can be done by simple pattern matching. Some are probably uncontroversial (e.g. Anthology of poetry links to three manuscripts, one containing an msItem with precisely that title, one with "An anthology of poetry" and the other with "An anthology of poetry.") Others you might prefer to keep as separate entries in the index (e.g. Commentary on the Qurʼān includes "Fragment of a commentary on the Qurʼān"). I've attempted no transliteration, but could do, if a suitable mapping table could be found.
Because, unlike people and places, we don't turn work titles into hyperlinks, adding @key attributes to the source TEI isn't necessary for works, which would make the authority file much easier to maintain.
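
To make the "simple pattern matching" idea concrete, here is a minimal sketch of one possible normalisation: case-folding, dropping a trailing full stop and dropping a leading English article. It is not the script actually used for Fihrist, the function names and manuscript labels are hypothetical, and it deliberately does not merge the "Fragment of a commentary..." case mentioned above.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Reduce a title to a comparison key (illustrative rules only)."""
    text = title.strip().casefold()
    text = re.sub(r"[.\s]+$", "", text)        # trailing full stops and spaces
    text = re.sub(r"^(an?|the)\s+", "", text)  # leading English article
    text = re.sub(r"\s+", " ", text)           # collapse internal whitespace
    return text


def merge_titles(entries):
    """Group (title, manuscript) pairs by their normalised title."""
    index = defaultdict(set)
    for title, manuscript in entries:
        index[normalise_title(title)].add(manuscript)
    return index


# Hypothetical manuscript labels; the titles are the ones quoted above.
entries = [
    ("Anthology of poetry", "MS 1"),
    ("An anthology of poetry", "MS 2"),
    ("An anthology of poetry.", "MS 3"),
    ("Commentary on the Qurʼān", "MS 4"),
    ("Fragment of a commentary on the Qurʼān", "MS 5"),
]
index = merge_titles(entries)
for key, manuscripts in index.items():
    print(key, sorted(manuscripts))
```

With rules this conservative, the three anthology titles collapse into one entry while the fragment stays separate, so any broader merging would need additional rules or manual curation.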

andrew-morrison commented 6 years ago

While trying to implement #27, I have found some ridiculous numbers. For example, the Wellcome Trust has over 5000 "works" (defined as an msItem with an ID and a title), the most of any collection, despite having only 78 manuscripts.

So I'm going to try excluding any msItem from the works index if it has an @n attribute, indicating it is part of a numbered set, or an entry in a table of contents, if it is inside another msItem that has its own title. That should remove a lot of the "Chapter 1" (or "Capitulo primero") entries, individual poems in collections of poetry, and what look like fragments in the Wellcome Trust. That should reduce the work count from 17,342 to around 11,000.
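
A filter along these lines could be sketched as follows. This is not the actual indexing script: it assumes the two conditions are alternatives (either one excludes the item), that "its own title" means a title that is a direct child of the enclosing msItem, and the function name and sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def indexable_items(root):
    """Return msItems that pass the exclusion rule, in document order."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    kept = []
    for item in root.iter(TEI + "msItem"):
        if item.get("n") is not None:
            continue  # numbered item: excluded
        ancestor = parents.get(item)
        nested = False
        while ancestor is not None:
            is_titled_item = (
                ancestor.tag == TEI + "msItem"
                and ancestor.find(TEI + "title") is not None
            )
            if is_titled_item:
                nested = True
                break
            ancestor = parents.get(ancestor)
        if not nested:
            kept.append(item)
    return kept


# Hypothetical sample record.
SAMPLE = """<msContents xmlns="http://www.tei-c.org/ns/1.0">
  <msItem xml:id="MS_X-item1"><title>Collected poems</title>
    <msItem xml:id="MS_X-item2"><title>First poem</title></msItem>
  </msItem>
  <msItem xml:id="MS_X-item3" n="1"><title>Chapter 1</title></msItem>
  <msItem xml:id="MS_X-item4"><title>A treatise</title></msItem>
</msContents>"""

kept_ids = [item.get(XML_ID) for item in indexable_items(ET.fromstring(SAMPLE))]
print(kept_ids)
```

In the sample, the nested poem and the numbered chapter are dropped, while the collection itself and the standalone treatise remain.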

eifionjones commented 6 years ago

Sounds good!

andrew-morrison commented 6 years ago

Just a note of a few msItems which contain nothing but "alt" type titles:

Or_208-item9 in cambridge university/Or_208.xml
Z_10-item1 in jesus college (cambridge)/Z_10.xml
CODRINGTON.READE_NO_80_BOX_37-item1 in CODRINGTON.READE_NO_80_BOX_37.xml

Possibly this means neither title is the "main" one. Or maybe a Schematron rule should be created for this. Anyway, the first in document order will be displayed in the works index.
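
For illustration, the fallback could look something like the sketch below. The preference for a non-"alt" title when one exists is an assumption, not something confirmed in this thread; only the "first in document order" fallback is. The function name and sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def display_title(item):
    """Pick the title to show for an msItem in the works index."""
    titles = item.findall(TEI + "title")
    for title in titles:
        if title.get("type") != "alt":
            # Assumed preference: a title not marked as "alt" wins.
            return "".join(title.itertext()).strip()
    if titles:
        # Only "alt" titles present: fall back to the first in document order.
        return "".join(titles[0].itertext()).strip()
    return None


# Hypothetical sample record.
SAMPLE = """<msContents xmlns="http://www.tei-c.org/ns/1.0">
  <msItem xml:id="MS_Y-item1">
    <title type="alt">Second form</title>
    <title>Main form</title>
  </msItem>
  <msItem xml:id="MS_Y-item2">
    <title type="alt">First alt</title>
    <title type="alt">Second alt</title>
  </msItem>
</msContents>"""

items = ET.fromstring(SAMPLE).findall(TEI + "msItem")
chosen = [display_title(item) for item in items]
print(chosen)
```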

holfordm commented 6 years ago

"Because, unlike people and places, we don't turn work titles into hyperlinks, adding @key attributes to the source TEI isn't necessary for works, which would make the authority file much easier to maintain." https://github.com/bodleian/fihrist-mss/issues/21#issuecomment-363177680

Is the fact that work titles don't hyperlink a considered judgement, or just something we haven't gotten around to yet? (Or both, depending on the catalogue?)

andrew-morrison commented 6 years ago

I had assumed it was a design decision, but perhaps not.

I could easily write some XSL to apply @key attributes to the works in the TEI files. Then work titles in manuscript description pages could link back to lists of the same work in other manuscripts. That would be useful for, say, the Qurʼān, but 83% of work pages contain only one link, which would go back to the manuscript you just came from. In Medieval it is 77%.

If this feature is desired, I'd still recommend leaving it, in the case of Fihrist, until some time has been found for some curation work on the authority file.

@holfordm: If you want it in Medieval, where it would just be a stylesheet change, raise it as an issue.

andrew-morrison commented 6 years ago

I think I've done everything I can to build an authority file for whole works and their variant title forms. I might have been able to do more deduplication, but I think a lot of the title elements with a type attribute of "alt" are actually subheadings or something else entirely (e.g. "الكافية" which, if Google Translate is to be believed, means "Adequate" or "Sufficient").

The result is that on the QA site there are now 10,912 entries in the works index. That is actually fewer than the number of manuscripts, 11,842. But that's not a mistake. If you were to click on each work, and count the number of links to manuscripts containing instances of those works, the total number would be 17,170. That gives a more reasonable-sounding average of 1.45 works per manuscript.
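
The arithmetic behind that average can be checked in a few lines, using only the three counts quoted above (the variable names are just labels for this sketch):

```python
index_entries = 10_912   # entries in the works index
manuscripts = 11_842     # manuscripts in the catalogue
work_links = 17_170      # links from work pages to manuscripts

works_per_manuscript = work_links / manuscripts
links_per_entry = work_links / index_entries

print(f"{works_per_manuscript:.2f} work instances per manuscript")
print(f"{links_per_entry:.2f} manuscript links per index entry")
```

The second figure is not stated in the thread; it simply follows from dividing the same link total by the number of index entries.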

The most repeated work is, as you'd expect, the Qurʼān (http://fihrist-qa.bodleian.ox.ac.uk/catalog/work_112) which is in 293 manuscripts under lots of different titles (e.g. "al-Qurʼān", "Kuran", "Quran", "Qurʼānic fragments", "Extracts from the Koran employed in administering oaths".)

eifionjones commented 6 years ago

Great, many thanks for your work on this Andrew!

andrew-morrison commented 6 years ago

I'm closing this issue as the basic works index is up and working, and the scripts and authority files are in place to be able to re-index as and when the data is improved. Please raise a new issue for any change requests.

fihristorg / fihrist-mss

Works index/button #21