Closed andrew-morrison closed 6 years ago
Hi Andrew - I think the best thing to do is to work on standardising first of all the data, and then practice, rather than putting a lot of complication into the XSLT. I'm currently finishing a few things off but hope to get back onto Fihrist in a fortnight or so. In the meantime, if there's anything that's holding you up do flag it as urgent and I'll make time!
This issue:
1369 are like this: <origDate calendar="#Gregorian">Gregorian</origDate>
was created by this snippet from the conversion script (common-mss.xsl):
<xsl:template match="origin//date" priority="1000">
    <origDate>
        <xsl:choose>
            <xsl:when test=".[normalize-space(.)=''][@calendar]">
                <xsl:apply-templates select="@*|node()"/>
                <xsl:value-of select="@calendar"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:otherwise>
        </xsl:choose>
    </origDate>
</xsl:template>
So these were actually empty date elements from the template and can be deleted. Do you want to do this or shall I?
This issue:
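About 700 origDate elements contain easily identifiable Gregorian dates that could be converted, for example from
<origDate calendar="#Gregorian">19th century</origDate>
to
<origDate calendar="#Gregorian" notBefore="1800" notAfter="1900">19th century</origDate>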
Yes, that sounds great - are you happy to go ahead and do this?
I'll see what I can do in the way of fixing low-hanging fruit. I'll report back here before actually making any changes.
For this issue:
Some manuscripts genuinely have no known date of origin, but about 500 appear to contain years or centuries just not marked up in origDate elements
Would it be possible to send me the list of these mss, or the method you used to identify them? We can probably fix semi-manually
Probably the easiest method to find them, if you are using Oxygen, is to do a 'Find/Replace in Files' with the following options:
Click 'Find All' and it'll find all possible instances, and you can click on each to open the file and see the context. I find 608, but some are false matches (e.g. classmarks) and in other cases it is one manuscript with many parts, each with their own date.
Now going through some things with Yasmin. For this issue:
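In contrast, 2179 record the fact that they are undated in origDate elements such as <origDate calendar="#Gregorian">n.d.</origDate>. Maybe a note or a p would be better: <origin><p>Not dated.</p></origin>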
Yes, please go ahead and convert to using a
tag as suggested.
For the issue of multiple
Does that make sense? We may do something more sophisticated with multiple calendars in the fullness of time
Re: Your proposed rules for populating the "Century" facet:
origDate without a @calendar is a Gregorian date?
P.S. If you enclose attributes in backticks it avoids GitHub's auto-lookup of usernames.
Thanks Andrew, perfect - and thanks for the tip re: backticks!
I have developed a script to do a one-off batch conversion to fix several of the above date issues.
It could also go further, and translate Hijri-qamari dates into Gregorian using a simple formula. But I'd need to know what to do with them. According to the TEI documentation, the values of attributes like @when and @notBefore should be "a normalized representation of the date ... using the Gregorian calendar." So I could set up the script to convert examples like this...
<origDate calendar="#Hijri-qamari" when="1073">1073</origDate>
...to this...
<origDate calendar="#Hijri-qamari" notBefore="1662" notAfter="1663">1073</origDate>
...but there would be other examples which would require manual intervention.
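A minimal sketch of the kind of conversion described above, not the actual script. It assumes the tabular (arithmetic) Islamic calendar with the civil epoch, which can differ by a day or two from observation-based calendars, and it assumes that a `@when` equal to the element text holds an unconverted Hijri year. The function names are hypothetical, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET
from datetime import date


def hijri_year_start_ordinal(year):
    """Proleptic Gregorian ordinal of 1 Muharram of a tabular Islamic year.

    Assumption: 30-year arithmetic cycle, civil epoch (19 July 622 Gregorian).
    """
    return (year - 1) * 354 + (3 + 11 * year) // 30 + 227015


def hijri_year_to_gregorian_range(year):
    """Return the Gregorian years in which a Hijri year starts and ends."""
    first_day = date.fromordinal(hijri_year_start_ordinal(year))
    last_day = date.fromordinal(hijri_year_start_ordinal(year + 1) - 1)
    return first_day.year, last_day.year


def normalize_orig_date(elem):
    """Replace a Hijri year in @when with Gregorian @notBefore/@notAfter.

    Modifies the element in place and returns True if it was changed.
    """
    when = elem.get("when", "")
    if elem.get("calendar") != "#Hijri-qamari" or not when.isdecimal():
        return False
    # Assumption: @when identical to the text means it was never converted.
    if (elem.text or "").strip() != when:
        return False
    not_before, not_after = hijri_year_to_gregorian_range(int(when))
    del elem.attrib["when"]
    elem.set("notBefore", str(not_before))
    elem.set("notAfter", str(not_after))
    return True


elem = ET.fromstring('<origDate calendar="#Hijri-qamari" when="1073">1073</origDate>')
normalize_orig_date(elem)
print(ET.tostring(elem, encoding="unicode"))
```

For the example above this gives notBefore="1662" and notAfter="1663". Anything where the text is not a bare year (day and month given, ranges, centuries) falls through unchanged and would still need manual intervention.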
Fantastic! Just hold fire on the Hijri-qamari date conversion til I've had a chance to talk to Yasmin
As far as the web site is concerned, the important thing is consistency. The indexing scripts can be set up to handle all attributes (as far as possible) normalized to Gregorian, or to whatever the @calendar is, but not the mixture of styles at the moment.
Just to be clear - are we talking about instances where you have a date like this:
And it's the only date in the record? In that case, I think we would want the attribute converted to Gregorian so that the record has a presence in the date index/faceting.
Or are you also talking about including examples like this:
where there is a Gregorian date we can use? For these cases, I'd prob have to ask Yasmin in case there is a valid purpose for also having the H-Q date in machine readable form
Sorry - I don't think the last example came through as intended! The elements have disappeared from the snippets - I will try again in a minute
Hi Andrew, just sitting with Yasmin and she says YES - if you can go ahead and do the conversion to make the attributes on HQ dates Gregorian
OK. Will do. I'll let you know how many dates are changed, and upload to the QA site for you to see the results, hopefully later this week.
This will need to be communicated to the cataloguers, so they set the attributes to Gregorian dates in the future. I could probably do something with Schematron rules to help.
I've been working on this in between other things, and have developed a script which I plan to run overnight on Wednesday and upload the resulting modified TEI files on Thursday.
Attached is a spreadsheet detailing what it will do. It contains three tabs: Changed lists the origDate elements the script will modify (~9000); Unchanged lists those it will leave alone (~8000); and the third lists about 500 for which the script cannot determine whether they are good or bad, so it will add a comment into the TEI to flag them for manual checking in the future.
The important thing is that it will allow origDate elements with a calendar of "#Hijri-qamari" to be used in the Century facet in the manuscripts search on the web site. I estimate that will mean ~1900 manuscripts no longer categorized as "Gregorian Date Not Specified", leaving about 4500 genuinely undated.
FYI, there are date issues which the script cannot pick up. For example in this manuscript there is this origDate
...
<origDate calendar="#Hijri-qamari" when="1984">1363</origDate>
...which looks perfectly fine (an Islamic date in the text of the element, as identified by the @calendar of "#Hijri-qamari", with the @when normalized to a higher number, so presumably the Gregorian year). But 1363H does not correspond to 1984 CE. Instead, searching online, 1363 appears to be 1984 in one of a number of Iranian or Persian calendars. So either the @calendar is wrong, or the wrong conversion was used to calculate 1984 as the Gregorian date, and @when should be 1944. But there is no way to know which without going back to the sources used in its cataloguing.
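The arithmetic behind that example can be checked with a short sketch. This is illustrative only and not part of the script: it assumes the tabular Islamic calendar for the lunar case, and the usual offset for the solar Hijri (Iranian) calendar, whose year Y begins around the March equinox of Gregorian year Y + 621. The function names and return strings are hypothetical. It can flag a mismatch, but it cannot say which attribute is the wrong one.

```python
from datetime import date


def lunar_hijri_range(year):
    """Gregorian years spanned by a tabular (arithmetic) lunar Hijri year."""
    def start(y):
        # Proleptic Gregorian ordinal of 1 Muharram, civil epoch assumed.
        return (y - 1) * 354 + (3 + 11 * y) // 30 + 227015
    first_day = date.fromordinal(start(year))
    last_day = date.fromordinal(start(year + 1) - 1)
    return first_day.year, last_day.year


def solar_hijri_range(year):
    """Gregorian years spanned by a solar Hijri year (starts near 20-21 March)."""
    return year + 621, year + 622


def diagnose(text_year, when_year):
    """Guess which calendar makes the text year and the @when year agree."""
    low, high = lunar_hijri_range(text_year)
    if low <= when_year <= high:
        return "consistent with Hijri-qamari"
    low, high = solar_hijri_range(text_year)
    if low <= when_year <= high:
        return "consistent with a solar Hijri calendar"
    return "unexplained"


print(lunar_hijri_range(1363))
print(diagnose(1363, 1984))
print(diagnose(1363, 1944))
```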
I've run my date fixing script. It has changed the Century filter on manuscripts search from...
Gregorian Date Not Specified 6,347
17th Century 1,388
18th Century 1,256
19th Century 1,183
16th Century 901
15th Century 551
14th Century 334
13th Century 234
20th Century 173
12th Century 61
11th Century 30
9th Century 17
10th Century 11
8th Century 7
6th Century 1
7th Century 1
...to this...
Undated 4,278
19th Century 1,781
18th Century 1,739
17th Century 1,700
16th Century 1,118
15th Century 688
14th Century 439
13th Century 342
20th Century 255
12th Century 185
11th Century 128
Date not machine-readable 102
Date in unsupported calendar 83
10th Century 63
9th Century 48
8th Century 25
7th Century 19
6th Century 8
3rd Century 7
2nd Century 4
1st Century 2
Some of the earlier ones are probably wrong, but I think I've flagged most of them to help manual checking, by inserting XML comments in the TEI. I'm going to write some instructions, based on the discussion above, and the TEI documentation, at https://git.io/fihrist-dates
I can rename those three options for dates that cannot be included, or merge them. I could also append " CE" to the centuries, to make it clear that those are Gregorian centuries.
The fihrist-prd.bodleian.ox.ac.uk web site has been updated, but the modified TEI files haven't been committed to GitHub yet. I'll try to do so today, but may need to go home early because of the snow.
Following on from #15, there are currently 6347 manuscripts that cannot be included in the Century filter because the dating of their origins isn't clear. Of these:
no origDate elements at all
origDate but none with a @calendar of "#Gregorian"
origDate but none with a machine-readable year in the form of one or more of the following attributes: @when, @notBefore, @notAfter, @from, @to
Some of the more obvious issues are:
Some manuscripts genuinely have no known date of origin, but about 500 appear to contain years or centuries just not marked up in origDate elements
In contrast, 2179 record the fact that they are undated in origDate elements such as <origDate calendar="#Gregorian">n.d.</origDate>. Maybe a note or a p would be better: <origin><p>Not dated.</p></origin>
1369 are like this: <origDate calendar="#Gregorian">Gregorian</origDate> which I assume is the result of a previous batch conversion. There's also a similar number like this: <origDate calendar="#Hijri-qamari">Hijri-qamari</origDate>
About 700 origDate elements contain easily identifiable Gregorian dates that could be converted, for example from <origDate calendar="#Gregorian">19th century</origDate> to <origDate calendar="#Gregorian" notBefore="1800" notAfter="1900">19th century</origDate>
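The "19th century" conversion could be sketched roughly as below. This is an illustration under assumptions, not the script actually used: the pattern only matches a bare "Nth century" phrase, the function name is hypothetical, and the bounds follow the 1800 to 1900 convention of the example rather than 1801 to 1900.

```python
import re

# Matches text such as "19th century" or " 3rd Century " and nothing looser.
CENTURY_PATTERN = re.compile(
    r"^\s*(\d{1,2})(?:st|nd|rd|th)\s+century\s*$", re.IGNORECASE
)


def century_bounds(text):
    """Return (notBefore, notAfter) years for 'Nth century', else None."""
    match = CENTURY_PATTERN.match(text)
    if match is None:
        return None
    century = int(match.group(1))
    if century < 1:
        return None
    return (century - 1) * 100, century * 100


print(century_bounds("19th century"))
print(century_bounds("n.d."))
```

Anything the pattern does not recognise returns None, so values such as "n.d." or mixed-calendar text would be left for manual checking rather than guessed at.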
More tricky are examples like this:
Presumably 1682 is the year in the Gregorian calendar that corresponds to 19th Ramaḍān 1093 in the Hijri-qamari calendar. I could change the indexing scripts for Fihrist to regard all attributes as always in the Gregorian calendar, but that would preclude adding another filter for Islamic dates, and it is not consistently the case, for example:
I think separate elements for the same date in the different calendar systems is preferable.
In general, there is a much wider variety of styles employed in the origin section than in the Medieval catalogue. To get these to display nicely is going to require either some new Schematron rules to impose some standardization, or a lot of work on the XSLT. Currently that is set up to handle the style employed by Medieval which is, for the most part, to put all relevant origDate and origPlace fields as the first children of origin, then follow with a p containing further details if necessary. The XSLT inserts semi-colons into the HTML to separate dates and places.
@eifionjones: I'm not sure what the best approach is to the above, or how important it is (although dates usually are one of the most-used filters in any online catalogue.)