fihristorg / fihrist-mss

Fihrist TEI Catalogue

Date Issues #17

Closed andrew-morrison closed 6 years ago

andrew-morrison commented 6 years ago

Following on from #15, there are currently 6347 manuscripts that cannot be included in the Century filter because the dating of their origins isn't clear. Of these:

Some of the more obvious issues are:

More tricky are examples like this:

<origin>
    <origDate calendar="#Hijri-qamari" when="1682">Dated 19th Ramaḍān 1093</origDate>
</origin>

Presumably 1682 is the year in the Gregorian calendar that corresponds to 19th Ramaḍān 1093 in the Hijri-qamari calendar. I could change the indexing scripts for Fihrist to regard all attributes as always in the Gregorian calendar, but that would preclude adding another filter for Islamic dates, and it is not consistently the case, for example:

<origin>
    <origDate calendar="#Coptic-EoM">1071</origDate>
    <origDate calendar="#Hijri-qamari" when="0765">765</origDate>
    <origDate calendar="#Gregorian" when="1355">1355</origDate>
</origin>

I think separate elements for the same date in the different calendar systems are preferable.

In general, there is a much wider variety of styles employed in the origin section than in the Medieval catalogue. Getting these to display nicely is going to require either some new Schematron rules to impose some standardization, or a lot of work on the XSLT. Currently the XSLT is set up to handle the style employed by Medieval, which is, for the most part, to put all relevant origDate and origPlace fields as the first children of origin, then follow with a p containing further details if necessary. The XSLT inserts semi-colons into the HTML to separate dates and places.

@eifionjones: I'm not sure what the best approach to the above is, or how important it is (although dates are usually one of the most-used filters in any online catalogue).

eifionjones commented 6 years ago

Hi Andrew - I think the best thing to do is to work on standardising first of all the data, and then practice, rather than putting a lot of complication into the XSLT. I'm currently finishing a few things off but hope to get back onto Fihrist in a fortnight or so. In the meantime, if there's anything that's holding you up do flag it as urgent and I'll make time!

eifionjones commented 6 years ago

This issue:

1369 are like this: Gregorian which I assume is the result of a previous batch conversion. There's also a similar number like this: Hijri-qamari

was created by this snippet from the conversion script (common-mss.xsl):

<xsl:template match="origin//date" priority="1000">
    <origDate>
        <xsl:choose>
            <xsl:when test=".[normalize-space(.)=''][@calendar]">
                <xsl:apply-templates select="@*|node()"/>
                <xsl:value-of select="@calendar"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:otherwise>
        </xsl:choose>
    </origDate>
</xsl:template>

So these were actually empty date elements from the template and can be deleted. Do you want to do this or shall I?
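A minimal sketch of how such placeholders could be located and removed in bulk, assuming a placeholder is an origDate with no child elements whose only text is its own @calendar value (with or without the leading "#"). The function names are hypothetical, and the sample omits the TEI namespace for brevity; real files would need namespace-qualified tag names.

```python
import xml.etree.ElementTree as ET


def is_placeholder(elem):
    """True if an origDate has no children and its text just repeats its @calendar."""
    calendar = elem.get("calendar")
    if calendar is None or len(elem) > 0:
        return False
    text = (elem.text or "").strip()
    # Assumption: compare while ignoring the leading "#" of the pointer syntax.
    return text != "" and text.lstrip("#") == calendar.lstrip("#")


def remove_placeholders(root):
    """Remove placeholder origDate elements anywhere under root; return the count."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "origDate" and is_placeholder(child):
                parent.remove(child)
                removed += 1
    return removed


sample = ET.fromstring(
    "<origin>"
    '<origDate calendar="#Gregorian">Gregorian</origDate>'
    '<origDate calendar="#Hijri-qamari" when="1682">Dated 1093</origDate>'
    "</origin>"
)
print(remove_placeholders(sample))
```

Reporting the count first, before writing any files back, would make it easy to compare against the figures quoted above.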

eifionjones commented 6 years ago

This issue:

About 700 origDate elements contain easily identifiable Gregorian dates that could be converted, for example from 19th century to 19th century

Yes, that sounds great - are you happy to go ahead and do this?

andrew-morrison commented 6 years ago

I'll see what I can do in the way of fixing low-hanging fruit. I'll report back here before actually making any changes.

eifionjones commented 6 years ago

For this issue:

Some manuscripts genuinely have no known date of origin, but about 500 appear to contain years or centuries just not marked up in origDate elements

Would it be possible to send me the list of these mss, or the method you used to identify them? We can probably fix semi-manually

andrew-morrison commented 6 years ago

Probably the easiest method to find them, if you are using Oxygen, is to do a 'Find/Replace in Files' with the following options:

Click 'Find All' and it'll find all possible instances, and you can click on each to open the file and see the context. I find 608, but some are false matches (e.g. classmarks) and in other cases it is one manuscript with many parts, each with their own date.

eifionjones commented 6 years ago

Now going through some things with Yasmin. For this issue:

In contrast, 2179 record the fact that they are undated in origDate elements such as n.d.. Maybe a note or a p would be better:

Not dated.

Yes, please go ahead and convert to using a <p> tag as suggested.

eifionjones commented 6 years ago

For the issue of multiple origDate elements in one origin element, could we use the following logic:

  1. If there is only one origDate element, use that, regardless of entry in @calendar
  2. If there are multiple origDate elements, and there is one with @calendar="Gregorian", use that one
  3. If there are multiple origDate elements, and none have a @calendar="Gregorian", exclude from date faceting for now

Does that make sense? We may do something more sophisticated with multiple calendars in the fullness of time
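The three rules above can be sketched as a small selection function. This is only an illustration, not the Fihrist indexing code: the function name is hypothetical, the TEI namespace is omitted, and it assumes a calendar value may be written either as "Gregorian" or "#Gregorian".

```python
import xml.etree.ElementTree as ET


def pick_orig_date(origin):
    """Return the origDate to use for date faceting, or None to exclude the record."""
    dates = origin.findall("origDate")
    # Rule 1: a single origDate is used whatever its @calendar says.
    if len(dates) == 1:
        return dates[0]
    # Rule 2: with several, prefer the one marked as Gregorian.
    for d in dates:
        if d.get("calendar", "").lstrip("#") == "Gregorian":
            return d
    # Rule 3: several dates and none Gregorian (or no dates at all): exclude.
    return None


origin = ET.fromstring(
    "<origin>"
    '<origDate calendar="#Coptic-EoM">1071</origDate>'
    '<origDate calendar="#Hijri-qamari" when="0765">765</origDate>'
    '<origDate calendar="#Gregorian" when="1355">1355</origDate>'
    "</origin>"
)
chosen = pick_orig_date(origin)
print(chosen.get("when") if chosen is not None else "excluded")
```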

andrew-morrison commented 6 years ago

Re: Your proposed rules for populating the "Century" facet:

  1. So assume any single origDate without a @calendar is a Gregorian date?
  2. Already the case.
  3. Already the case (except it includes them in "Gregorian Date Not Specified" which I only set up to show you the scale of the problem, we'll remove that before launch.)

P.S. If you enclose attributes in backticks it avoids GitHub's auto-lookup of usernames.

eifionjones commented 6 years ago

Thanks Andrew, perfect - and thanks for the tip re: backticks!

andrew-morrison commented 6 years ago

I have developed a script to do a one-off batch conversion to fix several of the above date issues.

It could also go further, and translate Hijri-qamari dates into Gregorian using a simple formula. But I'd need to know what to do with them. According to the TEI documentation, the values of attributes like @when and @notBefore should be, "a normalized representation of the date ... using the Gregorian calendar." So I could set up the script to convert examples like this...

<origDate calendar="#Hijri-qamari" when="1073">1073</origDate>

...to this...

<origDate calendar="#Hijri-qamari" notBefore="1662" notAfter="1663">1073</origDate>

...but there would be other examples which would require manual intervention.
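For illustration, here is a minimal sketch of one way to derive such a range. The formula actually used by the script is not given in this thread; this version assumes the arithmetic (tabular) Islamic calendar, which can differ from observed dates by a day or two, and Python's proleptic Gregorian calendar. The function names are hypothetical.

```python
from datetime import date
from math import ceil

ISLAMIC_EPOCH_JD = 1948439.5    # Julian Day of 1 Muharram 1 AH (civil epoch)
JD_OF_ORDINAL_ZERO = 1721424.5  # Julian Day corresponding to date ordinal 0


def hijri_to_gregorian(year, month=1, day=1):
    """Convert a tabular Islamic date to a proleptic Gregorian date."""
    jd = (day + ceil(29.5 * (month - 1)) + (year - 1) * 354
          + (3 + 11 * year) // 30 + ISLAMIC_EPOCH_JD - 1)
    return date.fromordinal(int(jd - JD_OF_ORDINAL_ZERO))


def hijri_year_range(year):
    """Return (notBefore, notAfter) Gregorian years spanned by a Hijri year."""
    first = hijri_to_gregorian(year)
    # The last day of the year is the day before 1 Muharram of the next year.
    last = date.fromordinal(hijri_to_gregorian(year + 1).toordinal() - 1)
    return first.year, last.year


print(hijri_year_range(1073))  # (1662, 1663)
```

Where the text gives a full date rather than just a year, the same conversion yields a single Gregorian year, so a @when could be kept instead of a range.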

eifionjones commented 6 years ago

Fantastic! Just hold fire on the Hijri-qamari date conversion til I've had a chance to talk to Yasmin

andrew-morrison commented 6 years ago

As far as the web site is concerned, the important thing is consistency. The indexing scripts can be set up to handle all attributes (as far as possible) normalized to Gregorian, or to whatever the @calendar is, but not the mixture of styles at the moment.

eifionjones commented 6 years ago

Just to be clear - are we talking about instances where you have a date like this:

1073

And it's the only date in the record? In that case, I think we would want the attribute converted to Gregorian so that the record has a presence in the date index/faceting.

Or are you also talking about including examples like this:

1071 765 1355

where there is a Gregorian date we can use? For these cases, I'd prob have to ask Yasmin in case there is a valid purpose for also having the H-Q date in machine readable form

eifionjones commented 6 years ago

Sorry - I don't think the last example came through as intended! The elements have disappeared from the snippets - I will try again in a minute

eifionjones commented 6 years ago

Hi Andrew, just sitting with Yasmin and she says YES - if you can go ahead and do the conversion to make the attributes on HQ dates Gregorian

andrew-morrison commented 6 years ago

OK. Will do. I'll let you know how many dates are changed, and upload to the QA site for you to see the results, hopefully later this week.

This will need to be communicated to the cataloguers, so they set the attributes to Gregorian dates in the future. I could probably do something with Schematron rules to help.

andrew-morrison commented 6 years ago

I've been working on this in between other things, and have developed a script which I plan to run overnight on Wednesday and upload the resulting modified TEI files on Thursday.

Attached is a spreadsheet detailing what it will do. It contains three tabs: Changed lists the origDate elements the script will modify (~9000), Unchanged lists those it will leave alone (~8000), and the third lists 500 that the script cannot determine to be good or bad, so it will add a comment into the TEI to flag them for manual checking in the future.

The important thing is that it will allow origDate elements with a calendar of "#Hijri-qamari" to be used in the Century facet in the manuscripts search on the web site. I estimate that will mean ~1900 manuscripts are no longer categorized as "Gregorian Date Not Specified", leaving about 4500 genuinely undated.

andrew-morrison commented 6 years ago

FYI, there are date issues which the script cannot pick up. For example in this manuscript there is this origDate...

<origDate calendar="#Hijri-qamari" when="1984">1363</origDate>

...which looks perfectly fine (an Islamic date in the text of the element, as identified by the @calendar of "#Hijri-qamari", with the @when normalized to a higher number, so presumably the Gregorian year). But 1363 AH does not correspond to 1984 CE. Instead, searching online, 1363 appears to be 1984 in one of a number of Iranian or Persian calendars. So either the @calendar is wrong, or the wrong conversion was used to calculate 1984 as the Gregorian date, and @when should be 1944. But there is no way to know which without going back to the sources used in its cataloguing.

andrew-morrison commented 6 years ago

I've run my date fixing script. It has changed the Century filter on manuscripts search from...

Gregorian Date Not Specified    6,347
17th Century                    1,388
18th Century                    1,256
19th Century                    1,183
16th Century                    901
15th Century                    551
14th Century                    334
13th Century                    234
20th Century                    173
12th Century                    61
11th Century                    30
9th Century                     17
10th Century                    11
8th Century                     7
6th Century                     1
7th Century                     1

...to this...

Undated                         4,278
19th Century                    1,781
18th Century                    1,739
17th Century                    1,700
16th Century                    1,118
15th Century                    688
14th Century                    439
13th Century                    342
20th Century                    255
12th Century                    185
11th Century                    128
Date not machine-readable       102
Date in unsupported calendar    83
10th Century                    63
9th Century                     48
8th Century                     25
7th Century                     19
6th Century                     8
3rd Century                     7
2nd Century                     4
1st Century                     2

Some of the earlier ones are probably wrong, but I think I've flagged most of them to help manual checking, by inserting XML comments in the TEI. I'm going to write some instructions, based on the discussion above, and the TEI documentation, at https://git.io/fihrist-dates

I can rename those three options for dates that cannot be included, or merge them. I could also append " CE" to the centuries, to make it clear that those are Gregorian centuries.
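As a rough sketch of what the labelling with an optional " CE" suffix could look like, assuming the facet is derived from a single normalized Gregorian year and that centuries are counted as 1601-1700 = 17th (the site's actual boundary convention is not stated here). The function names are hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def century_label(year, era=""):
    """Map a Gregorian year to a facet label such as '17th Century'."""
    # Assumption: years 1601-1700 belong to the 17th century.
    century = (year - 1) // 100 + 1
    label = f"{ordinal(century)} Century"
    return f"{label} {era}" if era else label


print(century_label(1682))        # 17th Century
print(century_label(1355, "CE"))  # 14th Century CE
```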

The fihrist-prd.bodleian.ox.ac.uk web site has been updated, but the modified TEI files haven't been committed to GitHub yet. I'll try to do so today, but may need to go home early because of the snow.