WaxCylinderRevival / frus-dates-project

Project repository for FRUS date extraction and normalization initiative
https://history.state.gov
GNU General Public License v3.0

Add default dateTime values for non-document sections of FRUS #1127

Closed WaxCylinderRevival closed 6 years ago

WaxCylinderRevival commented 6 years ago

To deliver more useful search results, we will add @frus:doc-dateTime-min and @frus:doc-dateTime-max to divs previously considered non-datable.

These datable divs include:

Proposed Steps:

1. Update volume coverage dates

1.a. [x] Run .xq query to find document dates that fall outside the current volume dates in the bibliography

1.b. [x] Update coverage values in bibliography

1.c. [x] Add @type="publication-date" to teiHeader/publicationStmt/date

1.d. Long-term TODO: Incorporate bibliography information into teiHeader or other appropriate place for volume

2. Add volume coverage dates to appropriate divs

2.a. Transform volume coverage/@from to div/@frus:doc-dateTime-min and coverage/@to to div/@frus:doc-dateTime-max and add div/@ana="#date_undated-temporarily-inferred-from-volume-rules" for the following:

Front Matter

Back Matter

2.b. Transform the earliest @frus:doc-dateTime-min of descendant div[@subtype="historical-document"] to div/@frus:doc-dateTime-min and the latest @frus:doc-dateTime-max of descendant div[@subtype="historical-document"] to div/@frus:doc-dateTime-max, and add div/@ana="#date_undated-temporarily-inferred-from-volume-rules" for the following:

Body

2.c. Identify document dates, transform them to @frus:doc-dateTime-min and @frus:doc-dateTime-max, and add div/@ana="#date_undated-temporarily-inferred-from-volume-rules" for the following:
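
Steps 2.a and 2.b could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual transform: the frus namespace URI, the shape of the `coverage` element, the sample divs, and the `local:` helper names are all hypothetical.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";
(: Assumption: the namespace URI bound to the frus: prefix :)
declare namespace frus = "http://history.state.gov/frus/ns/1.0";

(: Marker value taken from step 2.a :)
declare variable $ana := "#date_undated-temporarily-inferred-from-volume-rules";

(: Hypothetical helper: rebuild a div with min/max dates and the @ana marker,
   dropping any existing copies of those attributes to avoid duplicates :)
declare function local:with-dates(
    $div as element(tei:div),
    $min as xs:dateTime?,
    $max as xs:dateTime?
) as element(tei:div) {
    element { node-name($div) } {
        $div/@* except $div/(@frus:doc-dateTime-min | @frus:doc-dateTime-max | @ana),
        attribute frus:doc-dateTime-min { $min },
        attribute frus:doc-dateTime-max { $max },
        attribute ana { $ana },
        $div/node()
    }
};

(: Step 2.b: earliest min and latest max among descendant historical documents.
   The values are cast to xs:dateTime so min() and max() compare them as dates. :)
declare function local:from-documents($div as element(tei:div)) as element(tei:div) {
    let $docs := $div//tei:div[@subtype eq "historical-document"]
    return
        local:with-dates(
            $div,
            min($docs/@frus:doc-dateTime-min ! xs:dateTime(.)),
            max($docs/@frus:doc-dateTime-max ! xs:dateTime(.))
        )
};

(: Hypothetical sample data standing in for a bibliography entry and a volume :)
let $coverage := <coverage from="1969-01-20T00:00:00-05:00" to="1972-12-31T23:59:59-05:00"/>
let $front := <div xmlns="http://www.tei-c.org/ns/1.0" type="section" xml:id="preface"/>
let $body :=
    <div xmlns="http://www.tei-c.org/ns/1.0" type="compilation" xml:id="comp1">
        <div type="document" subtype="historical-document" xml:id="d1"
            frus:doc-dateTime-min="1969-02-01T00:00:00-05:00"
            frus:doc-dateTime-max="1969-02-01T23:59:59-05:00"/>
        <div type="document" subtype="historical-document" xml:id="d2"
            frus:doc-dateTime-min="1970-06-15T00:00:00-05:00"
            frus:doc-dateTime-max="1970-06-16T23:59:59-05:00"/>
    </div>
return (
    (: Step 2.a: front or back matter takes the volume coverage :)
    local:with-dates($front, xs:dateTime($coverage/@from), xs:dateTime($coverage/@to)),
    (: Step 2.b: body divs take the span of their descendant documents :)
    local:from-documents($body)
)
```

If a div has no descendant historical documents, `min()` and `max()` return the empty sequence and the sketch writes empty attribute values, so a real transform would probably need to skip or flag that case.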

WaxCylinderRevival commented 6 years ago

@joewiz, please read above for the start of a proposal for adding default @frus:doc-dateTime-min and @frus:doc-dateTime-max values for non-document sections of FRUS.

Corrections, suggestions, and questions welcome.

I will list other div[@type="section"]/@xml:id values in the comments below for discussion.

WaxCylinderRevival commented 6 years ago

Sampling of additional div[@type="section"]/@xml:id found:

joewiz commented 6 years ago

This is great! Thanks for putting it together. Just a couple of thoughts:

  1. You proposed using div/@ana="#date_undated-temporarily-inferred-from-volume-rules" - but do you think "temporarily-" is the right term here? It implies that we'll be doing a future review of the dates applied to these non-document divs. Is that what you intend?

  2. In the search interface, how should we indicate the date judgements when these appear as results? Something like: "Date range [or "Inferred date range"?]: {date-min}-{date-max}" and "Date methodology": "Inferred using volume rules."

  3. I hadn't thought of using @xml:id as a clue for determining which date rules to apply, but this is a good idea if we can come up with some general rules. We have some frequently used @xml:id values for common section headings, but by convention we allow variation for sections that aren't common and/or need unique IDs. Here is a query that produces a comprehensive list, for our reference (must... resist... urge... to normalize...):

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

array { 
    collection("/db/apps/frus/volumes")//tei:div
        [@type eq "section"]
        /@xml:id 
        => distinct-values() 
        => sort() 
}

The results:

[
    "AbouttheSeries",
    "Contents",
    "Index",
    "Notes",
    "Preface",
    "Published",
    "Shorttitles",
    "Summary",
    "Unpublished",
    "Volumes",
    "about",
    "about-this-preview-edition",
    "aboutseries",
    "abouttheseries",
    "abtseries",
    "acknowledge",
    "actionssatement",
    "actionsstatement",
    "actionstatement",
    "address-of-the-president",
    "annual",
    "app1map1",
    "app1map2",
    "app1map3",
    "app1map4",
    "app2map1",
    "app2map2",
    "app2map3",
    "app2map4",
    "appendix",
    "appendix-1",
    "appendix-10",
    "appendix-11",
    "appendix-12",
    "appendix-2",
    "appendix-3",
    "appendix-4",
    "appendix-5",
    "appendix-6",
    "appendix-7",
    "appendix-8",
    "appendix-9",
    "appendix1",
    "appendix2",
    "appendix_a",
    "appendix_b",
    "charts",
    "circulars",
    "citations",
    "correspondence-1",
    "correspondence-2",
    "covert",
    "delegation",
    "delegations",
    "directory",
    "documents",
    "editorial",
    "errata",
    "front-matter",
    "guide",
    "historian",
    "illustrations",
    "index",
    "index-persons",
    "index-subjects",
    "intro",
    "intro1",
    "intro2",
    "intro3",
    "intro4",
    "intro5",
    "intro6",
    "intro7",
    "intro8",
    "introduction",
    "introductory",
    "list-of-illustrations",
    "map",
    "map-panama",
    "maps",
    "message-of-the-president",
    "message-of-the-president-1",
    "message-of-the-president-2",
    "messages-of-the-president",
    "messages-of-the-president-1",
    "messages-of-the-president-2",
    "note",
    "notes",
    "papers",
    "papers-countries",
    "papers-topics",
    "persons",
    "persons-mentioned",
    "photographs",
    "photographs-toc",
    "photos",
    "preface",
    "prefatory-note",
    "pressrelease",
    "san-Francisco-earthquake",
    "sec-10thPlenary-Oct2",
    "sec-11thMeeting-Oct2",
    "sec-12thPlenary-Oct2",
    "sec-13thPlenary-Oct2",
    "sec-14thMeeting-Oct3",
    "sec-1stMeeting-July11",
    "sec-1stMeeting-July12",
    "sec-1stMeeting-Oct20",
    "sec-1stMeeting-Sept28",
    "sec-1stPlenary-Dec4",
    "sec-1stPlenary-Sept28",
    "sec-1stRestTripartite-Dec4",
    "sec-1stTripartite-Dec4",
    "sec-1stTripartite-July10",
    "sec-2ndMeeting-July14",
    "sec-2ndMeeting-Oct1",
    "sec-2ndMeeting-Oct21",
    "sec-2ndMeeting2-Oct21",
    "sec-2ndPlenary-Dec5",
    "sec-2ndPlenary-Sept28",
    "sec-2ndRestrictedTripartite-Dec7",
    "sec-2ndTripartite-Dec5",
    "sec-2ndTripartite-July11",
    "sec-3rdMeeting-Oct1",
    "sec-3rdMeeting-Oct22",
    "sec-3rdPlenary-Dec6",
    "sec-3rdPlenary-Sept29",
    "sec-3rdTripartite-Dec6",
    "sec-3rdTripartite-July13",
    "sec-4thMeeting-Oct2",
    "sec-4thPlenary-Dec7",
    "sec-4thPlenary-Sept29",
    "sec-4thTripartite-Dec6",
    "sec-4thTripartite-July13",
    "sec-5thMeeting-Oct3",
    "sec-5thPlenary-Sept30",
    "sec-5thTripartite-Dec7",
    "sec-5thTripartite-July14",
    "sec-5thTripartite2-Dec7",
    "sec-6thPlenary-Dec7",
    "sec-6thPlenary-Sept30",
    "sec-7thPlenary-Oct1",
    "sec-8thPlenary-Oct1",
    "sec-9thPlenary-Oct2",
    "sec-BermudaConf-Dec4-8",
    "sec-DE-Sept29",
    "sec-DEM-Oct3",
    "sec-DEMMeeting-Oct23",
    "sec-DEMeeting-Dec6-7",
    "sec-DEMeeting-Oct20",
    "sec-DEMeeting-Oct21",
    "sec-DEMeeting-Oct23",
    "sec-DM-Oct3",
    "sec-DMFMeeting-Sept29",
    "sec-DMMeeting-Oct20",
    "sec-DMMeeting-Oct22",
    "sec-DTMeeting-Sept30",
    "sec-EBMeeting-Dec7",
    "sec-ECDinnerMeeting-Dec5",
    "sec-ECMeeting-Dec4",
    "sec-ECMeeting-Dec5",
    "sec-ELMeeting-Dec5",
    "sec-FMeeting-Oct21",
    "sec-Feb-14-mtg3",
    "sec-Feb13",
    "sec-Feb13-mtg1",
    "sec-Feb14",
    "sec-Feb14-mtg1",
    "sec-Feb14-mtg2",
    "sec-Feb15",
    "sec-Feb16",
    "sec-Feb16-mtg1",
    "sec-Feb17",
    "sec-Feb17-mtg1",
    "sec-Feb17-mtg2",
    "sec-Feb18",
    "sec-Feb18-mtg1",
    "sec-Feb18-mtg2",
    "sec-Feb18-mtg3",
    "sec-Feb18-mtg4",
    "sec-Feb19",
    "sec-Feb19-mtg1",
    "sec-Feb19-mtg2",
    "sec-Feb19-mtg3",
    "sec-Feb20",
    "sec-Feb20-mtg1",
    "sec-Feb20-mtg2",
    "sec-Feb20-mtg3",
    "sec-Feb21",
    "sec-Feb21-mtg1",
    "sec-Feb21-mtg2",
    "sec-Feb21-mtg3",
    "sec-Feb21-mtg4",
    "sec-Feb21-mtg5",
    "sec-Feb21-mtg6",
    "sec-Feb21-mtg7",
    "sec-Feb21-mtg8",
    "sec-Feb22",
    "sec-Feb22-mtg1",
    "sec-Feb22-mtg2",
    "sec-Feb22-mtg3",
    "sec-Feb22-mtg4",
    "sec-Feb23",
    "sec-Feb23-mtg1",
    "sec-Feb23-mtg2",
    "sec-Feb23-mtg3",
    "sec-Feb23-mtg4",
    "sec-Feb24",
    "sec-Feb24-mtg1",
    "sec-Feb25",
    "sec-Feb25-mtg1",
    "sec-Feb25-mtg2",
    "sec-Feb26",
    "sec-Feb26-mtg1",
    "sec-Feb26-mtg2",
    "sec-Feb26-mtg3",
    "sec-MLMeeting-Dec4",
    "sec-MeetingAssociatedStates-July13",
    "sec-NAMeeting-Oct22",
    "sec-SigningCeremonies-Oct23",
    "sec-SigningCeremony-Oct3",
    "sec-TripartiteFM-Dec4",
    "sec-TripartiteMeeting-July11",
    "sec-TripartiteWorkingGp-Dec5",
    "section",
    "shorttitles",
    "source",
    "sources",
    "subjects",
    "subseriesvols",
    "summary",
    "symbols",
    "terms",
    "toc",
    "toc-countries",
    "toc-papers",
    "toc-topics",
    "topical",
    "translation-of-the-memorandum",
    "treaties",
    "united-states",
    "volumes",
    "volumesummary"
]

WaxCylinderRevival commented 6 years ago

  1. I wasn't sure if we needed to indicate that these dates didn't have strict human review, but I'm open to changing the div/@ana to "#date_undated-inferred-from-volume-rules", if you think it best.

  2. I do think "Inferred Date Range: {date-min}-{date-max}" and "Date methodology": "Inferred using volume rules." could potentially work across documents and sections, etc. (as single dates have an inferred date range, for example).

  3. Ha, I had run a similar query for distinct values and was trying to resist normalization (but "abouttheseries" and "AbouttheSeries" taunt me, @joewiz). I do think, though, there is a case to be made for adding a @subtype to div[attribute::type eq "section"], grouping them into logical categories based on the current @xml:id (subtype="appendix", "event", etc.), and then using the subtypes to determine which date rules to apply.
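
The grouping idea in point 3 might look something like the sketch below. The function name and every pattern are assumptions made up for illustration. The sample IDs come from the list above, and the subtype names echo ones used in this thread, but the actual rules would need to be agreed on.

```xquery
xquery version "3.1";

(: Hypothetical rules mapping a section @xml:id to a proposed @subtype.
   Each pattern is an illustrative guess, not an agreed convention. :)
declare function local:subtype-for-id($id as xs:string) as xs:string {
    (: Lower-casing folds variants such as "AbouttheSeries" and "abouttheseries" :)
    let $key := lower-case($id)
    return
        if (matches($key, "^appendix")) then "appendix"
        else if (matches($key, "^sec-")) then "event"
        else if (matches($key, "^(about(the)?series|abtseries)$")) then "about-frus-series"
        else if (matches($key, "^index")) then "index"
        else if (matches($key, "^(toc(-.*)?|contents)$")) then "table-of-contents"
        else "undetermined"
};

(: Try the rules on a few IDs from the list above :)
for $id in ("AbouttheSeries", "abtseries", "appendix-10", "sec-Feb13-mtg1", "toc-papers", "covert")
return $id || " -> " || local:subtype-for-id($id)
```

Anything that no rule recognizes falls through to "undetermined", which would leave a visible list of IDs still needing a manual decision.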

WaxCylinderRevival commented 6 years ago

See commits under this pull request: https://github.com/HistoryAtState/frus/pull/197

WaxCylinderRevival commented 6 years ago

| div/@subtype | Frequency (as of 2018-05-08) |
| --- | --- |
| about-frus-series | 27 |
| acknowledgements | 2 |
| additional-volumes | 37 |
| appendix | 14 |
| chapter-introduction | 8 |
| editorial-note | 8301 |
| editorial-policies | 4 |
| errata | 38 |
| errata_document-numbering-error | 2 |
| event | 38 |
| graphic-materials | 10 |
| historical-document | 276062 |
| index | 1130 |
| maps | 1 |
| notes | 45 |
| preface | 378 |
| press-release | 44 |
| referral | 979 |
| related-materials | 2 |
| section | 2 |
| sources | 234 |
| subsection | 85 |
| table-of-contents | 444 |
| undetermined | 5 |
| volume-summary | 20 |
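
A tally like the one above could presumably be produced with a query along these lines. It is a sketch, assuming the same collection path as the @xml:id query earlier in this thread and assuming that every tei:div carrying a @subtype should be counted. The query actually used for these figures is not shown here.

```xquery
xquery version "3.1";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Assumption: volumes live in the collection used by the earlier @xml:id query :)
for $subtype in collection("/db/apps/frus/volumes")//tei:div/@subtype
(: Group the attributes by string value; after grouping, $subtype holds the
   whole group, so count() gives the frequency of each value :)
group by $value := string($subtype)
order by $value
return $value || " " || count($subtype)
```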