Metadata files - GitHub issues

sonofmun commented 7 years ago

@planatheisa @PonteIneptique @AlisonBabeu I am getting ready to do a wholesale change of all the metadata files here to bring the information from the headers of the XML files automatically into the __cts__.xml metadata files. I think when doing this, we should structure the <description> element of the editions better. I would suggest the following:

<ti:description>
  <ti:author></ti:author>
  <ti:title></ti:title>
  <ti:editor></ti:editor>
  <ti:publisher></ti:publisher>
  <ti:year></ti:year>
</ti:description>

Oxygen says that this is legal, and it would help us solve several problems besides just having poorly structured metadata in the <description> element. For instance, we could have different @xml:lang attributes when, e.g., the title is in Latin and the publisher in German. Does anyone have anything against this schema?

PonteIneptique commented 7 years ago

Hey @sonofmun, I would rather have a new node for formatted metadata:

<structured-metadata xmlns="http://capitains.github.io/xmlns">
  <dc:author></dc:author>
  <dc:title></dc:title>
  <dc:editor></dc:editor>
  <dc:publisher></dc:publisher>
  <dc:year></dc:year>
</structured-metadata>

But with correct DC and DCTerms metadata (up to @AlisonBabeu to decide what subset we should use, I guess). Ultimately, this would certainly be able to power this kind of API: https://memory.loc.gov/cgi-bin/oai2_0?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lcoa1.loc.gov:loc.gmd/g3791p.rr002300

sonofmun commented 7 years ago

That works for me. Just need to make sure to add @xmlns:dc="http://purl.org/dc/elements/1.1/" to the root element. Oxygen is also OK with it.

PonteIneptique commented 7 years ago

Let's wait for @AlisonBabeu and decide on the subset to have. Then I'll also publish it on capitains :)

sonofmun commented 7 years ago

Here is an example. The subset, etc., can be tweaked:

<ti:work xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" xmlns="http://capitains.github.io/xmlns" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
  <ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
  <ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
    <ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
    <ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
    <structured-metadata xlmns="http://capitains.github.io/xmlns">
      <dc:author>Catenae (Novum Testamentum)</dc:author>
      <dc:title>Catena In Epistulam Juda</dc:title>
      <dc:editor>J. A. Cramer, S.T.P.</dc:editor>
      <dc:publisher>Oxford University Press</dc:publisher>
      <dc:year>1840</dc:year>
    </structured-metadata>
  </ti:edition>
</ti:work>

Should the URN for the <dc:author> and the <dc:title> be explicitly named? Or is it enough to have these in the @groupUrn and @workUrn attributes?

sonofmun commented 7 years ago

Shoot! I notice now that the @xmlns attribute didn't show up on the <structured-metadata> element. Still needs some tweaking.

sonofmun commented 7 years ago

I have now tweaked the script and the @xmlns now shows up on the <structured-metadata> element. Although I also see that it shows up on the root tag. Is that a problem?

PonteIneptique commented 7 years ago

Try this to avoid a weird situation:

<ti:work xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" xmlns:cpt="http://capitains.github.io/xmlns" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
  <ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
  <ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
    <ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
    <ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
    <cpt:structured-metadata xmlns="http://capitains.github.io/xmlns">
      <dc:author>Catenae (Novum Testamentum)</dc:author>
      <dc:title>Catena In Epistulam Juda</dc:title>
      <dc:editor>J. A. Cramer, S.T.P.</dc:editor>
      <dc:publisher>Oxford University Press</dc:publisher>
      <dc:year>1840</dc:year>
    </cpt:structured-metadata>
  </ti:edition>
</ti:work>

PonteIneptique commented 7 years ago

Note : do you have any kind of subject information ? Like commentary, bible or whatever ?

sonofmun commented 7 years ago

We don't have any kind of information like that in the headers. I expect that we have them, though, in the Perseus catalog. Also, if you have <cpt:structured-metadata>, then you don't need the @xmlns attribute, do you?

sonofmun commented 7 years ago

I mean, that the namespace for "cpt" was already identified in the root element, so I don't think you would need to restate it on the <cpt:structured-metadata> element.

PonteIneptique commented 7 years ago

The point is to not have namespaces without a prefix, hence using xmlns:cpt. This makes the namespaces clear and avoids redefining a namespace :)
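
For readers weighing the two declaration styles, here is a minimal Python sketch (standard library only, not taken from the project's script) suggesting that a prefixed element and an element under a default namespace resolve to the same qualified name, so the choice is about readability rather than meaning:

```python
import xml.etree.ElementTree as ET

CPT = "http://capitains.github.io/xmlns"

# Variant 1: the namespace is bound to the "cpt" prefix.
prefixed = ET.fromstring(f'<cpt:structured-metadata xmlns:cpt="{CPT}"/>')

# Variant 2: the same namespace is declared as the default (no prefix).
default = ET.fromstring(f'<structured-metadata xmlns="{CPT}"/>')

# ElementTree stores tags in "{namespace}local-name" form, so both
# variants end up with an identical tag.
print(prefixed.tag)
print(prefixed.tag == default.tag)
```

Because both forms are equivalent to a parser, declaring every prefix once on the root element keeps the file unambiguous without repeating declarations further down.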

AlisonBabeu commented 7 years ago

So I think this structured metadata has all of the necessary elements, and I am happy to follow whatever you guys feel is needed. I second the importance of namespaces. In terms of subjects, @sonofmun is right: the works in the Perseus Catalog have them because most of the MODS records I downloaded for editions already had subject headings assigned. Any of the editions we are working with now has likely been given subject headings as well, which we could certainly use, but I'm not sure we need them. Did I miss any questions? :)

sonofmun commented 7 years ago

So this is what I would suggest. Take a close look at namespace declarations, etc., and let me know if you have any suggestions:

<ti:work xmlns:cpt="http://capitains.github.io/xmlns" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" groupUrn="urn:cts:greekLit:tlg4102" urn="urn:cts:greekLit:tlg4102.tlg045" xml:lang="grc">
  <ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
  <ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
    <ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
    <ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
    <cpt:structured-metadata>
      <dc:author>Catenae (Novum Testamentum)</dc:author>
      <dc:title>Catena In Epistulam Juda</dc:title>
      <dc:editor>J. A. Cramer, S.T.P.</dc:editor>
      <dc:publisher>Oxford University Press</dc:publisher>
      <dc:year>1840</dc:year>
    </cpt:structured-metadata>
  </ti:edition>
</ti:work>

PonteIneptique commented 7 years ago

I am not sure dc:author and dc:year are correct. It should probably be dc:creator and dc:date. See http://dublincore.org/documents/dces/ also for the date format. Maybe created from the dc terms (another namespace but fine by me) ? http://dublincore.org/documents/dcmi-terms/#terms-abstract

planatheisa commented 7 years ago

Am I right in assuming, judging from the discussion in #1001, that xml:lang should be "grc" instead of "lat", since the text is Ancient Greek, even if the title is given in Latin?

PonteIneptique commented 7 years ago

Which xml:lang ? :)

planatheisa commented 7 years ago

The last one (description) at least. But I actually thought all of them. I guess I didn't think that one through completely. Question is what these attributes are going to be used for.

sonofmun commented 7 years ago

On <ti:title>, for instance, the @xml:lang attribute should correspond to the language in which the title is given. Here it is about that single element, <ti:title>. The <ti:work>, <ti:edition> and <ti:translation> actually describe the text itself, and so the @xml:lang attribute there should correspond to the language of the text we are talking about, with "edition" and "translation" reflecting the language of the edition or translation we are referring to in those elements and "work" referring to the original language of the work, in this case "grc". Does that sound correct, @PonteIneptique ?

sonofmun commented 7 years ago

And <ti:description> is tough because it could include a string put together from several different languages, e.g., author in Latin and publisher in German. So I would vote for no @xml:lang attribute here. Although I don't think that is probably best practice since <ti:description> should actually inherit the language tag of its closest ancestor that has a language attribute, in this case from <ti:edition>, which probably will not be a good description of the <ti:description>'s language.
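
The inheritance behaviour described here can be illustrated with a short Python sketch. It uses only the standard library; the helper name `effective_lang` and the shortened sample document are illustrative assumptions, not part of any existing tool:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TI = "http://chs.harvard.edu/xmlns/cts"

doc = ET.fromstring(
    f'<ti:work xmlns:ti="{TI}" xml:lang="grc">'
    f'<ti:edition xml:lang="grc">'
    f'<ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>'
    f'<ti:description>J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>'
    f'</ti:edition></ti:work>'
)

def effective_lang(root, target):
    """Return the xml:lang in scope for target, or None if none is set."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        lang = node.get(XML_LANG)
        if lang is not None:
            return lang
        node = parents.get(node)
    return None

label = doc.find(f".//{{{TI}}}label")
description = doc.find(f".//{{{TI}}}description")
print(effective_lang(doc, label))        # the label's own attribute
print(effective_lang(doc, description))  # inherited from ti:edition
```

With no attribute of its own, the description picks up "grc" from the edition, which is exactly the mismatch the comment above worries about.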

planatheisa commented 7 years ago

And <ti:label>? Same as <ti:title>, so the language of the label (which is, I guess, the title of the specific edition)?

sonofmun commented 7 years ago

Exactly. Since I want to automatically create the __cts__.xml files from the information in the teiHeader, what this means is that we will need to start explicitly marking the @xml:lang of at least the work's title and author in the teiHeader if it is not in English (which I will use as the default).

planatheisa commented 7 years ago

Do we have to add anything to langUsage then?

sonofmun commented 7 years ago

But the advantage of automatic creation of the metadata files is that we can concentrate on getting the information right in the teiHeader and (almost) completely ignore the metadata files. And I think we need to add <language ident="lat">Latin</language> to <langUsage> since I think the accepted code is 'la'. But I am not so sure about this really. That is, of course, if we use @xml:lang="lat" anywhere in the file.

PonteIneptique commented 7 years ago

Agree with @sonofmun, except that even if it is a mixture, there should be a language. Just put something such as "mul".

planatheisa commented 7 years ago

Header as in titleStmt or rather sourceDesc? I would say the former, since in sourceDesc I remember instances where these gave the name of the book rather than the title of the following text (tlg0062).

sonofmun commented 7 years ago

Yes, in titleStmt. The sourceDesc describes the volume, not necessarily the work.
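
As a rough illustration of reading the titleStmt with English as the fallback language, here is a Python sketch using only the standard library. The sample header is hypothetical and heavily shortened, and `read_title_stmt` is an invented helper name, not the actual script discussed in this thread:

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
NS = {"tei": TEI}

# Hypothetical, heavily shortened header; real files carry far more detail.
SAMPLE = f"""
<TEI xmlns="{TEI}">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title xml:lang="lat">Catena In Epistulam Juda</title>
        <author xml:lang="lat">Catenae (Novum Testamentum)</author>
        <editor>J. A. Cramer, S.T.P.</editor>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>
"""

def read_title_stmt(tei_root, default_lang="eng"):
    """Collect (text, language) pairs from titleStmt, defaulting the language."""
    stmt = tei_root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt", NS)
    fields = {}
    for name in ("title", "author", "editor"):
        node = stmt.find(f"tei:{name}", NS)
        if node is not None:
            # An element without xml:lang is assumed to be in English.
            fields[name] = (node.text.strip(), node.get(XML_LANG, default_lang))
    return fields

fields = read_title_stmt(ET.fromstring(SAMPLE.strip()))
print(fields["title"])
print(fields["editor"])
```

The sketch only looks at an element's own attribute; a fuller version would probably also honour an xml:lang inherited from an ancestor.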

AlisonBabeu commented 7 years ago

So many great ideas, so many questions. @PonteIneptique, you are quite right that the correct terms for Dublin Core are dc:creator and dc:date, not dc:author and dc:year. Also, @sonofmun, I think you are quite right that we want to get all the data correct in the TEI header, both the titleStmt and the sourceDesc, so that the __cts__.xml files can be automatically generated. The whole language issue came about because, until the Methodius works, I had only been editing typos such as misspelled author or work names and had ignored the language issue.

sonofmun commented 7 years ago

@PonteIneptique I did not see any reference to the dc:date format at that link (perhaps I did not look hard enough). And I also am not sure what you mean by the reference to dc:terms. Could you give an example of what you mean?

PonteIneptique commented 7 years ago

For the dcterms namespace (different from dc I think @AlisonBabeu ?), you got http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#created which is relevant to our situation (publication time).

Date format : https://www.w3.org/TR/NOTE-datetime and you're actually good with year only.
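
A small Python sketch of what a check against that note's date-only forms might look like. It covers only the YYYY, YYYY-MM and YYYY-MM-DD granularities, and `is_w3c_date` is an illustrative name rather than an existing function:

```python
import re

# Date-only granularities from the W3C date and time note:
# YYYY, YYYY-MM, YYYY-MM-DD. Time-of-day forms are left out of this sketch.
W3C_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")

def is_w3c_date(value):
    """True if value is a year, year-month, or full date in W3C notation."""
    return W3C_DATE.fullmatch(value) is not None

print(is_w3c_date("1840"))
print(is_w3c_date("1840-05"))
print(is_w3c_date("May 1840"))
```

The pattern checks the shape of the string only; it does not verify that a given day exists in a given month.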

I am thinking it would be great to have a structured-metadata/created on the work as well (with the presumed year of production?) when we have it (catalog??), and two structured-metadata/date on the textgroup as well?

sonofmun commented 7 years ago

It looks like the namespace for dcterms is http://purl.org/dc/terms/, so that is different from plain dc.

sonofmun commented 7 years ago

New proposal (there is no <structured-metadata> here for the <work>).

<ti:work xmlns:cpt="http://capitains.github.io/xmlns" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
  <ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
  <ti:edition xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
    <ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
    <ti:description xml:lang="mul">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
    <cpt:structured-metadata>
      <dc:creator xml:lang="lat">Catenae (Novum Testamentum)</dc:creator>
      <dc:title xml:lang="lat">Catena In Epistulam Juda</dc:title>
      <dc:editor xml:lang="lat">J. A. Cramer, S.T.P.</dc:editor>
      <dc:publisher xml:lang="eng">Oxford University Press</dc:publisher>
      <dct:created>1840</dct:created>
    </cpt:structured-metadata>
  </ti:edition>
</ti:work>

The @xml:lang on the <dc:editor> is wrong, I think (it should be "eng"), but that should be corrected in the teiHeader of the XML file. Comments once more?
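
A block of this shape could plausibly be generated rather than typed by hand. The following Python sketch uses only the standard library; `qname` and `build_structured_metadata` are invented helper names, not the script mentioned in this thread. The editor is left out because the Dublin Core element set defines no `editor` element:

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cpt": "http://capitains.github.io/xmlns",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Register the prefixes so serialization uses cpt:, dc:, dct: rather than ns0:.
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

def qname(prefixed):
    """Turn 'dc:title' into ElementTree's '{namespace}title' form."""
    prefix, local = prefixed.split(":")
    return f"{{{NAMESPACES[prefix]}}}{local}"

def build_structured_metadata(fields):
    """fields: iterable of (prefixed name, text, language or None)."""
    block = ET.Element(qname("cpt:structured-metadata"))
    for name, text, lang in fields:
        child = ET.SubElement(block, qname(name))
        child.text = text
        if lang is not None:
            child.set(XML_LANG, lang)
    return block

block = build_structured_metadata([
    ("dc:creator", "Catenae (Novum Testamentum)", "lat"),
    ("dc:title", "Catena In Epistulam Juda", "lat"),
    ("dc:publisher", "Oxford University Press", "eng"),
    ("dct:created", "1840", None),
])
print(ET.tostring(block, encoding="unicode"))
```

When the block is serialized on its own, ElementTree writes the namespace declarations on the block itself; appended to a full `ti:work` tree, they would be emitted once on the root instead.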

sonofmun commented 7 years ago

Hmm. Looking at this structured-metadata, I wonder if <dc:creator> and <dct:created> shouldn't refer both to the edition here. My problem is that if they are related, then created should be the time at which creator created the work. And that is here not the case. created is when the editor or the publisher created the work.

PonteIneptique commented 7 years ago

Seems good. Other things we might want to add (at some point) (DC being the related correct namespace)

edition, translation
- dc:license
- dc:description
- dc:identifier -> worldcat record ?
- dc:source for the scan
- dc:copyrighted for the date of publication
- dc:provenance with "Encoded and published by University of Leipzig" or something
- dc:language (including those of notes and metadata)
work
- multiple langs for title
- dc:subject
- dc:created for the date of creation
- dc:description
- dc:identifier (worldcat ?)
textgroup
- dc:identifier for other identifiers

Any addendum ?

sonofmun commented 7 years ago

Good, I think that answers my question from the previous comment. I think I would change <dct:created> to <dc:copyrighted> on the <edition> or <translation> and save <dct:created> (or <dc:created>?) for the work itself.

PonteIneptique commented 7 years ago

Note that technically, I think created refers to the creation of the FRBR element represented by translation or version, not the work itself. And it is not even clear if it should not be the date of publication of the xml.

ie.

dc:created == date of "publication" of the XML
dc:copyrighted == date of the publication of the book

@AlisonBabeu you're our savior here.

sonofmun commented 7 years ago

This raises an interesting question for me: to what entity does this structured-metadata refer? Does it refer to the XML resource that we are publishing? If so, then all of this information needs to be changed. The publication information for the XML file is in the teiHeader, though, so this would seem redundant. Or does it refer to the edition that is the source for the XML file we are publishing? If so, then I don't think we would want to include <dc:provenance> unless we want to talk about the library from which the scan was taken. The second (that it describes the edition that is the source of the XML) makes more sense to me and it is certainly what we have been doing and, I think, what we need to continue to do. Thus, if we discover that using DC as we have done here should be describing the XML file and not the edition that it comes from, we obviously need to change how we do the <structured-metadata>.

PonteIneptique commented 7 years ago

@AlisonBabeu's reaction: [image]

sonofmun commented 7 years ago

BTW, I don't see editor or copyrighted as tags at http://dublincore.org/documents/dces/ or http://dublincore.org/documents/2012/06/14/dcmi-terms/. Am I looking in the wrong place? Or do we need to figure out different tags for these (e.g., <dct:dateCopyrighted>)?

PonteIneptique commented 7 years ago

my bad : it seems to be dateCopyrighted. And editor should be contributor, you're right... I did not check the editor
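
The renames suggested so far can be collected in a small lookup table. This Python sketch is only an illustration: `RENAMES` and `canonical_term` are invented names, and mapping "year" to `dateCopyrighted` is an assumption, since the thread leaves open which date term fits best:

```python
# Working names used earlier in the thread, mapped to Dublin Core terms.
# The entry for "year" is an assumption; dc:date is the other candidate.
DC = "http://purl.org/dc/elements/1.1/"
DCT = "http://purl.org/dc/terms/"

RENAMES = {
    "author": (DC, "creator"),
    "editor": (DC, "contributor"),
    "year": (DCT, "dateCopyrighted"),
}

def canonical_term(working_name):
    """Return the '{namespace}local-name' form for a working field name."""
    # Names without an entry (title, publisher) already match a dc element.
    ns, local = RENAMES.get(working_name, (DC, working_name))
    return f"{{{ns}}}{local}"

print(canonical_term("author"))
print(canonical_term("year"))
```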

sonofmun commented 7 years ago

Is there any way to make clear that the entity in the <dc(t):contributor> element is the editor, e.g., with a @type attribute? It looks like we can have @type on contributor if I am reading this correctly: https://www.w3.org/1999/02/22-rdf-syntax-ns#Property

PonteIneptique commented 7 years ago

I think it's time to stop and let the librarian smash us with her knowledge.

sonofmun commented 7 years ago

OK, I will stop now. @AlisonBabeu Please save us!

AlisonBabeu commented 7 years ago

Well first off gentlemen, I'm not in the business of saving anyone or smashing anyone. I've almost lost track of the questions at this point, I wish you could respond to specific comments at any point in the comment stream not just at the bottom.

But, aside from the fact, that I've now been compared to an ancient evil Sith Lord and I haven't even had my second cup of coffee, here are some thoughts:

1) One of the biggest "problems" with DC is that it has always been designed to be as simple as possible, so many of the struggles you are having getting the right elements to match up with the data come from the fact that the right elements might not be available. This is one reason I used MODS for the catalog records: it allowed a lot more descriptive nuance.

2) I agree with Matt in that I think we need to use the structured metadata to describe the edition that is represented within the XML file, within reason that is of course. A lot of the Dublin Core metadata has been intended to describe the XML file itself as a way of preserving provenance and other data within library systems.

3) You are right, dc:created == date of "publication" of the XML and dc:copyrighted == date of the publication of the book. Of course this also raises the challenge that most of the editions we use aren't even copyrighted! So technically this is incorrect. Dates are purposely vague within Dublin Core; in fact, if you look at the MODS to DC mapping (http://www.loc.gov/standards/mods/mods-dcsimple.html), over five different date elements all map to dc:date. Perhaps we just need to add attributes to dc:date to get at the information we want for things like date created vs. date published. The date created information for works is not currently available in the Perseus Catalog, though Greg has long wanted it there.

4) Perhaps we should discuss some of this during the staff meeting, or another meeting between the three of us, @sonofmun and @PonteIneptique

PonteIneptique commented 7 years ago

Completely agree with meeting. I propose you find a day next week and I'll comply ;)

sonofmun commented 7 years ago

Great! Thanks a lot, @AlisonBabeu. And remember, it was @PonteIneptique that compared you to Palpatine, not me. I would suggest a meeting among the three of us, though that means it will have to wait until next week since @PonteIneptique won't be available before next Monday. If we can put @type attributes on the <dc:date> and <dc:contributor> elements, I think that would give us a lot of what we need.

AlisonBabeu commented 7 years ago

I'm happy to meet perhaps on Tuesday before the staff meeting like last time, that could work for me. Also, in terms of the dc schema, I don't think unfortunately that it allows attributes on the element dc:date or dc:contributor for that matter.

sonofmun commented 7 years ago

OK. So Tuesday at 4pm CET/10am Eastern time?

AlisonBabeu commented 7 years ago

Hey @sonofmun and @PonteIneptique, as much as it seems impossible, I forgot about a long-scheduled appointment for tomorrow at 4 pm CET/10 am Eastern time, and Tuesday as well. Could we do Wednesday or Friday at 10 am? Or after the staff meeting?

PonteIneptique commented 7 years ago

Wednesday would be great, after staff meeting I'll be at a beer tasting :D

AlisonBabeu commented 7 years ago

How about tomorrow morning, but not at 10 am I'm realizing now, how about earlier, say 8 am my time and 2 pm your time? Sorry for all the craziness.

OpenGreekAndLatin / First1KGreek

Metadata files #1014