Open sonofmun opened 7 years ago
Hey @sonofmun, I would rather have a new node for formatted metadata:
<structured-metadata xmlns="http://capitains.github.io/xmlns">
<dc:author></dc:author>
<dc:title></dc:title>
<dc:editor></dc:editor>
<dc:publisher></dc:publisher>
<dc:year></dc:year>
</structured-metadata>
But with correct DC and DCTerms metadata (up to @AlisonBabeu to decide what subset we should use, I guess). Ultimately, this would surely power this kind of API: https://memory.loc.gov/cgi-bin/oai2_0?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lcoa1.loc.gov:loc.gmd/g3791p.rr002300
That works for me. Just need to make sure to add @xmlns:dc="http://purl.org/dc/elements/1.1/" to the root element. Oxygen is also OK with it.
Let's wait for @AlisonBabeu and decide on the subset to have. Then I'll also publish it on capitains :)
Here is an example. The subset, etc., can be tweaked:
<ti:work xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" xmlns="http://capitains.github.io/xmlns" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
<ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
<ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
<ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
<ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
<structured-metadata xmlns="http://capitains.github.io/xmlns">
<dc:author>Catenae (Novum Testamentum)</dc:author>
<dc:title>Catena In Epistulam Juda</dc:title>
<dc:editor>J. A. Cramer, S.T.P.</dc:editor>
<dc:publisher>Oxford University Press</dc:publisher>
<dc:year>1840</dc:year>
</structured-metadata>
</ti:edition>
</ti:work>
Should the URN for the <dc:author> and the <dc:title> be explicitly named? Or is it enough to have these in the @groupUrn and @workUrn attributes?
Shoot! I notice now that the @xmlns attribute didn't show up on the <structured-metadata> element. Still needs some tweaking.
I have now tweaked the script and the @xmlns now shows up on the <structured-metadata> element. Although I also see that it shows up on the root tag. Is that a problem?
Try this to avoid a weird situation:
<ti:work xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" xmlns:cpt="http://capitains.github.io/xmlns" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
<ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
<ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
<ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
<ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
<cpt:structured-metadata xmlns="http://capitains.github.io/xmlns">
<dc:author>Catenae (Novum Testamentum)</dc:author>
<dc:title>Catena In Epistulam Juda</dc:title>
<dc:editor>J. A. Cramer, S.T.P.</dc:editor>
<dc:publisher>Oxford University Press</dc:publisher>
<dc:year>1840</dc:year>
</cpt:structured-metadata>
</ti:edition>
</ti:work>
Note: do you have any kind of subject information? Like commentary, Bible, or whatever?
We don't have any kind of information like that in the headers. I expect that we have them, though, in the Perseus catalog.
Also, if you have <cpt:structured-metadata>, then you don't need the @xmlns attribute, do you?
I mean that the namespace for "cpt" was already identified in the root element, so I don't think you would need to restate it on the <cpt:structured-metadata> element.
The point is to not have namespaces without a prefix, hence using xmlns:cpt. This makes the namespaces clear and avoids redefining the NS :)
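A minimal sketch of the point about prefixes, assuming Python's standard xml.etree.ElementTree as the parser (the thread itself only mentions Oxygen). A prefix declared once on the root element is in scope for every descendant, so <cpt:structured-metadata> resolves to the CapiTainS namespace without the declaration being restated. The sample document is a trimmed-down illustration, not one of the real metadata files.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as given in the thread.
NS = {
    "ti": "http://chs.harvard.edu/xmlns/cts",
    "cpt": "http://capitains.github.io/xmlns",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative sample: all three prefixes are declared on the root only.
SAMPLE = """<ti:work xmlns:ti="http://chs.harvard.edu/xmlns/cts"
         xmlns:cpt="http://capitains.github.io/xmlns"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ti:edition>
    <cpt:structured-metadata>
      <dc:title>Catena In Epistulam Juda</dc:title>
    </cpt:structured-metadata>
  </ti:edition>
</ti:work>"""

root = ET.fromstring(SAMPLE)
meta = root.find("ti:edition/cpt:structured-metadata", NS)

# ElementTree stores each tag as {namespace-uri}localname.
print(meta.tag)

title_text = meta.find("dc:title", NS).text
print(title_text)
```

The printed tag carries the CapiTainS URI even though the element itself has no xmlns attribute.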
So I think this structured metadata has all of the necessary elements, and I am happy to follow whatever you guys feel is needed. I second the importance of namespaces. In terms of subjects, @sonofmun is right: the works in the Perseus Catalog have them because most of the MODS records I downloaded for editions already had subject headings assigned. Any of the editions we are working with now has likely been given subject headings as well, which we could certainly use, but I'm not sure we need them. Did I miss any questions? :)
So this is what I would suggest. Take a close look at namespace declarations, etc., and let me know if you have any suggestions:
<ti:work xmlns:cpt="http://capitains.github.io/xmlns" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" groupUrn="urn:cts:greekLit:tlg4102" urn="urn:cts:greekLit:tlg4102.tlg045" xml:lang="grc">
<ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
<ti:edition urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
<ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
<ti:description xml:lang="lat">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
<cpt:structured-metadata>
<dc:author>Catenae (Novum Testamentum)</dc:author>
<dc:title>Catena In Epistulam Juda</dc:title>
<dc:editor>J. A. Cramer, S.T.P.</dc:editor>
<dc:publisher>Oxford University Press</dc:publisher>
<dc:year>1840</dc:year>
</cpt:structured-metadata>
</ti:edition>
</ti:work>
I am not sure dc:author and dc:year are correct. It should probably be dc:creator and dc:date. See http://dublincore.org/documents/dces/, also for the date format. Maybe use created from the DC terms (another namespace, but fine by me)? http://dublincore.org/documents/dcmi-terms/#terms-abstract
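A rough way to catch this kind of mismatch, sketched in Python with the standard library. The set below lists the fifteen elements of the Dublin Core Metadata Element Set. The function name unknown_dc_elements is hypothetical, and the check looks only at element names in the dc namespace, not at their content.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set.
DCMES = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

def unknown_dc_elements(xml_text):
    """Return local names used in the dc namespace that DCMES does not define."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    found = []
    for el in root.iter():
        if el.tag.startswith(prefix):
            local = el.tag[len(prefix):]
            if local not in DCMES:
                found.append(local)
    return found

# The metadata block as first proposed in the thread.
SAMPLE = """<cpt:structured-metadata xmlns:cpt="http://capitains.github.io/xmlns"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:author>Catenae (Novum Testamentum)</dc:author>
  <dc:title>Catena In Epistulam Juda</dc:title>
  <dc:editor>J. A. Cramer, S.T.P.</dc:editor>
  <dc:publisher>Oxford University Press</dc:publisher>
  <dc:year>1840</dc:year>
</cpt:structured-metadata>"""

print(unknown_dc_elements(SAMPLE))
```

On this sample the check reports author and year, and also editor, which is likewise not one of the fifteen elements.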
Am I right in assuming, judging from the discussion in #1001, that xml:lang should be "grc" instead of "lat", since the text is Ancient Greek, even if the title is given in Latin?
Which xml:lang? :)
The last one (description) at least. But I actually thought all of them. I guess I didn't think that one through completely. Question is what these attributes are going to be used for.
On <ti:title>, for instance, the @xml:lang attribute should correspond to the language in which the title is given. Here it is about that single element, <ti:title>. The <ti:work>, <ti:edition> and <ti:translation> elements actually describe the text itself, and so the @xml:lang attribute there should correspond to the language of the text we are talking about, with "edition" and "translation" reflecting the language of the edition or translation we are referring to in those elements and "work" referring to the original language of the work, in this case "grc". Does that sound correct, @PonteIneptique?
And <ti:description> is tough because it could include a string put together from several different languages, e.g., author in Latin and publisher in German. So I would vote for no @xml:lang attribute here, although that is probably not best practice, since <ti:description> would then inherit the language tag of its closest ancestor that has a language attribute, in this case <ti:edition>, which probably will not be a good description of the <ti:description>'s language.
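The inheritance rule can be made concrete with a short Python sketch using the standard library. The helper effective_lang is hypothetical, and the sample is an illustration that assumes an xml:lang="grc" on <ti:edition>. ElementTree keeps no parent pointers, so the function builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_lang(root, target):
    """Return the xml:lang that applies to target: its own or the nearest ancestor's."""
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        lang = node.get(XML_LANG)
        if lang is not None:
            return lang
        node = parents.get(node)
    return None

# Illustrative sample: <ti:description> carries no xml:lang of its own.
SAMPLE = """<ti:work xmlns:ti="http://chs.harvard.edu/xmlns/cts" xml:lang="grc">
  <ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
  <ti:edition xml:lang="grc">
    <ti:description>Catenae (Novum Testamentum), Catena In Epistulam Juda</ti:description>
  </ti:edition>
</ti:work>"""

NS = {"ti": "http://chs.harvard.edu/xmlns/cts"}
root = ET.fromstring(SAMPLE)
description = root.find("ti:edition/ti:description", NS)
title = root.find("ti:title", NS)

print(effective_lang(root, description))  # inherited from <ti:edition>
print(effective_lang(root, title))        # set on the element itself
```

Without its own attribute, the description ends up labelled with the language of the edition's text, which is the mismatch described above.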
And <ti:label>? Same as
Exactly. Since I want to automatically create the __cts__.xml files from the information in the teiHeader, what this means is that we will need to start explicitly marking the @xml:lang of at least the work's title and author in the teiHeader if it is not in English (which I will use as the default).
Do we have to add anything to langUsage then?
But the advantage of automatic creation of the metadata files is that we can concentrate on getting the information right in the teiHeader and (almost) completely ignore the metadata files.
And I think we need to add <language ident="lat">Latin</language> to <langUsage>, since I think the accepted code is 'la'. But I am not so sure about this really. That is, of course, if we use @xml:lang="lat" anywhere in the file.
Agree with @sonofmun, except that even if it is a mixture, there should be a language. Just put something such as "mul".
Header as in titleStmt or rather sourceDesc? I would say the former, since in sourceDesc I remember instances where these gave the name of the book rather than the title of the following text (tlg0062).
Yes, in titleStmt. The sourceDesc describes the volume, not necessarily the work.
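As a sketch of what the automatic step might look like (not the actual script, which is not shown in this thread), the following Python reads the title from titleStmt and falls back to English when no @xml:lang is given. The function name title_and_lang, the fallback code "eng", and the sample header are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
DEFAULT_LANG = "eng"  # assumption: code to use when the header gives no xml:lang

def title_and_lang(tei_text):
    """Return (title, language) from teiHeader/fileDesc/titleStmt/title, or None."""
    root = ET.fromstring(tei_text)
    title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI)
    if title is None:
        return None
    return (title.text or "").strip(), title.get(XML_LANG, DEFAULT_LANG)

# Illustrative header, reduced to the one element the sketch reads.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title xml:lang="lat">Catena In Epistulam Juda</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
</TEI>"""

print(title_and_lang(SAMPLE))
```

Reading from titleStmt rather than sourceDesc follows the point above that sourceDesc may describe the volume rather than the work.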
So many great ideas, so many questions. @PonteIneptique, you are quite right that the correct terms for Dublin Core are dc:creator and dc:date, not dc:author and dc:year. Also, @sonofmun, I think you are quite right that we want to get all the data correct in the TEI header, both the titleStmt and the sourceDesc, so that the cts_xml files can be automatically generated. The whole language issue came about because until the Methodius works I had only been editing typos such as misspelled author or work names and had ignored the language issue.
@PonteIneptique I did not see any reference to the dc:date format at that link (perhaps I did not look hard enough). And I also am not sure what you mean by the reference to dc:terms. Could you give an example of what you mean?
For the dcterms namespace (different from dc, I think, @AlisonBabeu?), you have http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#created which is relevant to our situation (publication time).
Date format: https://www.w3.org/TR/NOTE-datetime and you're actually good with year only.
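A small Python sketch of the date-only forms in that W3C note: year, year and month, or complete date. The function name is_w3c_date is hypothetical. The pattern deliberately leaves out the time-of-day forms and does not check that a day exists in the given month.

```python
import re

# Date-only granularities from the W3C NOTE-datetime profile:
# YYYY, YYYY-MM, YYYY-MM-DD. Time-of-day forms are left out, and the
# day is only range-checked (01-31), not checked against the month.
W3C_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")

def is_w3c_date(value):
    """True if value is a year, year-month, or complete date in W3CDTF form."""
    return W3C_DATE.fullmatch(value) is not None

for candidate in ["1840", "1840-06", "1840-06-15", "1840-13", "c. 1840"]:
    print(candidate, is_w3c_date(candidate))
```

A bare year such as 1840 passes, which matches the remark that year only is fine.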
I am thinking it would be great to have a structured-metadata/created also on the work (with the presumed year of production?) when we have it (catalog??) and two structured-metadata/date on the textgroup as well?
It looks like the namespace for dcterms is http://purl.org/dc/terms/, so that is different from plain dc.
New proposal (there is no <structured-metadata> here for the <work>).
<ti:work xmlns:cpt="http://capitains.github.io/xmlns" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:ti="http://chs.harvard.edu/xmlns/cts" groupUrn="urn:cts:greekLit:tlg4102" xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045">
<ti:title xml:lang="lat">Catena In Epistulam Juda</ti:title>
<ti:edition xml:lang="grc" urn="urn:cts:greekLit:tlg4102.tlg045.opp-grc1" workUrn="urn:cts:greekLit:tlg4102.tlg045">
<ti:label xml:lang="lat">Catena In Epistulam Juda</ti:label>
<ti:description xml:lang="mul">Catenae (Novum Testamentum), Catena In Epistulam Juda, J. A. Cramer, S.T.P., Oxford University Press, 1840</ti:description>
<cpt:structured-metadata>
<dc:creator xml:lang="lat">Catenae (Novum Testamentum)</dc:creator>
<dc:title xml:lang="lat">Catena In Epistulam Juda</dc:title>
<dc:editor xml:lang="lat">J. A. Cramer, S.T.P.</dc:editor>
<dc:publisher xml:lang="eng">Oxford University Press</dc:publisher>
<dct:created>1840</dct:created>
</cpt:structured-metadata>
</ti:edition>
</ti:work>
The @xml:lang on the <dc:editor> is wrong, I think (it should be "eng"), but that should be corrected in the teiHeader of the XML file. Comments once more?
Hmm. Looking at this structured-metadata, I wonder if <dc:creator> and <dct:created> shouldn't both refer to the edition here. My problem is that if they are related, then created should be the time at which creator created the work. And that is not the case here: created is when the editor or the publisher created the work.
Seems good. Other things we might want to add (at some point) (DC being the related correct namespace)
Any addendum?
Good, I think that answers my question from the previous comment. I think I would change <dct:created> to <dc:copyrighted> on the <edition> or <translation> and save <dct:created> (or <dc:created>?) for the work itself.
Note that technically, I think created refers to the creation of the FRBR element represented by translation or version, not the work itself. And it is not even clear if it should not be the date of publication of the xml.
ie.
@AlisonBabeu you're our savior here.
This raises an interesting question for me: to what entity does this structured-metadata refer? Does it refer to the XML resource that we are publishing? If so, then all of this information needs to be changed. The publication information for the XML file is in the teiHeader, though, so this would seem redundant. Or does it refer to the edition that is the source for the XML file we are publishing? If so, then I don't think we would want to include <dc:provenance> unless we want to talk about the library from which the scan was taken.
The second (that it describes the edition that is the source of the XML) makes more sense to me and it is certainly what we have been doing and, I think, what we need to continue to do. Thus, if we discover that using DC as we have done here should be describing the XML file and not the edition that it comes from, we obviously need to change how we do the <structured-metadata>.
@AlisonBabeu's reaction :
BTW, I don't see editor or copyrighted as tags at http://dublincore.org/documents/dces/ or http://dublincore.org/documents/2012/06/14/dcmi-terms/. Am I looking in the wrong place? Or do we need to figure out different tags for these (e.g., <dct:dateCopyrighted>)?
My bad: it seems to be dateCopyrighted. And editor should be contributor, you're right... I did not check the editor.
Is there any way to make clear that the entity in the <dc(t):contributor> element is the editor, e.g., with a @type attribute? It looks like we can have @type on contributor if I am reading this correctly: https://www.w3.org/1999/02/22-rdf-syntax-ns#Property
I think it's time to stop and let the librarian smash us with her knowledge.
OK, I will stop now. @AlisonBabeu Please save us!
Well, first off, gentlemen, I'm not in the business of saving anyone or smashing anyone. I've almost lost track of the questions at this point; I wish you could respond to specific comments at any point in the comment stream, not just at the bottom.
But, aside from the fact that I've now been compared to an ancient evil Sith Lord and I haven't even had my second cup of coffee, here are some thoughts:
1) One of the biggest "problems" with DC is that it has always been designed to be as simple as possible, so many of the struggles you are having getting the right elements to match up with the data come from the fact that the right elements might not be available. This is one reason I used MODS for the catalog records: it allowed a lot more descriptive nuance.
2) I agree with Matt in that I think we need to use the structured metadata to describe the edition that is represented within the XML file, within reason that is of course. A lot of the Dublin Core metadata has been intended to describe the XML file itself as a way of preserving provenance and other data within library systems.
3) You are right: dc:created == date of "publication" of the XML; dc:copyrighted == date of the publication of the book. Of course this also raises the challenge that most of the editions we use aren't even copyrighted! So technically this is incorrect. Dates are purposely vague within Dublin Core; in fact, if you look at the MODS to DC mapping (http://www.loc.gov/standards/mods/mods-dcsimple.html), over five different date elements all map to dc:date. Perhaps we just need to add attributes to dc:date to get at the information we want for things like date created vs. date published. The date created information for works is not currently available in the Perseus Catalog, though Greg has long wanted it there.
4) Perhaps we should discuss some of this during the staff meeting, or another meeting between the three of us, @sonofmun and @PonteIneptique
Completely agree with meeting. I propose you find a day next week and I'll comply ;)
Great! Thanks a lot, @AlisonBabeu And remember, it was @PonteIneptique that compared you to Palpatine, not me.
I would suggest a meeting among the three of us, though that means it will have to wait until next week since @PonteIneptique won't be available before next Monday. If we can put @type attributes on the <dc:date> and <dc:contributor> elements, I think that would give us a lot of what we need.
I'm happy to meet perhaps on Tuesday before the staff meeting like last time, that could work for me. Also, in terms of the dc schema, I don't think unfortunately that it allows attributes on the element dc:date or dc:contributor for that matter.
OK. So Tuesday at 4pm CET/10am Eastern time?
Hey @sonofmun and @PonteIneptique, as impossible as it seems, I forgot about a long-scheduled appointment for tomorrow at 4 pm CET/10 am Eastern time, and Tuesday as well. Could we do Wednesday or Friday at 10 am? Or after the staff meeting?
Wednesday would be great; after the staff meeting I'll be at a beer tasting :D
How about tomorrow morning? Not at 10 am, I'm realizing now, but earlier, say 8 am my time and 2 pm your time? Sorry for all the craziness.
@planatheisa @PonteIneptique @AlisonBabeu I am getting ready to do a wholesale change of all the metadata files here to bring the information from the headers of the XML files automatically into the __cts__.xml metadata files. I think when doing this, we should structure the <description> element of the editions better. I would suggest the following:
Oxygen says that this is legal, and this would help us to solve several problems besides just having poorly structured metadata in the <description> element. For instance, we could have different @xml:lang attributes when, e.g., the title is in Latin and the publisher in German. Does anyone have anything against this schema?