dracor-org / dracor-api

eXistdb application for dracor.org
MIT License

Add short name for authors to metadata file #119

Closed lehkost closed 3 years ago

lehkost commented 3 years ago

For space-saving plotting and similar purposes, we should add a short version for author names to the delivered metadata. In the JSON file it could look like this:

"authors": [
  {
    "key": "pnd:118572121",
    "name": "Lessing, Gotthold Ephraim",
    "shortname": "Lessing"
  }
],

Short names could be extracted in two ways. If there's a proposed short name in the TEI itself, like in SpanDraCor (link to example) …

<name type="short">Galdós</name>

… we could use that. If there's no such thing, we could extract it following a rule, like: "Extract all letters until the first comma".
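The two extraction paths described above can be sketched in a few lines of Python. This is only an illustration, not the API's implementation: the function name `short_name` is hypothetical, and the fragments are parsed without the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET


def short_name(author: ET.Element) -> str:
    """Return a short author name for an <author> element (no namespace assumed)."""
    # Preferred: a short name that is explicitly encoded in the TEI.
    explicit = author.find("name[@type='short']")
    if explicit is not None and explicit.text and explicit.text.strip():
        return explicit.text.strip()
    # Fallback rule from the proposal: everything up to the first comma.
    full = "".join(author.itertext()).strip()
    return full.split(",", 1)[0].strip()


galdos = ET.fromstring('<author><name type="short">Galdós</name></author>')
schiller = ET.fromstring('<author key="pnd:118607626">Schiller, Friedrich</author>')
print(short_name(galdos))    # Galdós
print(short_name(schiller))  # Schiller
```

A name without a comma (e.g. "Aristophanes") is returned unchanged by the fallback.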

lehkost commented 3 years ago

Right now, our main corpora have author information like this, stored as a plain string within the main element:

<author key="pnd:118607626">Schiller, Friedrich</author>

We should follow the example of SpanDraCor (see above) and generally introduce (additional) short name versions, either the surname or the better-known nom de plume (like for "Voltaire"). The Schiller example would look like this:

<author key="wikidata:Q22670">
  <name type="full">Schiller, Friedrich</name>
  <name type="short">Schiller</name>
  <idno type="pnd">118607626</idno>
</author>

Then we could also finally provide author information in the two-dimensional, table-style (csv) corpus metadata files, like this:

name,                           id,        first_author, number_of_coauthors, yearNormalized, […]
schiller-die-braut-von-messina, ger000067, Schiller,     0,                   1803, […]

In the case of co-authors, we would just put their number there (1 or 2), which we can use where needed. This will come in handy immediately for both the website and our R library, since it allows a denser display of information.
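As a rough sketch of how the two proposed columns could be derived, assuming the short names of a play's authors are already available as an ordered list (the helper `author_columns` is hypothetical; the column names are taken from the example above):

```python
import csv
import io


def author_columns(short_names: list[str]) -> dict:
    """Derive the two proposed CSV columns from a play's ordered short names."""
    return {
        "first_author": short_names[0] if short_names else "",
        "number_of_coauthors": max(len(short_names) - 1, 0),
    }


row = {
    "name": "schiller-die-braut-von-messina",
    "id": "ger000067",
    **author_columns(["Schiller"]),
    "yearNormalized": 1803,
}

# Write a header line and one data row to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```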

Once we decide upon the way to introduce short names in the TEI files, I can update all corpora within a day.

cmil commented 3 years ago

An alternative would be to properly identify the semantic parts of a name, as is for instance done in SweDraCor, e.g.:

<author key="wikidata:Q7724" xml:id="StrindbergA">
  <persName>
    <forename>August</forename>
    <surname>Strindberg</surname>
  </persName>
  <idno type="LIBRIS">http://libris.kb.se/auth/94541</idno>
  <idno type="Litteraturbanken">http://litteraturbanken.se/#forfattare/StrindbergA</idno>
  <idno type="SBL">http://www.nad.riksarkivet.se/sbl/artikel/34518</idno>
</author>

The advantages of this approach would be several:

See the Personal Names section of the TEI documentation for more examples of how to mark up names in detail.

Also, should we take this occasion to move the Wikidata ID from the key attribute to an idno element too? Then we would have a uniform way of providing identifiers and wouldn't need to code a decision about their hierarchy into the sources.

So the Schiller example given above would then look like this:

<author>
  <persName>
    <surname>Schiller</surname>,
    <forename>Friedrich</forename>
  </persName>
  <idno type="wikidata">Q22670</idno>
  <idno type="pnd">118607626</idno>
</author>

A more complex example could be:

<author>
  <persName>
    <forename>Ramón</forename>
    <forename>María</forename>
    <nameLink>del</nameLink>
    <surname type="main">Valle</surname>
    <surname>Inclán</surname>
  </persName>
  <idno type="wikidata">Q311001</idno>
  <idno type="viaf">197999653</idno>
  <idno type="bne">XX1055436</idno>
</author>

Or a simpler one:

<author>
  <persName>Aristophanes</persName>
  <idno type="wikidata">Q43353</idno>
</author>

ingoboerner commented 3 years ago

+1 for harmonizing the identifiers in <idno>.

As for the names: I like @cmil's approach of having clients (e.g. the API) decide how to display the names and of providing the richest possible material for that, but I'm not sure about the practicability of going into stuff like <nameLink>. In the case of RusDraCor it would instantly raise the question of whether and how one should include the patronymic as well. Also, DraCor is a project about drama and not so much about metadata on authors, so one could argue that this information should be fetched from elsewhere (e.g. Wikidata) if needed. In any case, the ODD + generated schema has to be adapted, I guess...

cmil commented 3 years ago

@ingoboerner Just for clarification: I'm not suggesting that we augment the author data, but that we structure the information we already have more properly, instead of relying on implicit semantics conveyed by commas or the lack thereof. The complex case was an illustration of what would be possible. For practicality I would suggest the following minimum rules for adjusting our markup:

  1. the name is wrapped in a persName element
  2. where possible forename and surname are distinguished

If it makes things easier I would also be fine with something like this:

<author>
  <persName>
    <forename>Ramón María del</forename>
    <surname>Valle Inclán</surname>
  </persName>
  ...
</author>

(in which case of course we would have to fall back to "Valle Inclán" as the short name).

RusDraCor already has the patronymics. I would suggest marking them as a second forename:

<author>
  <persName>
    <surname>Гоголь</surname>,
    <forename>Николай</forename>
    <forename>Васильевич</forename>
  </persName>
  ...
</author>

When it comes to the schema, I would say we should be permissive on the RNG side, possibly allowing anything that's valid tei_all. But we should add a schematron rule that warns about a missing persName and, in case there is more than one word or there is a comma inside a persName, about missing forename/surname elements. Something like "Consider using forename and surname to distinguish the respective author name parts."
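The intended warning logic can be illustrated outside of Schematron. The following Python sketch is not the actual schema rule, only a paraphrase of it; the function name `check_author` is hypothetical, the fragments are parsed without the TEI namespace, and "missing forename/surname" is interpreted here as "neither element is present".

```python
import xml.etree.ElementTree as ET

HINT = ("Consider using forename and surname to distinguish "
        "the respective author name parts.")


def check_author(author: ET.Element) -> list[str]:
    """Return warnings for one <author> element (no namespace assumed)."""
    pers_names = author.findall("persName")
    if not pers_names:
        return ["Missing persName in author."]
    warnings = []
    for pers_name in pers_names:
        # Whitespace-normalized text content of the persName.
        text = " ".join("".join(pers_name.itertext()).split())
        has_parts = (pers_name.find("forename") is not None
                     or pers_name.find("surname") is not None)
        # Warn if the name has a comma or several words but no tagged parts.
        if ("," in text or len(text.split()) > 1) and not has_parts:
            warnings.append(HINT)
    return warnings


untagged = ET.fromstring("<author><persName>Schiller, Friedrich</persName></author>")
print(check_author(untagged))
```

Since the check only warns, a multi-word name that legitimately has no forename/surname parts would still be valid.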

lehkost commented 3 years ago

Thanks for the input! I immediately see two problems. 🙃

  1. Katharina II.
  2. Pen names like "Voltaire" or "Clarín" next to the actual names of authors.

I guess the first item would have to be a plain <persName>Katharina II.</persName>?

I agree with Ingo that we should not concentrate on too sophisticated name encoding in DraCor, since we have rich information anyway thanks to authority files. We should concentrate on what's best for information display on the website and via API.

I would definitely be okay with generally moving the IDs to <idno>. But I still prefer the solution <name type="full"> + <name type="short"> for practicability and just so we don't have to get too deep into naming conventions.

cmil commented 3 years ago

@lehkost Yes, "Katharina II." would be the persName, no surname/forename needed. It would also be the short name version since I don't see any meaningful shorter version. The pen name problem, in my opinion, is not really solved by the simple distinction between a short and a full version. What would be the alternatives for Voltaire? Voltaire/François-Marie Arouet? Then you lose the pen name for the full version, which is probably not what you want. Voltaire/Voltaire? Then you don't need a distinction in the first place.

We now have the situation that author names are not uniformly structured across the corpora. In the API we try to guess how to split up names for sorting, uniform display etc., which doesn't work all that well. This could be the opportunity to fix this. And I think the best solution would be to tag the name parts as suggested in https://tei-c.org/release/doc/tei-p5-doc/en/html/ND.html#NDPER instead of coming up with our own imperfect solution.

What, for instance, would we do with SweDraCor? Changing its forename/surname markup to <name type="full"> + <name type="short"> would mean losing existing information, while just adding our versions would unnecessarily clutter up the markup.

And finally, if we are not dealing with naming conventions in the markup we sooner or later have to in the API code, which IMHO is not the right place to do it.

lehkost commented 3 years ago

All good points, I think we're closer to a solution. But what about pen names, which should be preferred for display because "Voltaire" is much more telling than "Arouet" (though there might be counterexamples)? For pen names, I could see us using <addName>. What do you think?

cmil commented 3 years ago

That's a good question. The TEI documentation calls addName (as well as forename, surname etc.) a "name component". One could argue that a pen name is not a component but another name on its own and therefore should be tagged as a persName by itself. The Grandma Moses example with <persName type="pseudo"> seems to support this.

Maybe we could establish the convention that when there is more than one persName element inside an author, the first one is the canonical one to be displayed in listings, while the other(s) could be displayed as alternatives in a detail view. That would contradict the example on https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-author.html where two distinct authors are represented as different persNames inside the same author. But we are subverting this kind of model by putting idnos inside the author anyway.

lehkost commented 3 years ago

Ok, so we would do this:

<author>
  <persName type="pen">
    Voltaire
  </persName>
  <persName>
    <forename>François-Marie</forename>
    <surname>Arouet</surname>
  </persName>
</author>
<author>
  <persName type="pen">
    Clarín
  </persName>
  <persName>
    <forename>Leopoldo</forename>
    <surname>Alas</surname>
  </persName>
</author>
<author>
  <persName type="nobility">
    Екатерина II
  </persName>
  <persName>
    София Августа Фредерика Ангальт-Цербстская
  </persName>
</author>
<author>
  <persName>
    <forename>Anton</forename>
    <surname>Richter</surname>
  </persName>
  <persName type="pseudo">
    <forename>Ludwig</forename>
    <surname>Stahlpanzer</surname>
  </persName>
</author>

This proposal uses three values for the type attribute: pen, nobility and pseudo (all the special cases I can think of right now, but there might be more 😊). Following your suggestion, we would always prefer the first <persName> for information display (and for the "short" version we could limit it to the surname, if available?).

The fourth example refers to this play, where we resolve the pseudonym used by the author but would keep it as a second <persName>.
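A client following this convention could pick its display names roughly as follows. This is a sketch under assumptions, not the API code: the function `display_names` and its result keys are hypothetical, the fragments carry no TEI namespace, and name parts are assumed to be separated by whitespace in the source.

```python
import xml.etree.ElementTree as ET


def _text(element: ET.Element) -> str:
    """Collapse all text inside an element into one whitespace-normalized string."""
    return " ".join("".join(element.itertext()).split())


def display_names(author: ET.Element) -> dict:
    """Pick full, short and alternative names from an <author> element."""
    pers_names = author.findall("persName")
    # The first persName is canonical; fall back to the author element itself
    # for legacy plain-string markup.
    first = pers_names[0] if pers_names else author
    surname = first.find("surname")
    return {
        "full": _text(first),
        # Short version: the surname if it is tagged, else the whole name.
        "short": _text(surname) if surname is not None else _text(first),
        "alternatives": [_text(p) for p in pers_names[1:]],
    }


voltaire = ET.fromstring("""
<author>
  <persName type="pen">
    Voltaire
  </persName>
  <persName>
    <forename>François-Marie</forename>
    <surname>Arouet</surname>
  </persName>
</author>
""")
print(display_names(voltaire))
```

For the Voltaire example this yields "Voltaire" as both the full and the short name, with "François-Marie Arouet" listed as an alternative.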

IDNOs were left out in the examples above, but would have to be added, of course.

cmil commented 3 years ago

@lehkost this looks good to me.

lehkost commented 3 years ago

Ok then, so I will start with GerDraCor…

cmil commented 3 years ago

Should we maybe also streamline the order of forename and surname? Currently we have surname, forename in GerDraCor and RusDraCor, while SweDraCor and SpanDraCor (without tagging though) have forename surname. I would prefer the latter, "natural" order.

lehkost commented 3 years ago

I agree, will adjust the order of names.
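For the untagged "surname, forename" strings, the reordering could be sketched like this. The helper `natural_order` is hypothetical and only handles plain strings with at most one meaningful comma; it is not the script actually used on the corpora.

```python
def natural_order(name: str) -> str:
    """Turn 'Surname, Forename' into 'Forename Surname'; leave other names alone."""
    if "," not in name:
        return name.strip()
    # Split at the first comma only, then swap the two parts.
    surname, forename = (part.strip() for part in name.split(",", 1))
    return f"{forename} {surname}".strip()


print(natural_order("Schiller, Friedrich"))  # Friedrich Schiller
print(natural_order("Aristophanes"))         # Aristophanes
```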