dracor-org / dracor-api

eXistdb application for dracor.org
MIT License

Inconsistencies in metadata names #186

Closed hsluytergaethje closed 1 year ago

hsluytergaethje commented 1 year ago

The names of columns or dictionary keys are sometimes inconsistent or redundant. Below I listed these cases for the output format of the different API calls:

API call /corpora/{corpusname}

In 'dramas':

Redundancies

Capitalization

Needs clarification

API call /corpora/{corpusname}/metadata

Redundancies

Abbreviations

Patterns

API call /corpora/{corpusname}/metadata/csv

Difference to JSON

API call /corpora/{corpusname}/play/{playname}

Difference to corpus metadata

cmil commented 1 year ago

@hsluytergaethje thanks for this impressive list of oddities. It will be a great help in improving the overall consistency of the DraCor API. I created a few tickets to take care of the issues raised. In some cases, however, what seems like an inconsistency actually has some motivation behind it. Please see the comments below.

Redundancies

  • yearPrinted == printYear
  • yearWritten == writtenYear
  • yearPremiered == premiereYear --> In play/{playname}: yearWritten, yearPremiered, yearPrinted, yearNormalized

The *Year properties are there for backwards compatibility and should be removed at some point: there is now #188.
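For API clients that want to prepare for the removal, here is a minimal Python sketch that drops the legacy aliases from a play record. The record and its values are made up for illustration, and `strip_legacy_years` is a hypothetical helper, not part of dracor-api; only the property names come from the list above.

```python
# Hypothetical play record; the values are invented for illustration.
play = {
    "yearPrinted": "1787", "printYear": "1787",
    "yearWritten": None, "writtenYear": None,
    "yearPremiered": "1784", "premiereYear": "1784",
}

# Pairs named in this issue: (current name, legacy alias).
ALIASES = [
    ("yearPrinted", "printYear"),
    ("yearWritten", "writtenYear"),
    ("yearPremiered", "premiereYear"),
]


def strip_legacy_years(record):
    """Return a copy without the legacy aliases; raise if a pair disagrees."""
    cleaned = dict(record)
    for current, legacy in ALIASES:
        if legacy not in cleaned:
            continue
        if current in cleaned and cleaned[current] != cleaned[legacy]:
            raise ValueError(f"{current} and {legacy} disagree")
        # If only the legacy key exists, carry its value over to the current key.
        cleaned.setdefault(current, cleaned[legacy])
        del cleaned[legacy]
    return cleaned


cleaned = strip_legacy_years(play)
print(sorted(cleaned))
```

Raising on a mismatch is a deliberate choice in this sketch: if the two names ever diverge, silently picking one would hide exactly the kind of inconsistency this issue is about.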

  • all information in 'author' is also in 'authors' - no deprecation warning as for /play/{playname}

I created #187 to take care of it.

Capitalization

  • 'wikidataID' but 'fullname' and 'shortname' (in 'authors')

While it's true that this naming does not look overly consistent, I would suggest not changing these, to avoid introducing breaking changes. In the case of fullname and shortname I would argue that these are perfectly legible (in contrast, for instance, to wikidataid) and can in fact sometimes be found spelled as one word (very much like "filename"). So the introduction of an inner capital would not improve things all that much. I'm aware that this is also a matter of personal taste, so if there is a strong urge to rename these properties, feel free to open an issue.

Needs clarification

  • difference between 'name' and 'fullname' is only the format - not obvious

It is not so much a matter of format as of function. The name property can be used for sorting authors alphabetically, while fullname is meant for display. In HunDraCor, for instance, name and fullname would be the same in most if not all cases. I think that, for backwards compatibility reasons, we refrained from renaming name to something like sortname when we revised the authors property. I suggest sticking to this decision, but I agree that it should be explained wherever we are going to document the JSON output of the API in the future.
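To illustrate the difference in function, here is a small Python sketch. The author entries are invented and assume that name holds a "surname, forename" form; only the property names name and fullname come from the API output discussed here.

```python
# Hypothetical entries of an 'authors' array; the values are made up.
authors = [
    {"name": "Schiller, Friedrich", "fullname": "Friedrich Schiller"},
    {"name": "Goethe, Johann Wolfgang", "fullname": "Johann Wolfgang Goethe"},
    {"name": "Lessing, Gotthold Ephraim", "fullname": "Gotthold Ephraim Lessing"},
]

# Sort on 'name', display 'fullname': the order follows the surnames.
by_sort_name = [a["fullname"] for a in sorted(authors, key=lambda a: a["name"])]

# Sorting on 'fullname' directly orders by forename instead.
by_fullname = sorted(a["fullname"] for a in authors)

print(by_sort_name)
print(by_fullname)
```

When both properties hold the same string, as described for HunDraCor above, the two orderings coincide.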

API call /corpora/{corpusname}/metadata

Redundancies

  • playName == name

These are indeed redundant. I don't remember why both were introduced in 30bf0314bc609095dd701ceb39491b6982022e3f, but playName does not seem to be used anywhere else in dracor-api or in dracor-frontend and could be removed, I guess.

Abbreviations

  • Publisher vs. Pub (Publication)?

    • originalSourcePublisher BUT originalSourcePubPlace

Sometimes I think brevity trumps verbosity, and I find originalSourcePubPlace easier to read and mentally parse than originalSourcePublicationPlace. That's personal taste though and I wouldn't object to changing it considering that it has been introduced only recently and breakage could be limited.

  • Speaker vs. Sp

    • numOfSpeakers BUT wordCountSp

"Speakers" here actually refers to entries in the particDesc, which can be either person or personGrp (which, by the way, should be explained somewhere too). "Sp", on the other hand, refers to the actual sp element. So I wouldn't consider this an inconsistency, but rather two different things named accordingly.

  • Acts vs. L(ines) and P(aragraphs):

    • numOfActs BUT numOfP and numOfL

Same here: acts are encoded as div elements with a certain type attribute while p and l are the actual TEI elements numOfP and numOfL refer to. This difference may not be obvious to the casual API user, but using numOfLines or numOfParagraphs would be even more ambiguous.
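The distinction between these counts can be sketched in Python against a toy TEI fragment. This is not how dracor-api computes its metrics; the fragment is invented, and the value "act" for the type attribute is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Toy TEI document, invented for illustration.
TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc>
      <particDesc>
        <listPerson>
          <person xml:id="a"/>
          <person xml:id="b"/>
          <personGrp xml:id="chorus"/>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="act">
        <sp who="#a"><l>First line</l><l>Second line</l></sp>
        <sp who="#b"><p>Some prose.</p></sp>
      </div>
      <div type="act">
        <sp who="#a"><p>More prose.</p></sp>
        <sp who="#chorus"><l>A sung line</l></sp>
      </div>
    </body>
  </text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(TEI)
partic = root.find(".//tei:particDesc", NS)

counts = {
    # "Speakers": person and personGrp entries declared in particDesc.
    "particDesc entries": len(partic.findall(".//tei:person", NS))
    + len(partic.findall(".//tei:personGrp", NS)),
    # "Sp": actual sp elements in the text.
    "sp elements": len(root.findall(".//tei:sp", NS)),
    # Acts: div elements selected by their type attribute (value assumed).
    "act divs": len(root.findall(".//tei:div[@type='act']", NS)),
    # L and P: the TEI elements l and p themselves.
    "l elements": len(root.findall(".//tei:l", NS)),
    "p elements": len(root.findall(".//tei:p", NS)),
}
print(counts)
```

In this fragment there are three declared speakers but four sp elements, which shows why the two terms cannot simply be merged under one name.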

  • num (in networkdata) vs. number (in metadata)

    • numPersonGroups BUT originalSourceNumberOfPages

Here I don't see any reason for the inconsistency other than that we didn't think about it. So I guess we should change it.

  • average vs. max

    • averageDegree BUT maxdegree

In my opinion max and min are so widely used as abbreviations throughout different programming languages that I would find it almost irritating to see maximum or minimum used instead. The same is not true for avrg though.

Patterns

  • numOf vs. num

    • numEdges
    • numConnectedComponents
    • numOfPersonGroups
    • numOfSpeakers

Are there any suggestions as to which direction we should align this in, @lehkost or @peertrilcke? I don't have a strong preference, but it seems that the numOf* form is already used more often, so maybe that should be the one.
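If we settle on numOf*, the renaming is mechanical. Here is a minimal Python sketch, assuming the numOf* form wins; `to_num_of` is a hypothetical helper and the key list is just the examples from this thread.

```python
import re

# Keys mentioned in this thread, plus one unrelated key as a control.
keys = [
    "numEdges",
    "numConnectedComponents",
    "numOfPersonGroups",
    "numOfSpeakers",
    "averageDegree",
]


def to_num_of(key):
    """Rewrite numFoo to numOfFoo; leave numOfFoo and unrelated keys as they are."""
    # Match a leading 'num' that is followed by a capital letter
    # but not already by 'Of' plus a capital letter.
    return re.sub(r"^num(?!Of[A-Z])(?=[A-Z])", "numOf", key)


renames = {k: to_num_of(k) for k in keys if to_num_of(k) != k}
print(renames)
```

Only the keys that would actually change end up in `renames`, so the result doubles as a list of breaking changes to announce.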

  • adjective or noun first

    • normalizedGenre vs. yearNormalized

For me the difference in name construction has a subtle meaning: there is only one genre per play, it just happens to be normalised (what that means needs explanation though). But there are different kinds of years which is what I usually assume when the noun appears before some kind of adjective.

API call /corpora/{corpusname}/metadata/csv

Difference to JSON

  • numPersonGroups (csv) BUT numOfPersonGroups (json)

This is actually a bug. I would expect the column numPersonGroups in the CSV to never show any values. It should be an easy fix.
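One plausible mechanism, sketched in Python under the assumption that the CSV is built from the same records as the JSON: a column header that matches no key in the data stays empty in every row. The records and the helper names are invented for illustration; this is not the actual dracor-api code.

```python
import csv
import io

# Hypothetical metadata records using the JSON key name.
records = [
    {"name": "play-a", "numOfPersonGroups": 2},
    {"name": "play-b", "numOfPersonGroups": 0},
]

# CSV header that deviates from the JSON key.
columns = ["name", "numPersonGroups"]


def to_csv(rows, fieldnames):
    """Serialize rows to CSV; columns without a matching key stay empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        restval="",             # value written when a row lacks the column
        extrasaction="ignore",  # silently drop keys that are not columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def unmatched_columns(rows, fieldnames):
    """List the columns that no row provides a value for."""
    present = set().union(*(row.keys() for row in rows))
    return [c for c in fieldnames if c not in present]


print(to_csv(records, columns))
print(unmatched_columns(records, columns))
```

A check like `unmatched_columns` could run as a test over the CSV header and the JSON keys to catch this kind of drift early.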

Otherwise in csv output same problems as for names in JSON

API call /corpora/{corpusname}/play/{playname}

Difference to corpus metadata

  • genre BUT normalizedGenre

normalizedGenre was introduced in the context of #130. Apparently we failed to rename the other occurrences of the property. There is now #189.

lehkost commented 1 year ago

@hsluytergaethje @cmil: Thanks so much for taking the time to document and comment on this. Like @cmil, I have no strong preferences for naming schemes in any of the cases mentioned; it should just be consistent.

cmil commented 1 year ago

I created an API consolidation project to keep track of related issues.

cmil commented 1 year ago

All the follow-up issues to this one have been resolved. Feel free to open new issues where inconsistencies continue to be obtrusive.