Standardizing attributions for display on scaife.perseus.org

jacobwegner commented 3 years ago

I've created this issue to track updates to the underlying attribution data that we're now extracting / displaying on scaife.perseus.org.

Overview

I've extracted the existing attributions (from respStmt elements) and exported them to a Google Spreadsheet, OGL - First1kGreek Attributions. I can grant access to the appropriate persons within OGL to perform bulk edits to the data.

Once the preferred edits have been made to the spreadsheet, I will use the spreadsheet to bulk update the underlying XML files with the new attribution information and open a pull request.

If this workflow works well, we can do it for other OGL repos (and ideally any other repos contributing texts to scaife.perseus.org).

Desired data model

Here are a few samples of what the updated respStmt elements will look like:

Thibault Clérice, Lead Developer (University of Leipzig) 2015 - 2017

From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg0062/tlg001/tlg0062.tlg001.1st1K-grc1.xml#L28

to:

<respStmt>
  <resp from="2015" to="2017">Lead Developer</resp>
  <persName ref="https://orcid.org/0000-0003-1852-9204">Thibault Clérice</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>

Notes:

We make use of from and to attrs to denote the timeframe of the resp.
We set a person's ORCID in persName.ref
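A minimal sketch of how a respStmt in this shape could be read into a flat record, assuming the un-namespaced markup shown in the sample (real TEI files usually declare the TEI namespace, which would need to be added to the lookups). The function name `read_attribution` and the dictionary keys are hypothetical, not part of Scaife Viewer.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<respStmt>
  <resp from="2015" to="2017">Lead Developer</resp>
  <persName ref="https://orcid.org/0000-0003-1852-9204">Thibault Clérice</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
"""


def read_attribution(resp_stmt):
    """Return one respStmt as a flat dict (hypothetical shape for illustration)."""
    resp = resp_stmt.find("resp")
    person = resp_stmt.find("persName")
    org = resp_stmt.find("orgName")
    return {
        "role": resp.text.strip(),
        # The timeframe lives on resp as from/to attributes.
        "from": resp.get("from"),
        "to": resp.get("to"),
        "person": person.text.strip() if person is not None else None,
        # The ORCID is carried by persName/@ref.
        "orcid": person.get("ref") if person is not None else None,
        "organization": org.text.strip() if org is not None else None,
    }


attribution = read_attribution(ET.fromstring(SAMPLE))
```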

Simona Stoyanova, Project Manager (University of Leipzig), 2015, Project Assistant (University of Leipzig), 2013-2014

From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/stoa0146d/stoa001/stoa0146d.stoa001.opp-grc1.xml#L47

to:

<respStmt>
  <resp when="2015">Project Manager</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>
<respStmt>
  <resp from="2013" to="2014">Project Assistant</resp>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
</respStmt>

Notes:

We move from a single respStmt containing two resp elements to a 1:1 relationship between respStmt and resp
when and from|to attrs denote the resp. timeframe
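The move to a 1:1 relationship could be sketched roughly as follows. The combined input is an assumption about what a single respStmt with two resp elements might look like (the linked file is not reproduced here), and `split_by_resp` is a hypothetical helper name.

```python
import copy
import xml.etree.ElementTree as ET

# Assumed shape of the "before" markup: one respStmt, two resp elements.
COMBINED = """
<respStmt>
  <persName>Simona Stoyanova</persName>
  <orgName>University of Leipzig</orgName>
  <resp when="2015">Project Manager</resp>
  <resp from="2013" to="2014">Project Assistant</resp>
</respStmt>
"""


def split_by_resp(resp_stmt):
    """Return one new respStmt per resp, each with copies of the other children."""
    others = [child for child in resp_stmt if child.tag != "resp"]
    result = []
    for resp in resp_stmt.findall("resp"):
        new_stmt = ET.Element("respStmt")
        new_stmt.append(copy.deepcopy(resp))
        for child in others:
            new_stmt.append(copy.deepcopy(child))
        result.append(new_stmt)
    return result


statements = split_by_resp(ET.fromstring(COMBINED))
```

Each resulting respStmt keeps the when or from|to attributes of its own resp, so the timeframe stays attached to the right role.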

Gregory Crane, Leonard Muellner, Bruce Robertson, Published original versions of the electronic texts, Open Greek and Latin

From https://github.com/OpenGreekAndLatin/First1KGreek/blob/3f5519b9a01ca4ff5eb56048868e83844e7755ab/data/tlg0093/tlg005/tlg0093.tlg005.1st1K-grc1.xml#L12

to:

<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Gregory Crane</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Leonard Muellner</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>
<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <persName role="principal">Bruce Robertson</persName>
  <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName>
</respStmt>

Notes:

We move from a single respStmt containing multiple persName elements to a 1:1 relationship between respStmt and persName.
We also include orgName in each respStmt

Implementation

Extraction process

Each row in the attributions-data worksheet corresponds to a set of URNs extracted from the underlying XML files.

There are "key" and "urn" fields which should not be modified and will be used to perform the bulk update.

Editing attribution data in the spreadsheet

I went through and made an initial pass to clean up the data. This involved fixing small typos in organization names, normalizing names (Mt. Allison vs Mount Allison, etc.) and restructuring data to fit the desired model (described above under Desired data model).

The unique-* worksheets show unique values for the resp, orgName and persName.

Ideally, we can standardize on "Proofreading" vs "proofreader" vs "Proofreading and CTS conversion" as appropriate. If proofreading and CTS conversion are two distinct responsibilities for a given text, I would suggest:

1) Add an additional row beneath "Proofreading and CTS conversion"

2) Edit the original resp to Proofreading

3) Set the resp in the new row to CTS conversion

4) Copy the other relevant fields (resp, orgName and persName) to the new row

5) Leave a comment on the row so I can ensure that the urn and key fields are also populated.
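As a rough illustration of the steps above, a spreadsheet row could be modelled as a dictionary. The column names, the sample values and the `split_combined_row` helper are assumptions made for this sketch; the actual worksheet layout may differ.

```python
def split_combined_row(row, first="Proofreading", second="CTS conversion"):
    """Split one combined-responsibility row into two rows.

    The original row keeps its key and urn; the new row copies the person and
    organization, and is flagged so that key and urn can be filled in later.
    """
    original = dict(row, resp=first)
    added = dict(
        row,
        resp=second,
        key="",
        urn="",
        comment="new row: populate urn and key",
    )
    return [original, added]


# Hypothetical row values, for illustration only.
row = {
    "key": "example-key",
    "urn": "example-urn",
    "resp": "Proofreading and CTS conversion",
    "persName": "Example Person",
    "orgName": "Example Organization",
}
rows = split_combined_row(row)
```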

There are also several instances where slight variants in a person's name are used, or where resp possibly contains data better suited for orgName.

We should not delete any rows; if there are duplicate rows in the spreadsheet, we'll use the urn and key fields to de-duplicate data.

Bulk update process

Once edits have been finalized in the spreadsheet, I'll use the urn and key fields to map the edits back to the desired data model (see above).

I will also perform a reordering of the desired "proofreading / conversion" role(s) so that they are weighted before any other roles.

I'll open up a PR and link it back to this issue. The PR can be merged and then the updated attributions will be made available on scaife.perseus.org

Closing thoughts

I'm not sure if there is a "template" for future XML files, but I would also be happy to take the examples in Desired data model above and integrate them into that template.

As long as the XML files have respStmt with resp and one of persName or orgName, we can extract attributions for display on scaife.perseus.org.
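That minimum requirement could be checked with something like the sketch below. It assumes un-namespaced elements as in the samples in this issue, and `is_extractable` is a hypothetical name rather than a Scaife Viewer function.

```python
import xml.etree.ElementTree as ET


def is_extractable(resp_stmt):
    """True when a respStmt has a resp plus a persName or an orgName child."""
    has_resp = resp_stmt.find("resp") is not None
    has_name = (
        resp_stmt.find("persName") is not None
        or resp_stmt.find("orgName") is not None
    )
    return has_resp and has_name


complete = ET.fromstring(
    "<respStmt><resp>Proofreading</resp><persName>Kirsten Mason</persName></respStmt>"
)
missing_name = ET.fromstring("<respStmt><resp>Proofreading</resp></respStmt>")
```

Note that `find` only looks at direct children, so a persName nested inside an orgName would be covered by the orgName check rather than found on its own.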

lcerrato commented 3 years ago

Hi @jacobwegner This is such a large undertaking, I hesitate to offer the first comments.

The working header template is here: https://docs.google.com/document/d/16fZThJUuTwJJKFiJgi1cLW0ONKeVnE3MdqMHVdCcmDM/edit#heading=h.c5otxj8nwtfd

Overall the header consistency is a challenge that extends beyond the credits. We know there is a lot of work to be done on the older file headers. As you note, we have inconsistent labels. Some describe roles, others describe tasks. (The practice of using the dates is not something we want to preserve.)

We need better labels that fully capture the contributions of the students, so we have to settle on the vocabulary. I don't know that we can do much with the older work. There are just some roles/labels there that we no longer use. I think much of the older stuff was boilerplate.

I prefer a lighter and streamlined header, so I would lean against splitting up a description such as "Proofreading and CTS conversion." I also like keeping Greg, Lenny, and Bruce together in their grouping — it makes for less bulk and more readability. If it doesn't work, that's fine.

I presume the file headers should read in order of the desired presentation, if so, that impacts how files are exported from Lace. I do not know if that affects Zenodo display — I believe we added the three top names as part of the bibliographic cleanup so that we would have a consistent set of names to appear at the top of our releases.

Will people be displayed as role/name/org or name/org/role? If it's the former, perhaps we could toggle the roles, so that

jacobwegner commented 3 years ago

@lcerrato Thanks for your feedback and the link to the existing header template–very helpful!

We need better labels that fully capture the contributions of the students, so we have to settle on the vocabulary. I don't know that we can do much with the older work. There are just some roles/labels there that we no longer use. I think much of the older stuff was boilerplate.

Part of my goal in providing the spreadsheet was to demonstrate some of the variance between labels and headers; I know the "hard" part is building that vocabulary, as you've said, but hopefully the bulk update process helps to simplify updating them once the vocabulary is established.

I prefer a lighter and streamlined header, so I would lean against splitting up a description such as "Proofreading and CTS conversion." I also like keeping Greg, Lenny, and Bruce together in their grouping — it makes for less bulk and more readability. If it doesn't work, that's fine.

I don't intend to enforce a "Scaife-specific" convention on the headers, and the readability issue is a good one to bring up.

In the Scaife attributions model, we tie each attribution to a person and/or organization.

See for example, tlg0093.tlg005.1st1K-grc1:

If the desire is to have those principals and OGL as the first respStmt, would this be agreeable?

<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <orgName ref="https://www.opengreekandlatin.org">
    Open Greek and Latin
    <persName role="principal">Gregory Crane</persName>
    <persName role="principal">Leonard Muellner</persName>
    <persName role="principal">Bruce Robertson</persName>
  </orgName>
</respStmt>

I think we could account for that type of nested respStmt in our attribution model, and likely even "group" the principals into a single entry.

I presume the file headers should read in order of the desired presentation, if so, that impacts how files are exported from Lace.

That's correct, but I could also see us making use of the n attribute on a respStmt to control "weight", e.g.:

<respStmt>
  <resp>Published original versions of the electronic texts</resp>
  <orgName ref="https://www.opengreekandlatin.org">
    Open Greek and Latin
    <persName role="principal">Gregory Crane</persName>
    <persName role="principal">Leonard Muellner</persName>
    <persName role="principal">Bruce Robertson</persName>
  </orgName>
</respStmt>
<respStmt n="1">
  <resp>Proofreading</resp>
  <orgName>Mt. Allison University</orgName>
  <persName>Kirsten Mason</persName>
</respStmt>
<respStmt n="2">
  <resp>CTS conversion</resp>
  <orgName>Center for Hellenic Studies</orgName>
  <persName>Michael Konieczny</persName>
</respStmt>

(Where a respStmt without n would be sorted to the bottom of those numbered, etc).
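A minimal sketch of that ordering rule, assuming that n always holds an integer and that the respStmt elements sit under a common parent (a titleStmt wrapper is used here only to make the snippet parseable). This is not the Scaife Viewer implementation.

```python
import xml.etree.ElementTree as ET

HEADER = """
<titleStmt>
  <respStmt>
    <resp>Published original versions of the electronic texts</resp>
    <orgName>Open Greek and Latin</orgName>
  </respStmt>
  <respStmt n="2">
    <resp>CTS conversion</resp>
    <persName>Michael Konieczny</persName>
  </respStmt>
  <respStmt n="1">
    <resp>Proofreading</resp>
    <persName>Kirsten Mason</persName>
  </respStmt>
</titleStmt>
"""


def sort_key(resp_stmt):
    """Numbered statements first (by n); unnumbered ones after, in file order."""
    n = resp_stmt.get("n")
    # Assumption: n is an integer string; anything else would raise ValueError.
    return (0, int(n)) if n is not None else (1, 0)


ordered = sorted(ET.fromstring(HEADER).findall("respStmt"), key=sort_key)
roles = [stmt.find("resp").text for stmt in ordered]
```

Because Python's sort is stable, statements without n keep their relative order from the file.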

So ideally I'd like to come up with a convention that works for you / OGL to allow the Scaife Viewer environment to extract attributions from the files directly, but in the case of the "principals" header and sorting of proofreading / CTS conversions, I could also see having a repository-level mapping configuration file (much like .zenodo.json) where we could enforce the mapping.

Another way to say this is that we want to place as much control as possible of the display of attributions with the providing data source (so the Scaife dev team is not a bottleneck on updating mappings, etc).

With either the convention or the mapping file, I'd expect to end up with something more like this in Scaife:

(With the near-term goal of being able to click on Kirsten's or Michael's name to see other works they have contributed to)

I do not know if that affects Zenodo display — I believe we added the three top names as part of the bibliographic cleanup so that we would have a consistent set of names to appear at the top of our releases.

I don't know for certain either, but it looks like .zenodo.json controls the Zenodo display directly.

Will people be displayed as role/name/org or name/org/role? If it's the former, perhaps we could toggle the roles, so that

I think you may have got cut off here, but currently we display name + org and then role:

lcerrato commented 3 years ago

@jacobwegner Yes, I didn't finish my thoughts — too many revisions.

I would like other input on this but I like the middle display and the notion of adding the weighted attribute. My only concern on that would be that it is an easy thing to miss in file review and it wouldn't be in the testing regime.

jacobwegner commented 3 years ago

@lcerrato I'd be happy to change the scope of this bulk update to just cleaning up / standardizing the existing attributions.

If they're standardized, I'd be happy to write a small config file that could handle the "weighting" (e.g.):

For resp values like Published original versions of the electronic texts, weight lower
For resp values like Proofreading or Proofreading and CTS conversion or CTS conversion, weight higher
For other resp values, keep the order in the file

That would keep you / other repo maintainers in control of the weighting without having to rely on keeping the attributes up to date.
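The weighting rules listed above could be sketched like this. The set names and the numeric weights are assumptions for illustration; the real config file format is not shown in this thread.

```python
# Hypothetical weighting rules, taken from the three bullet points above.
HIGHER = {"Proofreading", "Proofreading and CTS conversion", "CTS conversion"}
LOWER = {"Published original versions of the electronic texts"}


def weight(resp):
    """Smaller numbers are displayed first."""
    if resp in HIGHER:
        return 0
    if resp in LOWER:
        return 2
    return 1


def order_roles(resps):
    """Stable sort: roles with the same weight keep their order from the file."""
    return sorted(resps, key=weight)


ordered_roles = order_roles(
    [
        "Published original versions of the electronic texts",
        "Lead Developer",
        "CTS conversion",
        "Project Manager",
        "Proofreading",
    ]
)
```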

lcerrato commented 3 years ago

@jacobwegner Thanks. I've brought this to the group and we have some good progress on language and management. I'll update soon.

lcerrato commented 3 years ago

Hi @jacobwegner! @ThomasK81 @brobertson @lmuellner @LucieSty @mkonieczny9805 @AlisonBabeu @gregorycrane @jtauber So the first resolution is that proofreading / proofreader should read "Digital conversion and editing". Those with this designation should be displayed first.

I believe that others who have CTS conversion or something similar in the line should be "Digital editor" — (not entirely sure we settled on that). This would be the second level of display.

I can get to work spotting other issues.

One thing that @ThomasK81 mentioned is that while the group preferred the middle display example, (with Kirsten Mason first), the OGL itself is in parentheses with the three director names. Note on the last screen shot, the Org name appears first.

Would it be possible to have <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName> with a live link at the top of the list while still preserving the rest of the weighted order we propose? Is that practical in this stage or should it wait until we delve further into the header exposure? (We know that sidebar space is precious at this point and that there will need to be a separate jump off for other file info.)

If I/we want to regularize other information, it should be on the first tab, yes? (I saw some minor inconsistencies and started fixing them.) Edit: I see your notes above. I will make notations to my changes and then go from there.

Let me know the next steps.

lcerrato commented 3 years ago

@jacobwegner Let's strike the "from" and "to" attributes.

jacobwegner commented 3 years ago

(As I mentioned to @lcerrato in chat, outside of the urns and key fields, we can just do the edits directly in the document, so I ended up just deleting the when, to and from columns)

lcerrato commented 3 years ago

@jacobwegner Thanks! Will resume next week when I'm less distracted.

jtauber commented 3 years ago

Should we just use the publication statement to get the OGL info rather than "hack" the OGL into the responsibility statement?

jacobwegner commented 3 years ago

Here's the latest draft of the widget:

Publication statement is extracted and used as the first entry
We regroup names underneath each "role" (resp in TEI terms)
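The regrouping step could look roughly like the sketch below, with placeholder names standing in for real contributors. `group_by_role` is a hypothetical helper, not the widget's actual code.

```python
def group_by_role(attributions):
    """Group (role, name) pairs under each role, keeping first-seen order."""
    grouped = {}
    for role, name in attributions:
        grouped.setdefault(role, []).append(name)
    return grouped


# Placeholder data for illustration only.
grouped = group_by_role(
    [
        ("Digital conversion and editing", "Person A"),
        ("Digital editor", "Person B"),
        ("Digital conversion and editing", "Person C"),
    ]
)
```

Since dictionaries preserve insertion order, the roles come out in the order they first appear, which lets any earlier weighting or sorting carry through to the display.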

lcerrato commented 3 years ago

@ahanhardt @brobertson There is a Janan Assaly listed in these files as being affiliated both with CHS and Mount Allison. Which is correct?

lcerrato commented 3 years ago

@jacobwegner I can't insert any rows. I just need to add some attributions to 5 files. All the same person.

jacobwegner commented 3 years ago

@lcerrato I have just removed the locks so you can add rows.

I hadn't been thinking about adding new entries from the sheet (just updating), but as long as you populate the urns cell for each entry and leave the key blank, that should work.

Happy to hop on a chat or call to discuss further.

lcerrato commented 3 years ago

@jacobwegner I can add manually if that is best done in the next phase of work. I was actually just wondering if I wanted to do that anyhow in case I spot other things.

jacobwegner commented 3 years ago

@lcerrato: I'm happy to add them now or wait until the first batch is done.

If we do it now (and we're adding a name rather than modifying an existing name), let's use the "New Attributions" worksheet:

lcerrato commented 3 years ago

@jacobwegner So I've reviewed all of the first spreadsheet and fixed the resp and orgNames. I still have one question from earlier above where someone has two different organizations.

I did not edit the other tabs excepting when I first fixed Greg's name and the new tab you added above.

jacobwegner commented 3 years ago

@lcerrato Great! I'll try and circle back to this tomorrow.

As far as someone having two different organizations, that may be desirable for OGL, but completely optional / allowable to the underlying data model in Scaife Viewer. We'd show that a person affiliated with Org A contributed on Text 1 and then that same person affiliated with Org B on Text 2.

lcerrato commented 3 years ago

@jacobwegner Thanks, in this case, the two organizations are clearly an error.

lcerrato commented 3 years ago

@jacobwegner I think the front tab is now correct. It reads as the group wants it to read.

jacobwegner commented 3 years ago

@lcerrato I'll circle back here this week; apologies for the delay.

jacobwegner commented 3 years ago

@lcerrato Apologies that it has taken a while for me to finish this task.

Here's a status update:

Originally, I had hoped to apply the changes made in the spreadsheet "in place" on the existing XML files, updating or appending additional respStmt elements.

Unfortunately, there isn't a good consistent way to do this because:

If I were to use something like lxml, it would apply its own whitespace and indentation formatting across every impacted file. That'd generate an extremely large Git diff, and a lot of "noise" if someone needed to look at the history of a file using git blame
I started to write a small script to just evaluate / replace each individual respStmt without using a full-blown XML parser, but abandoned that because there were too many inconsistencies in reading / writing each file. Like the lxml issue, this was largely due to differences in whitespace and indentation
I gave a final go at manually updating some of the statements using find and replace in a text editor, but I found I was either introducing more inconsistencies into the file or skipping corrections that should have been made as listed in the spreadsheet.
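The parser round-trip problem can be illustrated with a small sketch. It uses the standard library's ElementTree as a stand-in rather than lxml, so the exact differences will not match what lxml produces, but the general effect is the same: parsing and re-serializing does not reproduce the original bytes.

```python
import xml.etree.ElementTree as ET

# Single-quoted attribute and a compact empty element, as a file might contain.
ORIGINAL = "<respStmt><resp when='2015'>Project Manager</resp><persName/></respStmt>"

# Parse and serialize again without changing anything.
round_tripped = ET.tostring(ET.fromstring(ORIGINAL), encoding="unicode")

# The content is the same, but the text of the file is not.
changed = round_tripped != ORIGINAL
```

Applied across every file in the repository, differences like these would show up in the Git diff even on lines where no attribution changed.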

Talking with @jtauber, I came up with what I think is a good compromise.

I've created a configuration file that is currently in the scaife-viewer/scaife-viewer repository.

This file has substitutions that map the respStmt elements from the XML through the corrections that were made on the spreadsheet. As Scaife Viewer loads attributions from this repository, it will compare each respStmt to the substitutions list and apply a replacement.
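A rough sketch of how such a substitutions list might be applied at load time. The match/replace structure, the field names and the example rules are all assumptions for illustration (which spelling of Mount Allison is preferred is also an assumption); the actual config file format is not shown in this thread.

```python
# Hypothetical rules: each one matches on some fields and replaces others.
SUBSTITUTIONS = [
    {
        "match": {"resp": "proofreader"},
        "replace": {"resp": "Digital conversion and editing"},
    },
    {
        "match": {"orgName": "Mt. Allison University"},
        "replace": {"orgName": "Mount Allison University"},
    },
]


def apply_substitutions(attribution, substitutions):
    """Return a copy of the attribution with every matching rule applied."""
    result = dict(attribution)
    for rule in substitutions:
        if all(result.get(field) == value for field, value in rule["match"].items()):
            result.update(rule["replace"])
    return result


cleaned = apply_substitutions(
    {
        "resp": "proofreader",
        "persName": "Kirsten Mason",
        "orgName": "Mt. Allison University",
    },
    SUBSTITUTIONS,
)
```

Keeping the rules as data rather than code is what would let a repo maintainer adjust them without a change on the Scaife Viewer side.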

I will soon create a pull request that moves that file to this repo and provides a brief overview of how it works. I think that will help us "clean up" the older attribution data without bloating up the Git history.

I can also document our discussion on this thread on the "preferred" structure of a respStmt for new texts added to the repo going forward.

If there are further tweaks that need to be made to the substitutions list, any repo maintainer here can make changes to the config file, and Scaife Viewer will pick up the changes the next time that this repository is ingested.

The development instance (https://scaife-dev.perseus.org) has these substitutions applied.

When I circle back to move the config file to this repo, I can also ensure we're ingesting the latest release onto the production scaife.perseus.org.

lcerrato commented 3 years ago

@jacobwegner I can foresee a time where we decide to just integrate changes into the headers gradually (especially since the older headers need other edits). It would always be my preference to enforce the consistency within the files themselves, but realistically, this sounds like a fine solution given the present resources we have.

lcerrato commented 3 years ago

@jacobwegner I think that the group would like some info on what the headers need to look like so we can move stuff out of limbo. We can also discuss just changing the most recent files manually — since we are concerned about the last couple of years right now rather than the earlier work. Lenny may be getting in touch on how best to move things along without over stressing all of us.

jacobwegner commented 3 years ago

@lcerrato

Just recapping the current options as I see them:

1) Use a configuration file that maps existing respStmt elements to their preferred values. I've added the file in https://github.com/OpenGreekAndLatin/First1KGreek/pull/2329 and, before requesting a merge, will ensure that we've got better documentation on how attributions.promoted and attributions.substitutions are used.

2) Reformat and update headers. I've prepared a demonstration of what the reformatting + headers update would look like in terms of the Git diff:

https://github.com/OpenGreekAndLatin/First1KGreek/pull/2328

3) Generate an archive file containing the updated respStmt elements to be applied manually

We're using "1)" on the current deployment at https://scaife-dev.perseus.org/.

I think we'd continue to use the config file to "promote" certain roles (which would allow the existing resp "Published original versions of the electronic texts" to stay structured as preferred for Zenodo, etc.).

Please let me know if I can help answer further questions or if you'd like me to be part of a call to discuss further.

lcerrato commented 3 years ago

@jacobwegner Thanks for the update. I don't think I'm the right person for much of this as I don't fully understand the changes — and I'm not sure how best to manage this within the current workflow, such as it is.

Future headers: I don't see any substantive changes to the header itself. I know we discussed reordering things, weighting attributes, consistent presentation, etc., but if we were to create a file today, is there something that needs to change in the header structure or markup itself? I get a sense I am missing something there.

Existing headers: I don't think there is a way around manually applying the consistent role vocabulary to the files. Otherwise, we'd have xml files that have outdated headers. (I guess your point 3) is what is going to assist with that?) The group would like me to work backwards since the most important stuff is the recent student work. Do I have that right? Is that work going to create conflicts in the way the software handles things?

Both of these points have to do with what I can do now and any implications for what you are doing on the SV side. I don't want to resume work using bad headers and I don't want to fix old headers in a way that conflicts with the configuration file.

TL;DR What should I be doing now that would be most useful for you? Should I go ahead and run some new headers or other edits by you?
