Open jacobwegner opened 3 years ago
Hi @jacobwegner This is such a large undertaking, I hesitate to offer the first comments.
The working header template is here: https://docs.google.com/document/d/16fZThJUuTwJJKFiJgi1cLW0ONKeVnE3MdqMHVdCcmDM/edit#heading=h.c5otxj8nwtfd
Overall the header consistency is a challenge that extends beyond the credits. We know there is a lot of work to be done on the older file headers. As you note, we have inconsistent labels. Some describe roles, others describe tasks. (The practice of using the dates is not something we want to preserve.)
We need better labels that fully capture the contributions of the students, so we have to settle on the vocabulary. I don't know that we can do much with the older work. There are just some roles/labels there that we no longer use. I think much of the older stuff was boilerplate.
I prefer a lighter and streamlined header, so I would lean against splitting up a description such as "Proofreading and CTS conversion." I also like keeping Greg, Lenny, and Bruce together in their grouping — it makes for less bulk and more readability. If it doesn't work, that's fine.
I presume the file headers should read in order of the desired presentation, if so, that impacts how files are exported from Lace. I do not know if that affects Zenodo display — I believe we added the three top names as part of the bibliographic cleanup so that we would have a consistent set of names to appear at the top of our releases.
Will people be displayed as role/name/org or name/org/role? If it's the former, perhaps we could toggle the roles, so that
@lcerrato Thanks for your feedback and the link to the existing header template–very helpful!
> We need better labels that fully capture the contributions of the students, so we have to settle on the vocabulary. I don't know that we can do much with the older work. There are just some roles/labels there that we no longer use. I think much of the older stuff was boilerplate.
Part of my goal in providing the spreadsheet was to demonstrate some of the variance between labels and headers; I know the "hard" part is building that vocabulary, as you've said, but hopefully the bulk update process helps to simplify updating them once the vocabulary is established.
> I prefer a lighter and streamlined header, so I would lean against splitting up a description such as "Proofreading and CTS conversion." I also like keeping Greg, Lenny, and Bruce together in their grouping — it makes for less bulk and more readability. If it doesn't work, that's fine.
I don't intend to enforce a "Scaife-specific" convention on the headers, and the readability issue is a good one to bring up.
In the Scaife attributions model, we tie each attribution to a person and/or organization.
See for example, tlg0093.tlg005.1st1K-grc1:
If the desire is to have those principals and OGL as the first respStmt, would this be agreeable?
<respStmt>
<resp>Published original versions of the electronic texts</resp>
<orgName ref="https://www.opengreekandlatin.org">
Open Greek and Latin
<persName role="principal">Gregory Crane</persName>
<persName role="principal">Leonard Muellner</persName>
<persName role="principal">Bruce Robertson</persName>
</orgName>
</respStmt>
I think we could account for that type of nested respStmt in our attribution model, and likely even "group" the principals into a single entry.
> I presume the file headers should read in order of the desired presentation, if so, that impacts how files are exported from Lace.
That's correct, but I could also see us making use of the n attribute on a respStmt to control "weight", e.g.:
<respStmt>
<resp>Published original versions of the electronic texts</resp>
<orgName ref="https://www.opengreekandlatin.org">
Open Greek and Latin
<persName role="principal">Gregory Crane</persName>
<persName role="principal">Leonard Muellner</persName>
<persName role="principal">Bruce Robertson</persName>
</orgName>
</respStmt>
<respStmt n="1">
<resp>Proofreading</resp>
<orgName>Mt. Allison University</orgName>
<persName>Kirsten Mason</persName>
</respStmt>
<respStmt n="2">
<resp>CTS conversion</resp>
<orgName>Center for Hellenic Studies</orgName>
<persName>Michael Konieczny</persName>
</respStmt>
(Where a respStmt without n would be sorted to the bottom of those numbered, etc).
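A minimal Python sketch of that sorting rule, assuming plain (un-namespaced) respStmt elements and integer n values. The function name `ordered_resp_stmts` and the wrapping `titleStmt` are illustrative only and are not part of Scaife Viewer:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the respStmt elements above.
SAMPLE = """
<titleStmt>
  <respStmt>
    <resp>Published original versions of the electronic texts</resp>
    <orgName>Open Greek and Latin</orgName>
  </respStmt>
  <respStmt n="1">
    <resp>Proofreading</resp>
    <persName>Kirsten Mason</persName>
  </respStmt>
  <respStmt n="2">
    <resp>CTS conversion</resp>
    <persName>Michael Konieczny</persName>
  </respStmt>
</titleStmt>
"""


def ordered_resp_stmts(parent):
    """Return respStmt children sorted by n; those without n go last, in file order."""

    def sort_key(item):
        position, stmt = item
        n = stmt.get("n", "")
        if n.isdigit():
            # Numbered statements come first, ordered by their n value.
            return (0, int(n), position)
        # Missing or non-numeric n: sort after all numbered statements.
        return (1, 0, position)

    stmts = parent.findall("respStmt")
    return [stmt for _, stmt in sorted(enumerate(stmts), key=sort_key)]


ordered = ordered_resp_stmts(ET.fromstring(SAMPLE))
print([stmt.findtext("resp") for stmt in ordered])
```

With this rule the proofreading and CTS conversion entries are listed before the unnumbered "Published original versions" entry, even though the latter comes first in the file.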
So ideally I'd like to come up with a convention that works for you / OGL to allow the Scaife Viewer environment to extract attributions from the files directly, but in the case of the "principals" header and sorting of proofreading / CTS conversions, I could also see having a repository-level mapping configuration file (much like .zenodo.json) where we could enforce the mapping.
Another way to say this is that we want to place as much control as possible of the display of attributions with the providing data source (so the Scaife dev team is not a bottleneck on updating mappings, etc).
With either the convention or the mapping file, I'd expect to end up with something more like this in Scaife:
(With the near-term goal of being able to click on Kirsten's or Michael's name to see other works they have contributed to)
> I do not know if that affects Zenodo display — I believe we added the three top names as part of the bibliographic cleanup so that we would have a consistent set of names to appear at the top of our releases.
I don't know for certain either, but it looks like .zenodo.json controls the Zenodo display directly.
> Will people be displayed as role/name/org or name/org/role? If it's the former, perhaps we could toggle the roles, so that
I think you may have got cut off here, but currently we display name + org and then role:
@jacobwegner Yes, I didn't finish my thoughts — too many revisions.
I would like other input on this, but I like the middle display and the notion of adding the weighted attribute. My only concern is that it is an easy thing to miss in file review and it wouldn't be in the testing regime.
@lcerrato I'd be happy to change the scope of this bulk update to just cleaning up / standardizing the existing attributions.
If they're standardized, I'd be happy to write a small config file that could handle the "weighting", e.g.:
- resp values like Published original versions of the electronic texts, weight lower
- resp values like Proofreading or Proofreading and CTS conversion or CTS conversion, weight higher
- other resp values, keep the order in the file
That would keep you / other repo maintainers in control of the weighting without having to rely on keeping the attributes up to date.
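As a rough illustration of how such a weighting config could behave, here is a small Python sketch. The `RESP_WEIGHTS` mapping, its numeric values, and the `order_attributions` helper are assumptions made for this example and do not reflect an existing Scaife Viewer format:

```python
# Hypothetical weights: a higher weight is displayed earlier.
# Any resp not listed gets weight 0 and keeps its order from the file.
RESP_WEIGHTS = {
    "Proofreading": 1,
    "Proofreading and CTS conversion": 1,
    "CTS conversion": 1,
    "Published original versions of the electronic texts": -1,
}


def order_attributions(attributions, weights=RESP_WEIGHTS):
    """Sort attributions by descending weight.

    sorted() is stable, so entries with equal weight keep their file order.
    """
    return sorted(attributions, key=lambda attr: -weights.get(attr["resp"], 0))


in_file_order = [
    {"resp": "Published original versions of the electronic texts"},
    {"resp": "Lead Developer"},
    {"resp": "CTS conversion"},
    {"resp": "Proofreading"},
]
print([attr["resp"] for attr in order_attributions(in_file_order)])
```

Because the weights live in one mapping rather than in each XML file, a repo maintainer could change the display order without touching the headers.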
@jacobwegner Thanks. I've brought this to the group and we have some good progress on language and management. I'll update soon.
Hi @jacobwegner! @ThomasK81 @brobertson @lmuellner @LucieSty @mkonieczny9805 @AlisonBabeu @gregorycrane @jtauber So first resolution is that proofreading/ proofreader should read "Digital conversion and editing". Those with this designation should be displayed first.
I believe that others who have CTS conversion or something similar in the line should be "Digital editor" — (not entirely sure we settled on that). This would be the second level of display.
I can get to work spotting other issues.
One thing that @ThomasK81 mentioned is that while the group preferred the middle display example, (with Kirsten Mason first), the OGL itself is in parentheses with the three director names. Note on the last screen shot, the Org name appears first.
Would it be possible to have <orgName ref="https://www.opengreekandlatin.org">Open Greek and Latin</orgName> with a live link at the top of the list while still preserving the rest of the weighted order we propose? Is that practical in this stage or should it wait until we delve further into the header exposure? (We know that sidebar space is precious at this point and that there will need to be a separate jump off for other file info.)
If I/we want to regularize other information, it should be on the first tab, yes? (I saw some minor inconsistencies and started fixing them.) Edit: I see your notes above. I will make notations to my changes and then go from there.
Let me know the next steps.
@jacobwegner Let's strike the "from" and "to" attributes.
(As I mentioned to @lcerrato in chat, outside of the urns and key fields, we can just do the edits directly in the document, so I ended up just deleting the when, to and from columns)
@jacobwegner Thanks! Will resume next week when I'm less distracted.
Should we just use the publication statement to get the OGL info rather than "hack" the OGL into the responsibility statement?
Here's the latest draft of the widget:
resp in TEI terms)
@ahanhardt @brobertson There is a Janan Assaly listed in these files as being affiliated both with CHS and Mount Allison. Which is correct?
@jacobwegner I can't insert any rows. I just need to add some attributions to 5 files. All the same person.
@lcerrato I have just removed the locks so you can add rows.
I hadn't been thinking about adding new entries from the sheet (just updating), but as long as you populate the urns cell for each entry and leave the key blank, that should work.
Happy to hop on a chat or call to discuss further.
@jacobwegner I can add manually if that is best done in the next phase of work. I was actually just wondering if I wanted to do that anyhow in case I spot other things.
@lcerrato: I'm happy to add them now or wait until the first batch is done.
If we do it now (and we're adding a name rather than modifying an existing name), let's use the "New Attributions" worksheet:
@jacobwegner So I've reviewed all of the first spreadsheet and fixed the resp and orgNames. I still have one question from earlier above where someone has two different organizations.
I did not edit the other tabs excepting when I first fixed Greg's name and the new tab you added above.
@lcerrato Great! I'll try and circle back to this tomorrow.
As far as someone having two different organizations, that may be desirable for OGL, but completely optional / allowable to the underlying data model in Scaife Viewer. We'd show that a person affiliated with Org A contributed on Text 1 and then that same person affiliated with Org B on Text 2.
@jacobwegner Thanks, in this case, the two organizations are clearly an error.
@jacobwegner I think the front tab is now correct. It reads as the group wants it to read.
@lcerrato I'll circle back here this week; apologies for the delay.
@lcerrato Apologies that it has taken a while for me to finish this task.
Here's a status update:
Originally, I had hoped to apply the changes made in the spreadsheet "in place" on the existing XML files, updating or appending additional respStmt elements.
Unfortunately, there isn't a good consistent way to do this because:
git blame
respStmt without using a full-blown XML parser, but abandoned that because there were too many inconsistencies in reading / writing each file. Like the lxml issue, this was largely due to differences in whitespace and indentation
Talking with @jtauber, I came up with what I think is a good compromise.
I've created a configuration file that is currently in the scaife-viewer/scaife-viewer repository.
This file has substitutions that map the respStmt elements from the XML through the corrections that were made on the spreadsheet. As Scaife Viewer loads attributions from this repository, it will compare each respStmt to the substitutions list and apply a replacement.
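The actual file format is not shown in this thread, so the following Python sketch only illustrates the general idea of matching an extracted attribution against a substitutions list. The `match` / `replace` structure, the sample values, and the `apply_substitutions` helper are all hypothetical:

```python
# Hypothetical substitutions list: each rule names the values to match
# in an extracted attribution and the corrected values to apply.
SUBSTITUTIONS = [
    {
        "match": {"resp": "proofreader", "orgName": "Mt. Allison University"},
        "replace": {
            "resp": "Digital conversion and editing",
            "orgName": "Mount Allison University",
        },
    },
]


def apply_substitutions(attribution, substitutions=SUBSTITUTIONS):
    """Return a corrected copy of the attribution if a rule matches, else the original."""
    for rule in substitutions:
        matches = all(
            attribution.get(field) == value
            for field, value in rule["match"].items()
        )
        if matches:
            # Overlay the corrected fields; unmatched fields such as persName are kept.
            return {**attribution, **rule["replace"]}
    return attribution


extracted = {
    "resp": "proofreader",
    "orgName": "Mt. Allison University",
    "persName": "Kirsten Mason",
}
print(apply_substitutions(extracted))
```

In this sketch the XML files are never rewritten; the correction is applied only when attributions are loaded, which matches the goal of keeping the Git history small.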
I will soon create a pull request that moves that file to this repo and provides a brief overview of how it works. I think that will help us "clean up" the older attribution data without bloating up the Git history.
I can also document our discussion on this thread on the "preferred" structure of a respStmt for new texts added to the repo going forward.
If there are further tweaks that need to be made to the substitutions list, any repo maintainer here can make changes to the config file, and Scaife Viewer will pick up the changes the next time that this repository is ingested.
The development instance (https://scaife-dev.perseus.org) has these substitutions applied.
When I circle back to move the config file to this repo, I can also ensure we're ingesting the latest release onto the production scaife.perseus.org.
@jacobwegner I can foresee a time where we decide to just integrate changes into the headers gradually (especially since the older headers need other edits). It would always be my preference to enforce the consistency within the files themselves, but realistically, this sounds like a fine solution given the present resources we have.
@jacobwegner I think that the group would like some info on what the headers need to look like so we can move stuff out of limbo. We can also discuss just changing the most recent files manually — since we are concerned about the last couple of years right now rather than the earlier work. Lenny may be getting in touch on how best to move things along without over stressing all of us.
@lcerrato
Just recapping the current options as I see them:
1) Use a configuration file that maps existing respStmt elements to their preferred values
I've added the file in https://github.com/OpenGreekAndLatin/First1KGreek/pull/2329 and, before requesting a merge, will ensure that we've got better documentation on how attributions.promoted and attributions.substitutions are used.
2) Reformat and update headers
I've prepared a demonstration of what the reformatting + headers update would look like in terms of the Git diff: https://github.com/OpenGreekAndLatin/First1KGreek/pull/2328
3) Generate an archive file containing the updated resp stmts to be applied manually
We're using "1)" on the current deployment at https://scaife-dev.perseus.org/.
I think we'd continue to use the config file to "promote" certain roles (which would allow the existing resp "Published original versions of the electronic texts" to keep the structure preferred for Zenodo, etc).
Please let me know if I can help answer further questions or if you'd like me to be part of a call to discuss further.
@jacobwegner Thanks for the update. I don't think I'm the right person for much of this as I don't fully understand the changes — and I'm not sure how best to manage this within the current workflow, such as it is.
Future headers: I don't see any substantive changes to the header itself. I know we discussed reordering things, weighting attributes, consistent presentation, etc., but if we were to create a file today, is there something that needs to change in the header structure or markup itself? I get a sense I am missing something there.
Existing headers: I don't think there is a way around manually applying the consistent role vocabulary to the files. Otherwise, we'd have XML files that have outdated headers. (I guess your point 3) is what is going to assist with that?) The group would like me to work backwards since the most important stuff is the recent student work. Do I have that right? Is that work going to create conflicts in the way the software handles things?
Both of these points have to do with what I can do now and any implications for what you are doing on the SV side. I don't want to resume work using bad headers and I don't want to fix old headers in a way that conflicts with the configuration file.
TL;DR What should I be doing now that would be most useful for you? Should I go ahead and run some new headers or other edits by you?
I've created this issue to track updates to the underlying attribution data that we're now extracting / displaying on scaife.perseus.org
Overview
I've extracted the existing attributions (from respStmt elements) and exported them to a Google Spreadsheet, OGL - First1kGreek Attributions. I can grant access to the appropriate persons within OGL to perform bulk edits to the data.
Once the preferred edits have been made to the spreadsheet, I will use the spreadsheet to bulk update the underlying XML files with the new attribution information and open a pull request.
If this workflow works well, we can do it for other OGL repos (and ideally any other repos contributing texts to scaife.perseus.org)
Desired data model
Here are a few samples of what the updated respStmt elements will look like:
Thibault Clérice, Lead Developer (University of Leipzig) 2015 - 2017
From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/tlg0062/tlg001/tlg0062.tlg001.1st1K-grc1.xml#L28
to:
Notes:
- from and to attrs to denote the timeframe of the resp.
- persName.ref
Simona Stoyanova, Project Manager (University of Leipzig), 2015, Project Assistant (University of Leipzig), 2013-2014
From https://github.com/OpenGreekAndLatin/First1KGreek/blob/master/data/stoa0146d/stoa001/stoa0146d.stoa001.opp-grc1.xml#L47
to:
Notes:
- Split resp elements to a 1:1 relationship between respStmt and resp
- when and from|to attrs denote the resp. timeframe
Gregory Crane, Leonard Muellner, Bruce Robertson, Published original versions of the electronic texts, Open Greek and Latin
From https://github.com/OpenGreekAndLatin/First1KGreek/blob/3f5519b9a01ca4ff5eb56048868e83844e7755ab/data/tlg0093/tlg005/tlg0093.tlg005.1st1K-grc1.xml#L12
to:
Notes:
- Split respStmt containing multiple persName elements to a 1:1 relationship between respStmt and persName.
- orgName in each respStmt
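The restructuring described in these notes could be sketched in Python roughly as follows. This assumes un-namespaced elements and a simplified input shape; the sample XML and the `split_resp_stmt` helper are illustrative and are not the script used for the bulk update:

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical "before" shape: one respStmt holding several persName elements.
BEFORE = """
<respStmt>
  <persName>Gregory Crane</persName>
  <persName>Leonard Muellner</persName>
  <persName>Bruce Robertson</persName>
  <resp>Published original versions of the electronic texts</resp>
  <orgName>Open Greek and Latin</orgName>
</respStmt>
"""


def split_resp_stmt(resp_stmt):
    """Return one respStmt per persName, each carrying copies of resp and orgName."""
    persons = resp_stmt.findall("persName")
    if len(persons) <= 1:
        # Already 1:1 (or no person at all); nothing to split.
        return [resp_stmt]
    shared = [child for child in resp_stmt if child.tag != "persName"]
    result = []
    for person in persons:
        new_stmt = ET.Element("respStmt", dict(resp_stmt.attrib))
        new_stmt.append(copy.deepcopy(person))
        for child in shared:
            new_stmt.append(copy.deepcopy(child))
        result.append(new_stmt)
    return result


split = split_resp_stmt(ET.fromstring(BEFORE))
for stmt in split:
    print(stmt.findtext("persName"), "|", stmt.findtext("resp"), "|", stmt.findtext("orgName"))
```

Real First1KGreek files use the TEI namespace, so tag lookups would need to account for that before anything like this could run against them.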
Implementation
Extraction process
Each row in the attributions-data worksheet corresponds to a set of URNs extracted from the underlying XML files.
There are "key" and "urn" fields which should not be modified and will be used to perform the bulk update.
Editing attribution data in the spreadsheet
I went through and made an initial pass to clean up the data. This involved fixing small typos in organization names, normalizing names (Mt. Allison vs Mount Allison, etc) and restructuring data to fit the desired model (discussed above).
The unique-* worksheets show unique values for the resp, orgName and persName.
Ideally, we can standardize on "Proofreading" vs "proofreader" vs "Proofreading and CTS conversion" as appropriate. If proofreading and CTS conversion are two distinct responsibilities for a given text, I would suggest:
1) Add an additional row beneath "Proofreading and CTS conversion"
2) Edit the original resp to Proofreading
3) Set the resp in the new row to CTS conversion
4) Copy the other relevant fields (orgName and persName) to the new row
5) Leave a comment on the row so I can ensure that the urn and key fields are also populated.
There are also several instances where slight variants in a person's name are used, or resp possibly contains data better suited for orgName.
We should not delete any rows; if there are duplicate rows in the spreadsheet, we'll use the urn and key fields to de-duplicate data.
Bulk update process
Once edits have been finalized in the spreadsheet, I'll use the urn and key fields to map the edits back to the desired data model (see above).
I will also perform a reordering of the desired "proofreading / conversion" role(s) so that they are weighted before any other roles.
I'll open up a PR and link it back to this issue. The PR can be merged and then the updated attributions will be made available on scaife.perseus.org
Closing thoughts
I'm not sure if there is a "template" for future XML files, but I would also be happy to take the examples in Desired data model above and integrate them into that template.
As long as the XML files have respStmt with resp and one of persName or orgName, we can extract attributions for display on scaife.perseus.org.
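To make that minimum requirement concrete, here is a hedged Python sketch of an extraction pass. It is not the Scaife Viewer implementation; `extract_attributions`, `local_name`, and the sample XML are assumptions for illustration. It ignores namespaces by comparing local tag names, and skips any respStmt that lacks a resp or has neither persName nor orgName:

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment; the second respStmt is incomplete on purpose.
SAMPLE = """
<titleStmt xmlns="http://www.tei-c.org/ns/1.0">
  <respStmt>
    <resp>Proofreading</resp>
    <orgName>Mt. Allison University</orgName>
    <persName>Kirsten Mason</persName>
  </respStmt>
  <respStmt>
    <resp>CTS conversion</resp>
  </respStmt>
</titleStmt>
"""


def local_name(element):
    """Strip any '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def texts(stmt, name):
    """Collect the stripped, non-empty leading text of descendants with this local name."""
    return [
        el.text.strip()
        for el in stmt.iter()
        if local_name(el) == name and el.text and el.text.strip()
    ]


def extract_attributions(xml_text):
    """Return attributions for every respStmt that has a resp plus a person or organization."""
    found = []
    for el in ET.fromstring(xml_text).iter():
        if local_name(el) != "respStmt":
            continue
        resps = texts(el, "resp")
        persons = texts(el, "persName")
        orgs = texts(el, "orgName")
        if resps and (persons or orgs):
            found.append({"resp": resps, "persName": persons, "orgName": orgs})
    return found


attributions = extract_attributions(SAMPLE)
print(attributions)
```

Because `texts` searches descendants, a persName nested inside an orgName (as in the grouped principals example discussed in the comments) would still be picked up.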