Princeton-CDH / pemm-scripts

scripts & tools for the Princeton Ethiopian Miracles of Mary project
Apache License 2.0

Revise schema and validation for added and revised fields #70

Closed rlskoeser closed 4 years ago

rlskoeser commented 4 years ago

Update our schema to reflect these new fields:

Story Instance sheet

Canonical Story sheet

rlskoeser commented 4 years ago

@elambrinaki thanks for the careful list in our meeting notes, it made it easy for me to add here. Please review to check that I've accurately documented the changes we agreed on. (The recension id is tracked separately on #65 )

elambrinaki commented 4 years ago

@rlskoeser May I please repeat the list here with small changes:

Canonical Story sheet

Manuscript sheet

Story Instance sheet

Story Origin sheet

Other columns are technical.

rlskoeser commented 4 years ago

@elambrinaki thanks for the updated list and for including all these details. A few follow-ups:

elambrinaki commented 4 years ago

@rlskoeser

Today we added another column on the Canonical Story sheet, called Earliest Attestation (the header note reads "The earliest century in which the story appeared."). It is an analytical field, and you recommended keeping analytical fields separately, but Wendy needs it in the sheet to have everything on one screen. For this column, we added four technical columns on the Story Instance sheet (Manuscript Date Range Start, Manuscript Date Range End, Story Century, Story Earliest Century). What do you think about it?
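The Earliest Attestation idea described above could be sketched roughly as follows. This is only an illustration in Python, not the spreadsheet formula the project uses; the function name `earliest_attestation` and the `(story_id, century)` pair layout are assumptions made for the example, and the sample values are invented.

```python
def earliest_attestation(instances):
    """Return the earliest century seen for each canonical story ID.

    `instances` is an iterable of (story_id, century) pairs. Rows with a
    missing ID or a missing century are skipped. (Hypothetical layout.)
    """
    earliest = {}
    for story_id, century in instances:
        if story_id is None or century is None:
            continue
        if story_id not in earliest or century < earliest[story_id]:
            earliest[story_id] = century
    return earliest


# Invented sample rows: story "A" appears in 18th and 15th century manuscripts.
rows = [("A", 18), ("A", 15), ("B", 20), ("B", None)]
print(earliest_attestation(rows))
```

A story with only one dated record still gets a value here, so the result says nothing about how well attested the story is.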

rlskoeser commented 4 years ago

@elambrinaki thanks for the response.

On the boolean vs 0/1 — to me, a checkbox/boolean is more semantic, and I think it would be more efficient for people entering content in the spreadsheet (whether you're using mouse or keyboard should be one click or keystroke). I would think less error-prone too, but adding validation to require 0 or 1 should handle that. Checkboxes would also be consistent with other fields. However, I did a quick test and it looks like applying a checkbox over the 0s and 1s loses your information (I thought maybe it would translate them, but that was overly optimistic). I think migrating to checkboxes would require converting all your 1s to TRUEs first, so I think we should leave it at this point (but perhaps could clean it up later).
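For reference, the conversion step mentioned above (turning the 1s into TRUEs before applying checkboxes) might look something like this sketch. It works on a plain Python list standing in for one column of exported values; the function name `flags_to_booleans` is hypothetical, and treating blank cells as "no value" is an assumption.

```python
def flags_to_booleans(values):
    """Convert a column of 0/1 flags to booleans, keeping blanks as None.

    Raises ValueError for anything other than 0, 1, or a blank cell, which
    doubles as the "require 0 or 1" validation.
    """
    mapping = {0: False, 1: True, "0": False, "1": True}
    converted = []
    for row_number, value in enumerate(values, start=1):
        if value is None or value == "":
            converted.append(None)  # assumption: blank means not yet entered
        elif value in mapping:
            converted.append(mapping[value])
        else:
            raise ValueError(f"row {row_number}: expected 0 or 1, got {value!r}")
    return converted


print(flags_to_booleans([1, 0, "", "1"]))
```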

On the numeric range: you may be right about fractions; I think we set the format to integer, but that might only impact display not actually require that it be an integer.
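To make the difference between display format and actual validation concrete, here is a small sketch of a check that rejects fractions outright. The function name and the range bounds used in the example are assumptions, not the project's validation rule.

```python
def is_whole_number_in_range(value, minimum, maximum):
    """True only if value is a whole number within [minimum, maximum]."""
    if isinstance(value, bool):
        return False  # booleans are ints in Python, but not valid here
    if isinstance(value, float):
        if not value.is_integer():
            return False  # rejects 15.5, NaN, and infinity
    elif not isinstance(value, int):
        return False  # rejects text such as "15"
    return minimum <= value <= maximum


# Hypothetical bounds for a century column.
print(is_whole_number_in_range(15, 1, 21))
print(is_whole_number_in_range(15.5, 1, 21))
```

A cell formatted to show no decimal places could still hold 15.5; a check like this looks at the stored value rather than how it is displayed.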

I ask about the technical columns because we need to decide how to handle them for the automated validation. My preference would be to include them if we can — and show you how to modify the schema that is used for validation when you add or change fields. But if that turns out to be too cumbersome, I think we can configure it to ignore columns that aren't defined — which would allow you to add and remove temporary columns as needed. My hope is that we can work on this together next week in tandem with the field changes.
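The two options above (define every column in the schema, or ignore columns that aren't defined) could be sketched like this. It is a simplified stand-in for the real validation, with a hypothetical function name and an `ignore_extra` switch invented for the example; the column names are taken from this thread.

```python
def check_columns(headers, schema_fields, ignore_extra=True):
    """Compare sheet headers to schema field names.

    Returns a list of problem messages; an empty list means the headers pass.
    With ignore_extra=True, columns missing from the schema are tolerated,
    so temporary columns can be added and removed freely.
    """
    problems = [
        f"missing column: {name}" for name in schema_fields if name not in headers
    ]
    if not ignore_extra:
        problems += [
            f"undefined column: {name}" for name in headers if name not in schema_fields
        ]
    return problems


headers = ["Notes", "Canonical Incipit", "New Ms"]
schema_fields = ["Notes", "Canonical Incipit"]
print(check_columns(headers, schema_fields))
print(check_columns(headers, schema_fields, ignore_extra=False))
```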

elambrinaki commented 4 years ago

Thank you @rlskoeser!

Got it re the boolean, I will change it.

Sounds great re working together on the technical columns next week! A side question: if you trust us to modify the schema when we add fields, how do you feel about also letting us change the order of columns?

rlskoeser commented 4 years ago

@elambrinaki great question — yes, we should empower you that way! We should discuss what exactly that looks like — but it's an excellent goal, especially as we wrap up this grant period, and I'm glad you brought it up.

rlskoeser commented 4 years ago

@elambrinaki FWIW, on the story origin sheet — it was our expectation (or at least it was mine) that you would use the town/country/continent as the location name. I've wondered why you're including the region in the name/label for the locations but never asked — was this important for how the catalogers are working with the locations? I would have thought it would be simpler to find locations by city or country and not worry about the region when doing data entry.

(Not that anything needs to be changed now — just curious.)

rlskoeser commented 4 years ago

@elambrinaki If all the Clavis IDs start with CAe you could think of leaving that out and making the field purely numeric. Simpler for data entry and easier to work with (IDK how much you'll need to work with that field).

rlskoeser commented 4 years ago

@elambrinaki I made a test copy of the PEMM spreadsheet so I could check the revised fields and they aren't quite matching up. Here's what I'm noticing, please let me know how you want me to adjust:

elambrinaki commented 4 years ago

@rlskoeser I assume we would need to redefine the Named Ranges if we change the order of columns.

elambrinaki commented 4 years ago

@rlskoeser Yes, it was a miscommunication on our part. The location name is pulled from the Story Origin sheet to the Origin column on the Canonical Story sheet. But there, Wendy needs the information in the continent/country/town format. The focus is on whether a story is Arabic or European or Ethiopian, and a particular place is of secondary importance. The Origin column on the Story Instance sheet is used not to distinguish stories, but to aggregate them. So we found a workaround: we added a Town/Country column and reordered the columns in the concatenate function for the Name field so that it produces the desired continent/country/town format.
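The reordered concatenation could be pictured with a sketch like the one below. It is not the actual spreadsheet formula; the function name, the `", "` separator, and the place names in the example are all assumptions for illustration.

```python
def location_name(continent, country, town="", separator=", "):
    """Build a location label in continent/country/town order.

    Empty or whitespace-only parts are skipped, so a location with no town
    still gets a clean label. The separator is an assumed choice.
    """
    parts = [
        part.strip() for part in (continent, country, town) if part and part.strip()
    ]
    return separator.join(parts)


# Invented example values.
print(location_name("Africa", "Ethiopia", "Gondar"))
print(location_name("Europe", "France"))
```

Putting the broadest region first means that sorting or grouping on the label clusters stories by continent, which matches the aggregation use described above.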

rlskoeser commented 4 years ago

@elambrinaki if we use the google apps script code to apply the schema changes it will update the named ranges. This was one of the things I've been wondering about — it seems that you have made a lot of these changes manually, but not all of them. Is it easier for you to make the changes manually or do you want one last scripted update to reflect the current structure that you want in the spreadsheet? (I should have probably asked this sooner, but didn't really figure out the question until I was most of the way through making the updates. The apps script code revisions are almost done, pending your answers about the column mismatches.)

elambrinaki commented 4 years ago

@rlskoeser Re the Clavis IDs: it seems that the Hamburg team cares about making the term "Clavis Aethiopica (CAe)" well-known and well-understood, so I think it would be better to keep their IDs intact. Their regular IDs (Hamburg IDs) also have text in them (they all begin with "LIT", for instance).

elambrinaki commented 4 years ago

@rlskoeser Re the column headers being off:

**manuscript: headers are off after Century Numeric** The Century Numeric is a new (numeric) column. It is valuable on its own ("Century" is a text field for the dating information, and "Century Numeric" gives a specific number), and it is also used for the new field "Earliest Attestation" on the Canonical Story sheet.

**canonical story: headers are off after earliest attestation and total records** Total Records is a new column I didn't list here. We use it to see whether the database is short on the records of specific IDs, and also to assess whether we can trust the value in the Earliest Attestation column (e.g., if we have only one record of a particular ID from a 20th century manuscript, it doesn't mean the story was written in the 20th century).

**story instance: headers are off after Best Incipit Tool Match; is Miracles sequence number a new column or one you plan to move?** The column I didn't list here is "New Ms". It is used for sorting: we want the mss the catalogers are currently working on to appear at the top of the sheet and, at the same time, the other mss to be ordered alphabetically. It is a temporary numeric column. Also, since the time we spoke about columns, we have added four columns at the end of the Story Instance sheet. They are all numeric and analytical. They are needed for the "Earliest Attestation" column on the Canonical Story sheet.
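The sorting that the "New Ms" column supports might be expressed like this. It is a sketch under assumptions: rows are represented as dictionaries with hypothetical `new_ms` and `manuscript` keys, any non-zero `new_ms` value is taken to mean "currently being catalogued", and the manuscript names are invented.

```python
def sort_rows(rows):
    """Put flagged manuscripts first, then order alphabetically.

    Flagged rows are also ordered alphabetically among themselves, which
    may or may not match how the sheet is actually sorted.
    """
    return sorted(
        rows,
        key=lambda row: (0 if row["new_ms"] else 1, row["manuscript"].lower()),
    )


rows = [
    {"manuscript": "MS B", "new_ms": 0},
    {"manuscript": "MS C", "new_ms": 1},
    {"manuscript": "MS A", "new_ms": 0},
]
print([row["manuscript"] for row in sort_rows(rows)])
```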

elambrinaki commented 4 years ago

> @elambrinaki if we use the google apps script code to apply the schema changes it will update the named ranges. This was one of the things I've been wondering about — it seems that you have made a lot of these changes manually, but not all of them. Is it easier for you to make the changes manually or do you want one last scripted update to reflect the current structure that you want in the spreadsheet? (I should have probably asked this sooner, but didn't really figure out the question until I was most of the way through making the updates. The apps script code revisions are almost done, pending your answers about the column mismatches.)

@rlskoeser Are you asking whether it is easier for us to create a column with validation in the Google sheet or update the google apps script code and then run PEMM --> Set up all sheet --> Set up validation? In any case we need to update the code to reflect the changes we made, right?

rlskoeser commented 4 years ago

@elambrinaki thanks for explaining the field mismatches. Should I add the fields you have explained above so that the schema matches the spreadsheet?

@elambrinaki yes, that's what I'm asking — creating columns & adding validation vs updating the apps script code and running the setup and validation steps. We should decide at our next meeting what handing off the spreadsheet looks like; my current (preliminary) thinking is that we would stop making changes with the google apps script code after that point, and make sure you know how to update the data validation schema to reflect any changes you make to the spreadsheet.

elambrinaki commented 4 years ago

@rlskoeser Aha, I now realize that I haven't described one technical column I should have described because it is in the middle of active columns. I am sorry! It is on the Story Instance sheet, called the Best Incipit Tool Match. The correct order of columns after the Notes field:

All the remaining columns are technical/analytical.

elambrinaki commented 4 years ago

@rlskoeser Yes, please add these fields to the schema.

Your plan to make further changes manually without running the script and then document them in the schema seems perfect to me.

elambrinaki commented 4 years ago

@rlskoeser Earliest Attestation and Total Records on the Canonical Story sheet are auto-generated fields. Does it mean no validation is needed?

elambrinaki commented 4 years ago

Manuscript sheet

Notes is the last active column; everything to the right of it doesn't need to be documented.

Canonical Story sheet

English Translation Source is the last important column, please ignore the other two to the right.

Story Instance sheet

Canonical Incipit is the last permanent column.

elambrinaki commented 4 years ago

@rlskoeser @WendyLBelcher Rebecca, thank you very much for allowing us to restructure our dataset! We are happy with its new look.

kmcelwee commented 4 years ago

@rlskoeser Safe to close?