Revise schema and validation for added and revised fields

rlskoeser commented 4 years ago

Update our schema to reflect these new fields:

~update schema.json in pemm-scripts codebase and add any validation/formatting implied~
[ ] update frictionless data schema in pemm-data repo and check validation

Story Instance sheet

Story Incomplete (boolean)
Blank TM folios (text)
Ethiopic Story Number (text)
Story Variation (text)
Body of story start folio & line (text)
Miracles sequence number (number)

Canonical Story sheet

EMIP Summary (text)
Macomber Keywords (text)
CSM Number (text)
Clavis ID (text)
Translation of Story into English (text)
Translations; formerly English Translation (text)

Manuscript sheet

Hamburg MS ID (text)
Delamarter Manuscript (boolean)
Columns per page (number)
Line range per column (text; add before Lines per Column numeric field)
Lines per column (number)
Characters per line (number)
Latitude (number)
Longitude (number)
Place Recorded/Purchased (text)
Catalog Total Stories Note (text)
Catalog Total Stories (number)
vHMML permalink pending

Story Origin sheet

Town/Country (text)

rlskoeser commented 4 years ago

@elambrinaki thanks for the careful list in our meeting notes; it made it easy for me to add here. Please review to check that I've accurately documented the changes we agreed on. (The recension id is tracked separately on #65.)

elambrinaki commented 4 years ago

@rlskoeser May I please repeat the list here with small changes:

Canonical Story sheet

EMIP Summary (text, no validation)
Hamburg Titles (text, no validation)
Macomber Keywords (text, no validation)
CSM Number (number, less than or equal to 420, help text “Must be a number from 1 to 420.”)
Poncelet Number (number, less than or equal to 1783, help text “Must be a number from 1 to 1783.”)
Clavis ID (validation =regexmatch(to_text(K2), "^CAe [1-9]\d{3}$"), help text “CAe followed by a four-digit number.”)
Translation of Story into English (text, no validation)
Translations; formerly English Translation (text, no validation)

Manuscript sheet

Hamburg MS ID (text, no validation)
Macomber Manuscript (number, either 1 or 0)
Delamarter Manuscript (number, either 1 or 0)
Columns per page (number, validation =regexmatch(to_text(S2), "^[1-9]$"), help text "Must be a digit.")
Line range per column (text, no validation)
Lines per column (number, validation =regexmatch(to_text(T2), "^[1-9]\d?$"), help text "Must be a one or two-digit number.")
Characters per line (number, validation =regexmatch(to_text(T2), "^[1-9]\d?$"), help text "Must be a one or two-digit number.")
Latitude (number, validation =regexmatch(to_text(V2), "^-?(([1-8]\d?(.\d+)?)|9(.\d+)?|90(.0+)?)$"), help text "Must be a valid latitude between -90 and 90°.")
Longitude (number, validation =regexmatch(to_text(W2), "^-?(180(.0+)?|((1[0-7]\d)|([1-9]?\d))(.\d+)?)$"), help text "Must be a valid longitude between -180° and 180°.")
Place Recorded/Purchased (text, no validation)
Catalog (text, no validation)
Catalog Total Stories Note (text, no validation)
Catalog Total Stories (number)

Story Instance sheet

Body of Story Start (text, no validation) -- will be moved (issue #71)
Story Incomplete (boolean)
Blank TM folios (text)
Ethiopic Story Number (text)
Story Variation (text)
Miracles sequence number (number)

Story Origin sheet

Town/Country (text)

Other columns are technical.
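
For anyone who wants to try these rules outside the spreadsheet, here is a minimal Python sketch of three of the patterns and the numeric range check listed above. The patterns are copied from this comment; the helper names (`is_valid`, `in_range`) are made up for illustration and are not part of pemm-scripts. Python's `re` module is not identical to the regex engine behind Google Sheets' `REGEXMATCH`, so treat this as an approximation.

```python
import re

# Patterns copied from the validation rules in the list above.
# re.ASCII keeps \d limited to 0-9.
CLAVIS_ID = re.compile(r"^CAe [1-9]\d{3}$", re.ASCII)
COLUMNS_PER_PAGE = re.compile(r"^[1-9]$", re.ASCII)
ONE_OR_TWO_DIGITS = re.compile(r"^[1-9]\d?$", re.ASCII)


def is_valid(pattern, value):
    """Return True if the cell value, converted to text, matches the pattern."""
    return pattern.fullmatch(str(value)) is not None


def in_range(value, lowest, highest):
    """Whole-number range check, e.g. CSM Number must be from 1 to 420."""
    return isinstance(value, int) and lowest <= value <= highest


if __name__ == "__main__":
    print(is_valid(CLAVIS_ID, "CAe 1234"))   # a well-formed Clavis ID
    print(is_valid(ONE_OR_TWO_DIGITS, 100))  # three digits, so rejected
    print(in_range(421, 1, 420))             # above the CSM maximum
```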

rlskoeser commented 4 years ago

@elambrinaki thanks for the updated list and for including all these details. A few follow-ups:

The Macomber and Delamarter Manuscript fields that you have listed as numeric and 0 or 1: shouldn't they be boolean / checkboxes?
We recently revised the latitude/longitude validation to be numeric instead of regex; does it make sense to keep that?
We could do a numeric range instead of regex for the lines per column and characters per line fields if that helps any (but no need to change if this is working)
Please clarify what you mean by "other columns are technical".

elambrinaki commented 4 years ago

@rlskoeser

Macomber and Delamarter Manuscript fields. You are right, but one question so I understand better. What are the advantages of checkboxes compared to 0/1 dummy apart from checkboxes looking better?
Latitude/longitude validation: makes sense to keep numeric.
With numeric range for lines/columns, wouldn't it be possible to insert fractions? E.g. 5.7365 for the number of lines.
Re technical columns. On the Canonical Story sheet, for example, there are columns Macomber ID Number and Macomber ID Letter. We use them for sorting (because of letters in IDs, IDs are strings and sorting gives 1A/B/C-10-100-1000 order instead of 1A/B/C-2-3-4). A similar thing is on the Story Instance sheet: the folio number (e.g., 11r) is split into a numeric field (Folio Start Number) and a text field (Folio Start Letter) for sorting.
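
The sorting problem described in the last point can be sketched in Python. This is only an illustration of why the number and letter are split into separate columns; the function name `split_id` and the sample IDs are hypothetical, not taken from the spreadsheet.

```python
import re


def split_id(identifier):
    """Split an ID such as '1A' or '11r' into (number, letter)."""
    match = re.fullmatch(r"(\d+)(.*)", identifier)
    if match is None:
        raise ValueError(f"ID does not start with a number: {identifier!r}")
    return int(match.group(1)), match.group(2)


# Hypothetical sample IDs, for illustration only.
ids = ["100", "1A", "10", "2", "1B", "3"]

string_order = sorted(ids)                 # compares character by character
numeric_order = sorted(ids, key=split_id)  # compares number first, then letter

if __name__ == "__main__":
    print(string_order)
    print(numeric_order)
```

Sorting on the pair (number, letter) is what the two technical columns make possible inside the sheet.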

Today we added another column on the Canonical Story sheet, called Earliest Attestation (the header note reads "The earliest century in which the story appeared."). It is an analytical field, and you recommended keeping analytical fields separate, but Wendy needs it in the sheet to have everything on one screen. For this column, we added four technical columns on the Story Instance sheet (Manuscript Date Range Start, Manuscript Date Range End, Story Century, Story Earliest Century). What do you think about it?

rlskoeser commented 4 years ago

@elambrinaki thanks for the response.

On the boolean vs 0/1: to me, a checkbox/boolean is more semantic, and I think it would be more efficient for people entering content in the spreadsheet (whether you're using a mouse or a keyboard, it should be one click or keystroke). I would think it's less error-prone too, but adding validation to require 0 or 1 should handle that. Checkboxes would also be consistent with other fields. However, I did a quick test and it looks like applying a checkbox over the 0s and 1s loses your information (I thought maybe it would translate them, but that was overly optimistic). I think migrating to checkboxes would require converting all your 1s to TRUEs first, so I think we should leave it at this point (but perhaps could clean it up later).
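
The conversion step mentioned here (turning the 1s into TRUEs before switching to checkboxes) could look roughly like the following if done on exported data. This is a sketch under assumptions: the function name `to_checkbox_value` is hypothetical, and it works on plain Python values rather than on the live sheet.

```python
def to_checkbox_value(cell):
    """Map a 0/1 cell to the True/False value a checkbox column expects."""
    if cell is None or cell == "":
        return None  # leave empty cells empty
    text = str(cell).strip()
    if text == "1":
        return True
    if text == "0":
        return False
    # Anything else is unexpected and should be reviewed by hand.
    raise ValueError(f"expected 0 or 1, got {cell!r}")


if __name__ == "__main__":
    column = [1, 0, "1", "", None]
    print([to_checkbox_value(cell) for cell in column])
```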

On the numeric range: you may be right about fractions; I think we set the format to integer, but that might only affect display rather than actually require an integer.
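
The concern about fractions can be shown with a small Python sketch. The function names and the 1 to 99 bounds are assumptions chosen to mirror the "one or two-digit number" help text; this is not how the spreadsheet itself evaluates its rules.

```python
def in_numeric_range(value, lowest=1, highest=99):
    """A plain range check: any number between the bounds passes."""
    return lowest <= value <= highest


def is_whole_number_in_range(value, lowest=1, highest=99):
    """Range check that also rejects fractional values."""
    return float(value).is_integer() and lowest <= value <= highest


if __name__ == "__main__":
    print(in_numeric_range(5.7365))          # passes the plain range check
    print(is_whole_number_in_range(5.7365))  # rejected once whole numbers are required
```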

I ask about the technical columns because we need to decide how to handle them for the automated validation. My preference would be to include them if we can — and show you how to modify the schema that is used for validation when you add or change fields. But if that turns out to be too cumbersome, I think we can configure it to ignore columns that aren't defined — which would allow you to add and remove temporary columns as needed. My hope is that we can work on this together next week in tandem with the field changes.
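
To make the two options concrete, here is a hand-rolled Python sketch of row validation that can either report or ignore columns missing from the schema. The `SCHEMA` dictionary is loosely modeled on a Table Schema descriptor, but this does not use the frictionless library and is not the project's actual validation setup; the `validate_row` function and the sample row are hypothetical.

```python
# A tiny schema in the spirit of a Table Schema descriptor (illustrative only).
SCHEMA = {
    "fields": [
        {"name": "Story Incomplete", "type": "boolean"},
        {"name": "Miracles Sequence Number", "type": "integer"},
        {"name": "Ethiopic Story Number", "type": "string"},
    ]
}

TYPE_CHECKS = {
    "boolean": lambda v: isinstance(v, bool),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def validate_row(row, schema, ignore_undefined=True):
    """Return a list of error messages for one row (a dict of column -> value)."""
    errors = []
    defined = {field["name"]: field for field in schema["fields"]}
    for column, value in row.items():
        field = defined.get(column)
        if field is None:
            if not ignore_undefined:
                errors.append(f"{column}: not defined in schema")
            continue
        if value is None:
            continue  # empty cells are allowed in this sketch
        if not TYPE_CHECKS[field["type"]](value):
            errors.append(f"{column}: expected {field['type']}, got {value!r}")
    return errors


if __name__ == "__main__":
    row = {"Story Incomplete": True, "Miracles Sequence Number": 3, "New Ms": 1}
    print(validate_row(row, SCHEMA))                          # temporary column ignored
    print(validate_row(row, SCHEMA, ignore_undefined=False))  # temporary column reported
```

With `ignore_undefined=True`, temporary columns can come and go without touching the schema; with `False`, every column has to be documented before validation passes.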

elambrinaki commented 4 years ago

Thank you @rlskoeser!

Got it re the boolean, I will change it.

Sounds great re working together on the technical columns next week! A side question: If you trust us to modify the schema with new fields, how do you feel about also giving us rights to change the order of columns?

rlskoeser commented 4 years ago

@elambrinaki great question — yes, we should empower you that way! We should discuss what exactly that looks like — but it's an excellent goal, especially as we wrap up this grant period, and I'm glad you brought it up.

rlskoeser commented 4 years ago

@elambrinaki FWIW, on the story origin sheet — it was our expectation (or at least it was mine) that you would use the town/country/continent as the location name. I've wondered why you're including the region in the name/label for the locations but never asked — was this important for how the catalogers are working with the locations? I would have thought it would be simpler to find locations by city or country and not worry about the region when doing data entry.

(Not that anything needs to be changed now — just curious.)

rlskoeser commented 4 years ago

@elambrinaki If all the Clavis IDs start with CAe you could think of leaving that out and making the field purely numeric. Simpler for data entry and easier to work with (IDK how much you'll need to work with that field).

rlskoeser commented 4 years ago

@elambrinaki I made a test copy of the PEMM spreadsheet so I could check the revised fields and they aren't quite matching up. Here's what I'm noticing, please let me know how you want me to adjust:

manuscript: headers are off after Century Numeric
canonical story: headers are off after earliest attestation and total records
story instance: headers are off after Best Incipit Tool Match; is Miracles sequence number a new column or one you plan to move?

elambrinaki commented 4 years ago

@rlskoeser I assume we would need to redefine the Named Ranges if we change the order of columns.

elambrinaki commented 4 years ago

@rlskoeser Yes, it was a miscommunication on our part. The location name is pulled from the Story Origin sheet to the Origin column on the Canonical Story sheet. But there, Wendy needs the information in the continent/country/town format. The focus is on whether a story is Arabic or European or Ethiopian, and a particular place is of secondary importance. The Origin column on the Story Instance sheet is used not to distinguish stories, but to aggregate them. So we found a workaround by adding a Town/Country column and changing the order of columns in the concatenate function in the Name field so that it has the desired continent/country/town format.

rlskoeser commented 4 years ago

@elambrinaki if we use the google apps script code to apply the schema changes it will update the named ranges. This was one of the things I've been wondering about — it seems that you have made a lot of these changes manually, but not all of them. Is it easier for you to make the changes manually or do you want one last scripted update to reflect the current structure that you want in the spreadsheet? (I should have probably asked this sooner, but didn't really figure out the question until I was most of the way through making the updates. The apps script code revisions are almost done, pending your answers about the column mismatches.)

elambrinaki commented 4 years ago

@rlskoeser Re the Clavis IDs, it seems that the Hamburg team cares about making the term "Clavis Aethiopica (CAe)" well-known and well-understood, so I think it would be better to keep their IDs intact. Their regular IDs (Hamburg IDs) also have text in them (they all begin with "LIT", for instance).

elambrinaki commented 4 years ago

@rlskoeser Re the column headers being off:

**manuscript: headers are off after Century Numeric** The Century Numeric is a new (numeric) column. It is valuable on its own ("Century" is a text field for the dating information, and "Century Numeric" gives a specific number), and it is also used for the new field "Earliest Attestation" on the Canonical Story sheet.

**canonical story: headers are off after earliest attestation and total records** Total Records is a new column I didn't list here. We use it to see whether the database is short on records for specific IDs, and also to assess whether we can trust the value in the Earliest Attestation column (e.g., if we have only one record of a particular ID, from a 20th-century manuscript, it doesn't mean the story was written in the 20th century).

**story instance: headers are off after Best Incipit Tool Match; is Miracles sequence number a new column or one you plan to move?** The column I didn't list here is "New Ms". It is a temporary numeric column used for sorting: we want the mss the catalogers are currently working on to appear at the top of the sheet, with the other mss ordered alphabetically. Also, since we last spoke about columns, we have added four columns at the end of the Story Instance sheet. They are all numeric and analytical, and they are needed for the "Earliest Attestation" column on the Canonical Story sheet.
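
The relationship between the Story Instance records and the two auto-generated Canonical Story columns can be sketched in Python. The spreadsheet formulas themselves are not shown in this thread, so this is an assumption about the logic based on the descriptions above (Earliest Attestation as the earliest century among a story's records, Total Records as their count); the `summarize` function and sample data are hypothetical.

```python
from collections import defaultdict


def summarize(story_instances):
    """Summarize (canonical_story_id, century_numeric or None) pairs per story."""
    centuries = defaultdict(list)
    totals = defaultdict(int)
    for story_id, century in story_instances:
        totals[story_id] += 1
        if century is not None:
            centuries[story_id].append(century)
    return {
        story_id: {
            "total_records": totals[story_id],
            # None when no record of this story has a known century.
            "earliest_attestation": min(centuries[story_id], default=None),
        }
        for story_id in totals
    }


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [("1A", 18), ("1A", 15), ("2", 20), ("3", None)]
    for story_id, summary in summarize(records).items():
        print(story_id, summary)
```

A story with a single 20th-century record would get an earliest attestation of 20 here, which is exactly the case where Total Records signals that the value should not be trusted.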

elambrinaki commented 4 years ago

> @elambrinaki if we use the google apps script code to apply the schema changes it will update the named ranges. This was one of the things I've been wondering about — it seems that you have made a lot of these changes manually, but not all of them. Is it easier for you to make the changes manually or do you want one last scripted update to reflect the current structure that you want in the spreadsheet? (I should have probably asked this sooner, but didn't really figure out the question until I was most of the way through making the updates. The apps script code revisions are almost done, pending your answers about the column mismatches.)

@rlskoeser Are you asking whether it is easier for us to create a column with validation in the Google sheet or update the google apps script code and then run PEMM --> Set up all sheet --> Set up validation? In any case we need to update the code to reflect the changes we made, right?

rlskoeser commented 4 years ago

@elambrinaki thanks for explaining the field mismatches. Should I add the fields you have explained above so that the schema matches the spreadsheet?

@elambrinaki yes, that's what I'm asking — creating columns & adding validation vs updating the apps script code and running the setup and validation steps. We should decide at our next meeting what handing off the spreadsheet looks like; my current (preliminary) thinking is that we would stop making changes with the google apps script code after that point, and make sure you know how to update the data validation schema to reflect any changes you make to the spreadsheet.

elambrinaki commented 4 years ago

@rlskoeser Aha, I now realize that I haven't described one technical column I should have described because it is in the middle of active columns. I am sorry! It is on the Story Instance sheet, called the Best Incipit Tool Match. The correct order of columns after the Notes field:

Best Incipit Tool Match (text)
Story Incomplete (boolean)
Blank TM folios (text)
Ethiopic Story Number (text)
Story Variation (text)
Body of Story Start (text, no validation) -- will be moved (issue #71)
Miracles Sequence Number (number)

All the remaining columns are technical/analytical.

elambrinaki commented 4 years ago

@rlskoeser Yes, please add these fields to the schema.

Your plan to make further changes manually without running the script and then document them in the schema seems perfect to me.

elambrinaki commented 4 years ago

@rlskoeser Earliest Attestation and Total Records on the Canonical Story sheet are auto-generated fields. Does it mean no validation is needed?

elambrinaki commented 4 years ago

Manuscript sheet

[new] MS Status (dropdown: Complete,Irrelevant,Incomplete: almost done,Incomplete: being cataloged,Incomplete: has print catalog,Incomplete: not started,Incomplete: other,Unknown)
[validation changed] Catalog Total Stories (text)
[new] Include in the Analysis (boolean)
[new] Century Numeric (auto-filled, so no validation?; expected output is a number)
[name changed] Link to Digital Copy (was Link)
[new] Total Scans (numeric, validation: number > 0, help text "Must be a positive integer.")
[new] Total TM Paintings (auto-filled, so no validation? expected output is a number)
[new] TM Paintings (dropdown: Yes,No,Unknown)

Notes is the last active column; everything to the right of it doesn't need to be documented.

Canonical Story sheet

[new] Princeton Titles (text)
[new] Earliest Attestation (auto-filled, so no validation?; expected output is a number)
[new] Total Records (auto-filled, so no validation?; expected output is a number)
[new] Total Paintings (auto-filled, so no validation?; expected output is a number)
[name changed] English Translation Link (was Translation of Story into English)
[name changed] English Translation Source (was Translations; formerly English Translation)

English Translation Source is the last important column, please ignore the other two to the right.

Story Instance sheet

[new] Scan Start (validation =regexmatch(to_text(index(B:AAG, row(), column())), "^[1-9]+\d*[ab]?$"), help text: "Scan must be a number optionally followed by "a" or "b".")
[new] Body of Story Start (text)
[new] Scan End (validation =regexmatch(to_text(index(B:AAG, row(), column())), "^[1-9]+\d*[ab]?$"), help text: "Scan must be a number optionally followed by "a" or "b".")
[new] Incipit Full (text)
[new] ID Extension (validation: list from a range Manuscript!A2:A1000, help text: "Manuscript ID for the canonical version of this story variation")
[name changed] Story Word Variation (was Story Variation)
[new] Number of Paintings (dropdown: 0,1,2,3,4,5,6,7,8)
[name changed] Painting Note (was Number of Paintings)

Canonical Incipit is the last permanent column.
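
The dropdown, scan, and positive-number rules in this list can be mirrored outside the spreadsheet with a short Python sketch. The scan pattern and the dropdown options are copied from the lists above; the function names are hypothetical, and Python's `re` only approximates the spreadsheet's `REGEXMATCH`.

```python
import re

# Pattern copied from the Scan Start / Scan End validation above.
SCAN = re.compile(r"^[1-9]+\d*[ab]?$", re.ASCII)

# Dropdown options copied from the MS Status field above.
MS_STATUS = {
    "Complete",
    "Irrelevant",
    "Incomplete: almost done",
    "Incomplete: being cataloged",
    "Incomplete: has print catalog",
    "Incomplete: not started",
    "Incomplete: other",
    "Unknown",
}


def valid_scan(value):
    """Scan must be a number optionally followed by 'a' or 'b'."""
    return SCAN.fullmatch(str(value)) is not None


def valid_choice(value, choices):
    """Dropdown validation: the value must be one of the listed options."""
    return value in choices


def valid_total_scans(value):
    """Total Scans must be a positive integer."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


if __name__ == "__main__":
    print(valid_scan("12a"), valid_scan("012"))
    print(valid_choice("Incomplete: not started", MS_STATUS))
    print(valid_total_scans(0))
```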

elambrinaki commented 4 years ago

@rlskoeser @WendyLBelcher Rebecca, thank you very much for allowing us to restructure our dataset! We are happy with its new look.

kmcelwee commented 4 years ago

@rlskoeser Safe to close?

Princeton-CDH / pemm-scripts

Revise schema and validation for added and revised fields #70
