Create initial Frictionless Data schema files and data packages for pemm data files in the data repository

kmcelwee commented 4 years ago

Start with the Frictionless Data Python datapackage library to infer schemas for the existing CSV files in the data repository that we care about, and then see how much cleanup is needed.
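To make "infer schemas" concrete, here is a rough standard-library sketch of the kind of inference involved. It is not the datapackage library's implementation, which handles many more types and formats. The function names `guess_type` and `infer_schema`, the `fieldN` placeholder for blank headers, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def guess_type(values):
    """Guess a Table Schema type name from a column's string values."""
    values = [v for v in values if v.strip()]
    if not values:
        return "string"  # nothing to go on; fall back to the most general type
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for v in values:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_file):
    """Build a Table Schema style descriptor ({"fields": [...]}) from an open CSV file."""
    rows = list(csv.reader(csv_file))
    header, body = rows[0], rows[1:]
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in body if i < len(row)]
        # Assumption: blank headers get a positional placeholder name.
        fields.append({"name": name or f"field{i + 1}", "type": guess_type(column)})
    return {"fields": fields}


# Made-up sample rows; only the header names are borrowed from this thread.
sample = io.StringIO(
    "Macomber ID Number,Latitude,Title from catalog,\n"
    "12,9.03,Miracles,x\n"
    "7,38.7,,y\n"
)
schema = infer_schema(sample)
```

A blank header in the last column ends up with a placeholder name here, which is the sort of thing that would need cleanup before publishing.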

rlskoeser commented 4 years ago

@kmcelwee updated the description with some notes. I'll be interested to see how similar (or not) the frictionless data schema is to the internal json schema file we used to generate the spreadsheet.

kmcelwee commented 4 years ago

Sorry, we should have had the conversation here instead of in https://github.com/Princeton-CDH/pemm-data/pull/2. To summarize that conversation:

The schema.json was created and lightly edited
RSK prefers that we not edit the JSON output by frictionless data
It doesn't need to match the structure in src/schema.json exactly

I've committed and pushed new changes, but I've left the PR as a draft. I think it's safe to say the schema is not in a publishable state. I did run a small script to note the differences between pemm-scripts/src/schema.json and what we have right now.

These sheets have been added: macomber_incipits.csv & sheet2.csv

And here's a summary of the columns that have been added (the numeric entries here are blank columns):

canonical_story.csv
    Macomber Keywords
    CSM Number
    Clavis ID
    Translation of Story into English
    Translations; formerly English Translation
    field14
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    Macomber ID Number
    Macomber ID Letter
manuscript.csv
    vHMML permalink (pending on 04/25/2020)
    Columns per page
    Lines per column
    Characters per line
    Hamburg MS ID
    Latitude
    Longitude
    Place Recorded/Purchased
    Title from catalog
    Total Stories according to catalog
    Number of Paintings according to catalog
    Link to catalog
    Catalog has miracles records
    Can be used for sequence (miracles folio range matches with catalog)
    Mss rebound in disorder or there are breaks in the sequence of TM
story_instance.csv
    Best Incipit Tool Match
    Story Incomplete
    Blank TM folios
    Ethiopic Story Number
    Story Variation
    High Confidence Not IT
    Princeton Catalog Folios
    Princeton Catalog Titles
    Body of story start folio & line
    Macomber Incipit
    (test on whether there are two incipits in the ITool on the same folio)
    Test for whether the incipit is not unique
    New mss (column for sorting)
    Miracles sequence number
    Folio Start Number
    Folio Start Letter
    Temporary English Translation for TGS 1994, to be moved when ID'd
story_origin.csv
    field4
    Town/Country
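For reference, a comparison like the one summarized above could be sketched roughly as follows. This is not the script that was actually run. It assumes the inferred output is a standard Data Package descriptor (`resources` with `path` and `schema.fields`), and that the columns from pemm-scripts/src/schema.json have already been loaded into a plain dict of sheet name to column names. That dict shape, the function names, and the toy data are hypothetical.

```python
def columns_by_sheet(datapackage):
    """Map each resource path in a Data Package descriptor to its field names."""
    return {
        resource["path"]: [field["name"] for field in resource["schema"]["fields"]]
        for resource in datapackage["resources"]
    }


def added_columns(expected, inferred):
    """Return (new sheets, {sheet: columns present in inferred but not in expected})."""
    new_sheets = sorted(set(inferred) - set(expected))
    new_columns = {}
    for sheet, columns in inferred.items():
        if sheet in expected:
            extra = [c for c in columns if c not in expected[sheet]]
            if extra:
                new_columns[sheet] = extra
    return new_sheets, new_columns


# Toy data: "example column" is made up; the other names appear in the summary above.
expected = {"story_origin.csv": ["example column"]}
inferred_package = {
    "resources": [
        {
            "path": "story_origin.csv",
            "schema": {"fields": [
                {"name": "example column"},
                {"name": "field4"},
                {"name": "Town/Country"},
            ]},
        },
        {"path": "sheet2.csv", "schema": {"fields": [{"name": "example column"}]}},
    ]
}

new_sheets, new_columns = added_columns(expected, columns_by_sheet(inferred_package))
```

With real inputs, both descriptors would be read with `json.load` first; matching on exact column names means any renamed column shows up as added.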

kmcelwee commented 4 years ago

I think this was closed by https://github.com/Princeton-CDH/pemm-data/pull/4

rlskoeser commented 4 years ago

Agreed.

Princeton-CDH / pemm-scripts

Create initial Frictionless Data schema files and data packages for pemm data files in the data repository #54