AjaxMultiCommentary / ajmc-pipeline

Codebase for AjaxMultiCommentary
https://ajaxmulticommentary.github.io/ajmc-pipeline/
GNU Affero General Public License v3.0

how to inject ToC information into `CanonicalCommentary`? #5

Closed mromanello closed 1 year ago

mromanello commented 1 year ago

Hi @sven-nm

So this is what I created the other day for Lobeck's commentary:

[
    {
    "commentary_id": "bsb10234118",
    "toc":[
        {
            "section": "Praefatio", 
            "pages" : {
                "start": "0009",
                "end": "0014"
            }
        },
        {
            "section": "Ajax Hypothesis", 
            "pages" : {
                "start": "0017",
                "end": "0020"
            }
        },
        {
            "section": "Ajax text", 
            "pages" : {
                "start": "0021",
                "end": "0080"
            }
        },
        {
            "section": "Commentarius", 
            "pages" : {
                "start": "0081",
                "end": "0498"
            }
        },
        {
            "section": "Addenda et corrigenda", 
            "pages" : {
                "start": "0499",
                "end": "0504"
            }
        },
        {
            "section": "Index I. Rerum et vocabulorum", 
            "pages" : {
                "start": "0507",
                "end": "0505"
            }
        },
        {
            "section": "Index II. Scriptorum", 
            "pages" : {
                "start": "0519",
                "end": "0520"
            }
        }
    ] 
}
]

Related questions:

sven-nm commented 1 year ago

Ciao @mromanello!

Many thanks for this first exploration.

> What does the workflow look like for importing them?

Sections or chapters are ordinary text containers; they just need their own RawObject and CanonicalObject, which will be children of TextContainer. It should take about two lines of code.

> As to the creation, I'd go for a manual process (with ~15 commentaries we can afford that)

Definitely!

> Where do we store this data (before injection and after injection)?

Before injection: base_dir/comm_id/olr/sections.json?

After injection: simply in canonical.json, as a text container.
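The proposed pre-injection location can be sketched as follows. This is only an illustration under stated assumptions: `sections_path` and `load_sections` are hypothetical helpers, not part of the ajmc API, and the temporary directory merely stands in for `base_dir`.

```python
import json
import tempfile
from pathlib import Path


def sections_path(base_dir, comm_id):
    """Build the proposed pre-injection location of a commentary's ToC file."""
    return Path(base_dir) / comm_id / "olr" / "sections.json"


def load_sections(base_dir, comm_id):
    """Read the manually created ToC for one commentary."""
    with sections_path(base_dir, comm_id).open(encoding="utf-8") as f:
        return json.load(f)


# Round trip in a throwaway directory standing in for base_dir.
with tempfile.TemporaryDirectory() as base_dir:
    path = sections_path(base_dir, "bsb10234118")
    path.parent.mkdir(parents=True)
    toc = [{"section": "Praefatio", "pages": {"start": "0009", "end": "0014"}}]
    path.write_text(json.dumps(toc), encoding="utf-8")
    loaded = load_sections(base_dir, "bsb10234118")
```

Keeping one file per commentary directory means the commentary ID never has to be repeated inside the JSON itself.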

> What sanity checks to do before "accepting" this ToC? E.g. values in start and end must correspond to real page IDs?

Do we need validation for such a small, manually annotated dataset? If necessary, I would just `assert comm.id + '_' + section['start'] in [p.id for p in comm.children.pages]`.
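If validation does turn out to be worthwhile, the check could be extended roughly as below. This is a minimal sketch, not the ajmc API: `validate_toc` is a hypothetical helper, entries follow the first JSON layout in this thread (with a `pages` object), page IDs are assumed to have the form `<commentary_id>_<page number>`, and the list of page IDs is generated only for illustration.

```python
def validate_toc(comm_id, toc, page_ids):
    """Return a list of human-readable problems found in a ToC (empty if none)."""
    known = set(page_ids)
    problems = []
    for entry in toc:
        start, end = entry["pages"]["start"], entry["pages"]["end"]
        # Both boundaries must point at pages that actually exist.
        for label, number in (("start", start), ("end", end)):
            if f"{comm_id}_{number}" not in known:
                problems.append(
                    f"{entry['section']}: {label} page {number} is not a known page ID"
                )
        # A section cannot end before it starts.
        if int(start) > int(end):
            problems.append(f"{entry['section']}: start {start} is after end {end}")
    return problems


# Hypothetical page IDs; in practice they would come from the commentary's pages.
page_ids = [f"bsb10234118_{n:04d}" for n in range(1, 521)]
toc = [
    {"section": "Praefatio", "pages": {"start": "0009", "end": "0014"}},
    {"section": "Index I. Rerum et vocabulorum", "pages": {"start": "0507", "end": "0505"}},
]
problems = validate_toc("bsb10234118", toc, page_ids)
for problem in problems:
    print(problem)
```

Even for a small, hand-made dataset, a range check of this kind is cheap and catches transposed or mistyped page numbers.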

> How to access the ToC in the Python API?

Like any other text container (commentary.children.sections).

That said, I would go for a more generic ontology of section types, such as introduction, commentary and so on. We will want to access all our commentary sections by the same name (rather than comm.children.commentarius for Lobeck and comm.children.kommentar for Wecklein). I would therefore go for:

[  # The commentary ID is not necessary, as it is going to be the name of the file
  {
      "section_type": "index",
      "section_title": "Index II. Scriptorum",  # Optional, some of them are not named!
      "start": "0519",  # The "pages" wrapper doesn't seem necessary, in my opinion
      "end": "0520"
  },
...
]

Suggested ontology:

[
  "preface",
  "introduction",
  "hypothesis",
  "text",  # Possibly more fine-grained (translation, primary)
  "commentary",
  "index",  # Possibly more fine-grained (locorum, siglorum...)
  "appendix",
  ...
]
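
To make the proposal concrete, the flattened entry format and the ontology could be checked together along these lines. This is a sketch under assumptions: `Section`, `SECTION_TYPES` and `parse_sections` are hypothetical names rather than existing ajmc classes, the set holds only the types listed so far (the list above is open-ended), and labelling Lobeck's "Praefatio" as `preface` is my reading of the ontology.

```python
from dataclasses import dataclass
from typing import Optional

# Only the types named so far; the suggested ontology is explicitly open-ended.
SECTION_TYPES = {
    "preface", "introduction", "hypothesis", "text",
    "commentary", "index", "appendix",
}


@dataclass(frozen=True)
class Section:
    section_type: str
    start: str
    end: str
    section_title: Optional[str] = None  # Optional: some sections are not named

    def __post_init__(self):
        # Reject labels outside the shared ontology (e.g. "kommentar").
        if self.section_type not in SECTION_TYPES:
            raise ValueError(f"unknown section type: {self.section_type!r}")


def parse_sections(entries):
    """Turn the raw JSON entries into validated Section objects."""
    return [Section(**entry) for entry in entries]


sections = parse_sections([
    {"section_type": "index", "section_title": "Index II. Scriptorum",
     "start": "0519", "end": "0520"},
    {"section_type": "preface", "start": "0009", "end": "0014"},
])
```

Validating against a fixed set at load time is what guarantees that every commentary exposes its sections under the same names.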
mromanello commented 1 year ago

Hi @sven-nm

I'm doing a few more ToCs just to see whether the above schema + ontology works consistently.

I propose making section_type a list of strings instead of a single string. This would allow us to properly label sections that belong to several types in our taxonomy. For example, the section Ajax in De Romilly's commentary should have both text and commentary, since the text and the commentary are on the same page.

What do you think?
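A rough sketch of what the list-valued `section_type` would mean for lookups. The helper names are hypothetical, and the entries omit `start` and `end` because no page numbers for De Romilly are given here; the "Notice" entry is a made-up placeholder that only shows a plain string still being accepted.

```python
def normalize_section_types(value):
    """Accept either a single type or a list of types and always return a list."""
    return [value] if isinstance(value, str) else list(value)


def sections_of_type(sections, wanted):
    """Select every section labelled with the wanted type."""
    return [
        s for s in sections
        if wanted in normalize_section_types(s["section_type"])
    ]


sections = [
    # Text and commentary share the same pages, so the section carries both labels.
    {"section_type": ["text", "commentary"], "section_title": "Ajax"},
    # Placeholder entry (not from a real ToC) using the older single-string form.
    {"section_type": "introduction", "section_title": "Notice"},
]

commentary_titles = [s["section_title"] for s in sections_of_type(sections, "commentary")]
text_titles = [s["section_title"] for s in sections_of_type(sections, "text")]
```

With this shape, a section that mixes text and commentary is returned by both queries, which is the behaviour the proposal is after.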

mromanello commented 1 year ago

I've done the following ToCs: