direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

train a span categorizer on jeff & hantao's data #45

Closed thatbudakguy closed 1 year ago

thatbudakguy commented 1 year ago

using the categorization system established by tharsen & wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.

it would probably be helpful to have a graphical interface for this via streamlit, for testing out arbitrary annotations and seeing what the model predicts.

thatbudakguy commented 1 year ago

Jeff & Hantao's span categories include very granular info on some kinds of content, but skip over other kinds we're interested in:

TAG_MAP = {
  "E": "E",     # headword
  "B": "T",     # book title
  "BC": "C",    # commentary on book title
  "F": "F",     # fanqie
  "T": "T",     # poem title
  "J": "T",     # juan number
  "C": "C",     # commentary on headword
  "CF": "F",    # fanqie reading for char in commentary
  "CC": "C",    # commentary on commentary
  "S": "T",     # section title
  "SC": "C",    # commentary on section title
  "SF": "F",    # fanqie reading for char in section title
  "SS": "T",    # sub-section title
  "SSC": "C",   # commentary on sub-section title
  "SSF": "F",   # fanqie reading for char in sub-section title
}
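
A minimal sketch of how a map like this might be applied, assuming the keys are the granular labels from Jeff & Hantao's data and the values are the coarser categories we care about. The `(start, end, label)` tuple format and the `collapse_labels` helper are hypothetical, not taken from the repo; only a subset of the map is repeated so the snippet runs on its own.

```python
# Subset of the TAG_MAP above, repeated here so the sketch is self-contained.
TAG_MAP = {
    "E": "E",    # headword
    "B": "T",    # book title
    "BC": "C",   # commentary on book title
    "F": "F",    # fanqie
    "CF": "F",   # fanqie reading for char in commentary
}


def collapse_labels(spans, tag_map=TAG_MAP):
    """Rewrite granular span labels to coarse ones.

    `spans` is a list of (start, end, label) tuples (a hypothetical format).
    Spans whose label is missing from the map are dropped.
    """
    return [
        (start, end, tag_map[label])
        for start, end, label in spans
        if label in tag_map
    ]


# Made-up offsets, for illustration only.
spans = [(0, 2, "B"), (2, 5, "BC"), (5, 8, "CF"), (8, 9, "X")]
coarse = collapse_labels(spans)
```

Dropping unmapped labels is one possible choice; raising an error instead would surface labels the map doesn't cover.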

I determined that training a model on this data doesn't really fit our research question. We can already identify fanqie without a model, and fanqie is mostly what this data marks up, so there doesn't seem to be much point in pursuing it (except perhaps later, to help detect which character in the headword is being annotated).
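
To illustrate what "without a model" could look like, here is a rough sketch using a regular expression. It assumes a fanqie gloss is written as exactly two characters followed by 反 (or 切); that pattern, the `find_fanqie` helper, and the sample string are illustrative assumptions, not the project's actual rule, and a pattern this simple will produce false positives wherever 反 or 切 appears as an ordinary word.

```python
import re

# Assumption: a fanqie is two CJK characters (initial speller, final speller)
# followed by the marker 反 or 切.
FANQIE = re.compile(r"([\u4e00-\u9fff])([\u4e00-\u9fff])[反切]")


def find_fanqie(text):
    """Return (start, end, initial_speller, final_speller) for each match."""
    return [
        (m.start(), m.end(), m.group(1), m.group(2))
        for m in FANQIE.finditer(text)
    ]


# Illustrative string, not quoted from the annotated data.
matches = find_fanqie("樂音洛又五教反")
```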