train a span categorizer on jeff & hantao's data

thatbudakguy commented 1 year ago

using the categorization system established by tharsen & wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.

it would probably be helpful to have a graphical interface for this via streamlit, for testing out arbitrary annotations and seeing what the model predicts.

thatbudakguy commented 1 year ago

[x] write a script that converts t&w's .tsv output to prodigy's json-lines format
[x] use the db-in recipe to load all of the data into a database
[x] use prodigy train, which auto-infers the suggester function for the spancat, and test out training a model
[ ] try out train-curve to see how the model responds to more or less data
[ ] add functions to the streamlit app to run the model on arbitrary input and display predictions with the span visualizer
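
A minimal sketch of the first step above (converting .tsv to JSON-lines), not the actual script. The real layout of t&w's .tsv isn't shown in this thread, so the column names `text`, `start`, `end`, and `label` are assumptions, as is the sample row. The output shape (a `text` string plus a `spans` list of `start`/`end`/`label` character offsets) is the task format prodigy uses for span annotations.

```python
import csv
import io
import json
from collections import defaultdict


def tsv_to_jsonl(tsv_file, jsonl_file):
    """Group span rows by their text and write one JSON task per line.

    Assumes (hypothetically) one span per TSV row with the columns
    text, start, end, label; offsets are character offsets into text.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    spans_by_text = defaultdict(list)
    for row in reader:
        spans_by_text[row["text"]].append(
            {
                "start": int(row["start"]),
                "end": int(row["end"]),
                "label": row["label"],
            }
        )
    for text, spans in spans_by_text.items():
        task = {"text": text, "spans": spans}
        # ensure_ascii=False keeps the Chinese text readable in the file
        jsonl_file.write(json.dumps(task, ensure_ascii=False) + "\n")


# Made-up sample data, only to show the shape of input and output.
sample_tsv = (
    "text\tstart\tend\tlabel\n"
    "東德紅反\t0\t1\tE\n"
    "東德紅反\t1\t4\tF\n"
)
out = io.StringIO()
tsv_to_jsonl(io.StringIO(sample_tsv), out)
print(out.getvalue())
```

With real files, the two arguments would be open file handles instead of `io.StringIO` objects, and the resulting .jsonl is what the db-in step would load.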

thatbudakguy commented 1 year ago

Jeff & Hantao's span categories include very granular info on some kinds of content, but skip over other kinds we're interested in. The map below collapses their tags (keys) into four coarser labels (values: E, T, C, F):

TAG_MAP = {
  "E": "E",     # headword
  "B": "T",     # book title
  "BC": "C",    # commentary on book title
  "F": "F",     # fanqie
  "T": "T",     # poem title
  "J": "T",     # juan number
  "C": "C",     # commentary on headword
  "CF": "F",    # fanqie reading for char in commentary
  "CC": "C",    # commentary on commentary
  "S": "T",     # section title
  "SC": "C",    # commentary on section title
  "SF": "F",    # fanqie reading for char in section title
  "SS": "T",    # sub-section title
  "SSC": "C",   # commentary on sub-section title
  "SSF": "F",   # fanqie reading for char in sub-section title
}
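
One way the tag map above could be applied is sketched here, assuming tasks shaped like prodigy span annotations (a `text` plus a list of `spans` with a `label` each). The function name `collapse_labels`, the choice to drop spans whose tag isn't in the map, and the sample task are all assumptions for illustration; only a subset of the map is repeated so the snippet runs on its own.

```python
def collapse_labels(task, tag_map):
    """Return a copy of the task with each span's label mapped to its coarse label.

    Spans whose label is missing from tag_map are dropped (a design
    choice made for this sketch, not something decided in the thread).
    """
    spans = [
        {**span, "label": tag_map[span["label"]]}
        for span in task.get("spans", [])
        if span["label"] in tag_map
    ]
    return {**task, "spans": spans}


# Subset of the full map, repeated here only to keep the example self-contained.
TAG_MAP_SUBSET = {"E": "E", "F": "F", "CF": "F", "SC": "C"}

# Made-up task for illustration.
example_task = {
    "text": "東德紅反",
    "spans": [
        {"start": 0, "end": 1, "label": "E"},
        {"start": 1, "end": 4, "label": "CF"},
        {"start": 0, "end": 4, "label": "XX"},  # unknown tag, gets dropped
    ],
}
collapsed = collapse_labels(example_task, TAG_MAP_SUBSET)
print(collapsed["spans"])
```

The original task is left unmodified, so the granular labels stay available if they turn out to be useful later.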

Training a model on this data doesn't really fit our research question. Most of what the data marks up is fanqie, which we can already identify without a model, so there doesn't seem to be much point in pursuing it (except perhaps later, to help detect which character in the headword is being annotated).
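
To illustrate why fanqie doesn't need a model, here is a rough rule-based sketch. It is not the project's actual method, which isn't shown in this thread. It assumes a fanqie reading is written as two characters followed by 反 or 切, and it only covers the basic CJK Unified Ideographs block, so rarer characters in extension blocks would be missed. The function name `find_fanqie` is hypothetical.

```python
import re

# Two CJK characters (initial speller, final speller) followed by the
# marker 反 or 切. The pattern itself is an assumption for this sketch.
FANQIE_PATTERN = re.compile(r"([\u4e00-\u9fff])([\u4e00-\u9fff])[反切]")


def find_fanqie(text):
    """Return (initial_char, final_char) pairs for each fanqie-like match."""
    return [(m.group(1), m.group(2)) for m in FANQIE_PATTERN.finditer(text)]


# Made-up example string.
print(find_fanqie("東，德紅反。"))
```

A pattern like this finds the reading itself but says nothing about which headword character it belongs to, which is the part where a trained model might still help.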

direct-phonology / jdsw

train a span categorizer on jeff & hantao's data #45