Closed by thatbudakguy 1 year ago
- `.tsv` output to Prodigy's JSON-lines format
- `db-in` recipe to load all of the data into a database
- `prodigy train` to auto-infer the suggestion function for the spancat and test out training one
- `train-curve` to see how the model responds to more or less data

Jeff & Hantao's span categories include very granular info on some kinds of content, but skip over other kinds we're interested in:
TAG_MAP = {
    "E": "E",    # headword
    "B": "T",    # book title
    "BC": "C",   # commentary on book title
    "F": "F",    # fanqie
    "T": "T",    # poem title
    "J": "T",    # juan number
    "C": "C",    # commentary on headword
    "CF": "F",   # fanqie reading for char in commentary
    "CC": "C",   # commentary on commentary
    "S": "T",    # section title
    "SC": "C",   # commentary on section title
    "SF": "F",   # fanqie reading for char in section title
    "SS": "T",   # sub-section title
    "SSC": "C",  # commentary on sub-section title
    "SSF": "F",  # fanqie reading for char in sub-section title
}
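For illustration, here is a rough sketch of how a mapping like this could collapse the granular tags into coarser labels while building Prodigy-style span tasks. The `(text, tag)` segment format and the `to_prodigy_example` helper are hypothetical, since the actual `.tsv` layout isn't shown here, and the input is a toy example rather than real data. Depending on the recipe, Prodigy may also expect token information alongside the character offsets.

```python
import json

# Excerpt of the TAG_MAP above, repeated so this sketch runs on its own.
TAG_MAP = {"E": "E", "F": "F", "C": "C", "CF": "F", "B": "T"}


def to_prodigy_example(segments):
    """Build one Prodigy-style task from (text, tag) segments.

    `segments` is a hypothetical intermediate form, not the real .tsv layout.
    A tag of None marks text that carries no annotation.
    """
    text = ""
    spans = []
    for chunk, tag in segments:
        start = len(text)  # character offset where this chunk begins
        text += chunk
        if tag is not None:
            # Collapse the granular tag into its coarse label.
            spans.append({"start": start, "end": len(text), "label": TAG_MAP[tag]})
    return {"text": text, "spans": spans}


# Toy input: a headword, a fanqie reading, and a commentary.
example = to_prodigy_example([("東", "E"), ("德紅切", "F"), ("春方也", "C")])

# One JSON object per line is what a .jsonl file would contain.
line = json.dumps(example, ensure_ascii=False)
print(line)
```

Each task written this way would be one line of the `.jsonl` file that `db-in` loads.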
I determined that training a model based on this data doesn't really fit our research question. We can already identify fanqie without a model, and that's mostly what this data does, so there doesn't seem to be much point in pursuing it (except perhaps later to aid in detecting which of the characters in the headword is being annotated).
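To make the "without a model" point concrete: fanqie readings conventionally take the form of two characters followed by 切 (or 反 in some texts), so a pattern match gets most of the way there. The sketch below is only a guess at what a rule-based approach might look like, not the method actually used in this project. The `FANQIE` pattern and `find_fanqie` helper are made-up names, and the character range ignores rare characters outside the Basic Multilingual Plane.

```python
import re

# Two CJK characters followed by the fanqie marker 切 or 反.
# Assumption: the characters fall in U+3400..U+9FFF; rarer ones are not matched.
FANQIE = re.compile(r"([\u3400-\u9fff])([\u3400-\u9fff])([切反])")


def find_fanqie(text):
    """Return (start, end, matched_text) for each candidate fanqie formula."""
    return [(m.start(), m.end(), m.group(0)) for m in FANQIE.finditer(text)]


# Toy input, not taken from the dataset.
matches = find_fanqie("東德紅切春方也")
print(matches)
```

A pattern this simple will also fire where 切 or 反 appears as an ordinary word, so it would likely need filtering by position or context.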
Using the categorization system established by Tharsen & Wang, we can see how spaCy's span categorizer performs on its own, irrespective of tokenization, etc.
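For context, the span categorizer doesn't predict spans directly: a suggestion function proposes candidate spans, and the model scores each one against the labels. spaCy ships n-gram suggesters for this. The plain-Python sketch below only illustrates the idea of enumerating n-gram candidates over character-level tokens; `ngram_candidates` is a hypothetical helper and not spaCy's implementation.

```python
def ngram_candidates(tokens, sizes=(1, 2, 3)):
    """Enumerate (start, end) token offsets for every n-gram of the given sizes."""
    candidates = []
    for size in sizes:
        # Slide a window of `size` tokens across the sequence.
        for start in range(len(tokens) - size + 1):
            candidates.append((start, start + size))
    return candidates


# With one token per character, tokenization choices drop out entirely.
chars = list("德紅切")
spans = ngram_candidates(chars, sizes=(1, 2, 3))
print(spans)
```

The number of candidates grows with both text length and the largest n-gram size, which is one reason to let `prodigy train` infer sensible sizes from the annotated spans.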
It would probably be helpful to have a graphical interface for this via Streamlit, for testing out arbitrary annotations and seeing what the model predicts.