parse annotations using a model

thatbudakguy commented 1 year ago

annotations seem to have a reliable internal structure which might lend itself well to dependency parsing. perhaps we can define custom terms (for fanqie, qualifiers, citations, etc.) and then use a dependency parser to automatically parse out the interesting parts of each annotation.

thatbudakguy commented 1 year ago

this might be a better fit for spaCy's new SpanCategorizer. maybe we can use this example project and some smart pre-annotating to bootstrap a model for parsing annotations. we could extract:

fanqie, e.g. AB反
yin, e.g. 音X
qualifiers, e.g. ...下同, ...注同
notes about textual variation, e.g. 本又作..., 本亦作..., [work]作...
the mystery 如字
references to other commentaries using 云
conjunctions, e.g. ...或..., ...又...
semantic indicators, e.g. X也 and AB之[A|B]
citations with numbers, e.g. 凡三篇正二攝一
卦?
徐?
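
a rough sketch of what the "smart pre-annotating" step could look like, using only regular expressions. the label names (`FANQIE`, `YIN`, etc.), the patterns, and the `pre_annotate` helper are all hypothetical and derived only from the examples in the list above; they are not the project's actual scheme, and the sample string is purely illustrative.

```python
import re

# Assumption: a single "X" or "A"/"B" placeholder in the list above stands for
# one character in the basic CJK Unified Ideographs block.
HAN = r"[\u4e00-\u9fff]"

# Hypothetical labels and patterns covering only some of the listed categories.
PATTERNS = [
    ("FANQIE", re.compile(rf"{HAN}{{2}}反")),      # AB反
    ("YIN", re.compile(rf"音{HAN}")),              # 音X
    ("QUALIFIER", re.compile(r"下同|注同")),        # ...下同, ...注同
    ("VARIANT", re.compile(rf"本[又亦]作{HAN}")),   # 本又作..., 本亦作...
    ("RUZI", re.compile(r"如字")),                 # 如字
]


def pre_annotate(text):
    """Return sorted (start, end, label) character spans; overlaps are allowed."""
    spans = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


print(pre_annotate("丁丈反下同"))
# [(0, 3, 'FANQIE'), (3, 5, 'QUALIFIER')]
```

spans are allowed to overlap because a span categorizer, unlike a standard entity recognizer, can predict overlapping spans. character offsets like these could later be converted into spaCy spans for correction and training, though that step is not shown here.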

thatbudakguy commented 1 year ago

approach is now outlined in docs/pipeline.md; closing in favor of more specific tickets.

direct-phonology/jdsw#32