direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

clean annotations of commentaries from jdsw #10

Closed: thatbudakguy closed 2 years ago

thatbudakguy commented 2 years ago

copied/adapted notes from 5/27 meeting:

  1. Look through the JDSW and break it up into a key: value store, where each key is the unbroken sequence of characters immediately preceding an annotation and each value is that annotation
  2. For each key: value pair...
     a. Look through the source text (same chapter) and find the first instance of the key (unbroken) that occurs after the previous annotation (annotations must be sequential)
     b. If that key is found and it's in the source text (not a commentary), leave it alone in the JDSW
     c. If that key is found and it's in the commentary (indicated in SBCK editions in brackets), drop it from the JDSW
     d. If that key isn't found at all, log it along with the previous and next annotations so that @GDRom can investigate manually
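
A rough sketch of step 2 in Python, to make the three outcomes concrete. Everything named here is hypothetical and not taken from the repo: `filter_annotations`, `commentary_mask`, the `Result` container, and the input shapes are all assumptions. It assumes step 1 has already produced an ordered list of `(key, annotation)` pairs for one chapter, that the chapter's source text is a single string, and that commentary is wrapped in `[` and `]` (the actual bracket characters in the SBCK files may differ, so they are parameters).

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    kept: list = field(default_factory=list)     # key found in the main text
    dropped: list = field(default_factory=list)  # key found inside a commentary
    missing: list = field(default_factory=list)  # (previous, entry, next) for manual review


def commentary_mask(source, open_br="[", close_br="]"):
    """Return one boolean per character: True if it lies inside brackets."""
    mask, depth = [], 0
    for ch in source:
        if ch == open_br:
            depth += 1
            mask.append(True)
        elif ch == close_br:
            mask.append(True)
            depth = max(depth - 1, 0)
        else:
            mask.append(depth > 0)
    return mask


def filter_annotations(entries, source, open_br="[", close_br="]"):
    """Sort (key, annotation) pairs into kept / dropped / missing.

    The search for each key starts where the previous match ended,
    so annotations are matched in sequence (step 2a).
    """
    mask = commentary_mask(source, open_br, close_br)
    result = Result()
    cursor = 0
    for i, (key, annotation) in enumerate(entries):
        pos = source.find(key, cursor)
        if pos == -1:
            # Step 2d: not found, so record the neighbours for manual review.
            prev_entry = entries[i - 1] if i > 0 else None
            next_entry = entries[i + 1] if i + 1 < len(entries) else None
            result.missing.append((prev_entry, (key, annotation), next_entry))
            continue
        cursor = pos + len(key)
        if any(mask[pos:pos + len(key)]):
            result.dropped.append((key, annotation))  # step 2c
        else:
            result.kept.append((key, annotation))     # step 2b
    return result


# Toy input: main text with one bracketed commentary (made-up data).
source = "天地玄黃[玄者黑也]宇宙洪荒"
entries = [("玄", "a"), ("黑", "b"), ("日月", "d"), ("洪荒", "c")]
result = filter_annotations(entries, source)
print(result.kept)
print(result.dropped)
print(result.missing)
```

In this sketch a key that is not found leaves the cursor where it was, so one bad entry does not derail the matches after it. Whether that is the right recovery behaviour is an open choice, not something settled in the meeting notes.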

Assumption: If LDM annotates two successive characters, the second annotation refers to the instance of that character that is closest in the source text to the previous character.

This will produce a version of the JDSW that leaves out any annotations referring to commentaries, which we can later align to the 正文 (main text) versions.