direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

better handling for cases of multiple readings #17

Open thatbudakguy opened 2 years ago

thatbudakguy commented 2 years ago

right now there are a variety of cases where Reconstruction can raise a MultipleReadingsError: when fetching an initial, a rime, or an entire reading. for at least some of these cases, I think we could do something a little smarter. an example:

  1. we see a 長 in the text without an annotation from LDM, and go to the guangyun looking for a reading.
  2. we see that 長 has three available readings: drjangH, drjang, trjangX.
  3. we divide each of these into initial, rime, and tone (see #6).
  4. for the initial, we have two options: dr and tr.
  5. for the rime, we have only one option, which we can confidently annotate: jang.
  6. for the tone, we have three options: level, rising, and departing.

there's still ambiguity here, but much less ambiguity than simply giving up and not assigning a reading! if we can come up with a systematic way of noting the ambiguity, as B&S do for their OC reconstruction (using things like brackets), we might still salvage some information that would help an algorithm or a human manually correcting the data. for example:

[dr|tr]jang[X|H|_]
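a rough sketch of how the merge could work, in case it helps the discussion. everything here is hypothetical: `INITIALS` is only a tiny illustrative subset of Baxter's initials, and `split_reading`, `merge_part`, and `merge_readings` are made-up names, not existing jdsw functions. it also assumes the tone is marked only by a trailing X or H, with `_` standing in for an unmarked tone.

```python
from typing import NamedTuple

# Illustrative subset of initials only (assumption); a real list would be longer.
INITIALS = ("trh", "tr", "dr", "nr", "th", "t", "d", "n")


class Syllable(NamedTuple):
    initial: str
    rime: str
    tone: str  # "X" (rising), "H" (departing), or "_" when no tone letter is present


def split_reading(reading: str) -> Syllable:
    """Split a transcription such as 'drjangH' into initial, rime, and tone."""
    tone = reading[-1] if reading.endswith(("X", "H")) else "_"
    body = reading if tone == "_" else reading[:-1]
    # Try longer initials first so that 'trh' is not mistaken for 'tr' or 't'.
    for initial in sorted(INITIALS, key=len, reverse=True):
        if body.startswith(initial):
            return Syllable(initial, body[len(initial):], tone)
    raise ValueError(f"no known initial in {reading!r}")


def merge_part(options, sep="|"):
    """Return the single agreed value, or a bracketed list of the alternatives."""
    unique = list(dict.fromkeys(options))  # de-duplicate, keeping first-seen order
    return unique[0] if len(unique) == 1 else "[" + sep.join(unique) + "]"


def merge_readings(readings, sep="|"):
    """Merge several readings part by part instead of raising an error."""
    parts = [split_reading(r) for r in readings]
    return Syllable(*(merge_part(column, sep) for column in zip(*parts)))


merged = merge_readings(["drjangH", "drjang", "trjangX"])
print("".join(merged))  # [dr|tr]jang[H|_|X]
```

the alternatives come out in the order the readings were listed, so the tone here is `[H|_|X]` rather than `[X|H|_]`; a fixed sort order might be worth adding if the output needs to be stable.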

and if we annotate each part in a separate field, this might make it into the CoNLL-U as:

MCInitial=[dr/tr]|MCRime=jang|MCTone=[X/H/_]

(using / instead of | inside the brackets, since | is reserved to separate annotations in CoNLL-U MISC and FEATS fields.)
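a minimal sketch of writing and reading such a field, assuming each part arrives as a list of candidate values. `to_misc` and `from_misc` are hypothetical helpers, not part of jdsw, spaCy, or any CoNLL-U library; the key names are the ones proposed above.

```python
def to_misc(parts: dict[str, list[str]]) -> str:
    """Serialize candidate values per part into a CoNLL-U MISC-style string."""
    fields = []
    for key, options in parts.items():
        unique = list(dict.fromkeys(options))  # de-duplicate, keeping order
        for option in unique:
            # '|' separates annotations and '/' separates alternatives here,
            # so neither may appear inside a value.
            if "|" in option or "/" in option:
                raise ValueError(f"reserved character in {option!r}")
        value = unique[0] if len(unique) == 1 else "[" + "/".join(unique) + "]"
        fields.append(f"{key}={value}")
    return "|".join(fields)


def from_misc(misc: str) -> dict[str, list[str]]:
    """Parse the string back into a list of candidate values per part."""
    parsed = {}
    for field in misc.split("|"):
        key, _, value = field.partition("=")
        parsed[key] = value.strip("[]").split("/")
    return parsed


misc = to_misc({
    "MCInitial": ["dr", "dr", "tr"],
    "MCRime": ["jang", "jang", "jang"],
    "MCTone": ["H", "_", "X"],
})
print(misc)  # MCInitial=[dr/tr]|MCRime=jang|MCTone=[H/_/X]
```

a part with exactly one candidate is written without brackets, so anything consuming the file can tell ambiguous parts from confident ones just by looking for `[`.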

this also helps in the (unfortunately many) cases where LDM did provide an annotation, but one or both of the characters in his fanqie happen to be polyphones.

GDRom commented 2 years ago

This sounds like a brilliant solution compared to our previous approach. And you are right, this makes things much clearer for a human reader, as the structure you are proposing inherently draws attention to what's unclear.

Also, just to follow up on the LDM case: logically, this would result in one of two outcomes:

  1. the character LDM provided is still ambiguous, because the syllable segment it supplies, for example MCInitial=[dr/tr], is itself ambiguous
  2. the character is no longer ambiguous, because LDM only relies on a syllable segment that is clear, for example MCRime=jang|MCTone=X
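A small sketch of those two outcomes, assuming the usual fanqie convention that the upper character supplies the initial and the lower character supplies the rime and tone. `resolve_fanqie` and `ambiguous_parts` are hypothetical names, and the only data used is the set of three readings for 長 listed earlier in this issue.

```python
Reading = tuple[str, str, str]  # (initial, rime, tone), with "_" for no tone letter


def options(values) -> list[str]:
    """De-duplicate candidate values while keeping their original order."""
    return list(dict.fromkeys(values))


def resolve_fanqie(upper: list[Reading], lower: list[Reading]) -> dict[str, list[str]]:
    """Combine two spellers: initial from the upper, rime and tone from the lower."""
    return {
        "MCInitial": options(r[0] for r in upper),
        "MCRime": options(r[1] for r in lower),
        "MCTone": options(r[2] for r in lower),
    }


def ambiguous_parts(resolved: dict[str, list[str]]) -> list[str]:
    """Name the parts that still have more than one candidate."""
    return [key for key, values in resolved.items() if len(values) > 1]


# The three readings of 長 from this issue, already split into parts.
CHANG: list[Reading] = [("dr", "jang", "H"), ("dr", "jang", "_"), ("tr", "jang", "X")]

# Using 長 in both positions purely to illustrate: as the upper speller its
# initial stays ambiguous (outcome 1), while as the lower speller its rime
# is clear (outcome 2).
resolved = resolve_fanqie(upper=CHANG, lower=CHANG)
print(ambiguous_parts(resolved))  # ['MCInitial', 'MCTone']
```

One simplification to be aware of: the rime and tone of the lower character are collected independently here, so any correlation between them within a single reading is lost.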