direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

implement conversion from middle chinese to old chinese #3

Open thatbudakguy opened 2 years ago

thatbudakguy commented 2 years ago

depends on direct-phonology/jdsw#2.

GDRom commented 2 years ago

Note: there are a few dozen characters where a single MC reading corresponds to more than one OCNR reconstruction, which would trigger the following issue:

Example:

  1. 沈 chén, sink [v.t.]; MC drim < OCNR *C.[d]r[ə]m
  2. 沈 chén, sink [v.i.]; MC drim < OCNR *[d]r[ə]m

Such differences ought to show up predominantly in the initial or preinitial position.

For OC implementation, I'd have to disambiguate manually.
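A minimal sketch of how the cases needing manual disambiguation could be listed automatically, assuming the OCNR data is available as rows of (character, gloss, MC, OC). The row layout and the `find_ambiguous` helper are hypothetical; the two rows are the 沈 entries from the example above.

```python
from collections import defaultdict

# Hypothetical row layout: (character, gloss, MC reading, OC reconstruction).
# These two rows are the 沈 example from this thread.
ENTRIES = [
    ("沈", "sink [v.t.]", "drim", "*C.[d]r[ə]m"),
    ("沈", "sink [v.i.]", "drim", "*[d]r[ə]m"),
]


def find_ambiguous(entries):
    """Return {(character, MC): set of OC forms} for every character/MC
    pair that maps to more than one OC reconstruction."""
    by_reading = defaultdict(set)
    for char, _gloss, mc, oc in entries:
        by_reading[(char, mc)].add(oc)
    # Keep only the pairs where the MC reading alone cannot pick the OC form.
    return {key: ocs for key, ocs in by_reading.items() if len(ocs) > 1}


if __name__ == "__main__":
    for (char, mc), ocs in find_ambiguous(ENTRIES).items():
        print(char, mc, sorted(ocs))
```

Anything this returns would still have to be resolved by hand; the sketch only produces the worklist.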

thatbudakguy commented 2 years ago

It's fascinating that these types of subtle changes nearly always seem to have a syntactic or semantic correlate (here, the transitivity of the verb, which took me a second to notice!)

Is it worth going through OCNR and pulling out all of these quasi-"minimal pairs" to see if we can come up with a rule? I ask because I assume the annotation process for everything we annotate (phonology included) is going to be "automated first, manual later". If we do that process for POS first, we can then use the POS information to make "smarter" initial predictions for the phonology.

I'm actually not sure how transitivity is represented in CoNLL-U (maybe that's the dependency parse?), so these would both be VERB in the POS category anyway, but I'm just thinking further about this. It would at least help for the polyphones in Middle Chinese where the POS (verb vs. noun) actually can disambiguate the reading.
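A rough sketch of what a POS-informed first guess could look like. The table layout and the `predict_mc` helper are hypothetical, and the 王 entry (MC hjwang as a noun, hjwangH as a verb) is an illustrative polyphone that is not taken from this thread.

```python
# Hypothetical lookup: character -> {UPOS tag -> MC reading}.
# The 王 entry is an illustrative assumption, not data from this project.
READINGS = {
    "王": {"NOUN": "hjwang", "VERB": "hjwangH"},
}


def predict_mc(char, upos, readings=READINGS):
    """Return an MC reading for char given its UPOS tag, or None when the
    POS does not settle it and manual annotation is needed."""
    by_pos = readings.get(char)
    if by_pos is None:
        return None  # character not in the table
    if upos in by_pos:
        return by_pos[upos]
    # POS not listed: only safe to answer if all listed readings agree.
    unique = set(by_pos.values())
    return unique.pop() if len(unique) == 1 else None


if __name__ == "__main__":
    print(predict_mc("王", "NOUN"))
    print(predict_mc("王", "VERB"))
```

Cases like 沈, where both readings are VERB, would fall through to `None` here and stay manual.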

GDRom commented 2 years ago

I should have spelled out the difference between transitive and intransitive fully; my apologies!

As to your question -- not sure if it's worth it? Maybe? We should discuss the whole process in more detail.

Overall, I'd be tempted to keep the "simple" LDM-model clear of those assumptions (I don't think our friend in the 6th century cared about the difference between intransitive and transitive, as both were read the same and meant [largely] the same to him). Otherwise, we might run into the issue of circular logic (as we'd take Baxter and Sagart's assumptions and build the entire model on them).

We could, however, include the transitive vs. intransitive distinction in a full-on OCNR model; not sure where to put that in CoNLL-U either, though. UD-Kanbun does not distinguish between the two, I think; implicitly, the dependency parse would provide that information (a VERB with a dependent NOUN object is transitive; a VERB without one is intransitive). As a consequence, we could test how much better the OCNR model does than the LDM-model (as in: does linguistics help us understand what's going on better than LDM's notes on 音義?)
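A minimal sketch of reading transitivity off the dependency parse, assuming that "dependent NOUN" can be approximated by the UD `obj` relation. The two toy sentences are hand-made for illustration and do not come from UD-Kanbun or any treebank; the helper names are hypothetical.

```python
# Toy CoNLL-U, hand-annotated for illustration only (not from a treebank).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
TOY_CONLLU = """\
# text = 人沈舟
1\t人\t人\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\t沈\t沈\tVERB\t_\t_\t0\troot\t_\t_
3\t舟\t舟\tNOUN\t_\t_\t2\tobj\t_\t_

# text = 舟沈
1\t舟\t舟\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\t沈\t沈\tVERB\t_\t_\t0\troot\t_\t_
"""


def parse_sentences(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:  # a blank line ends the current sentence
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append({"id": int(cols[0]), "form": cols[1],
                        "upos": cols[3], "head": int(cols[6]),
                        "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences


def verb_transitivity(sentence):
    """Label each VERB by whether some token attaches to it as obj."""
    heads_with_obj = {tok["head"] for tok in sentence
                      if tok["deprel"].split(":")[0] == "obj"}
    return [(tok["form"],
             "transitive" if tok["id"] in heads_with_obj else "intransitive")
            for tok in sentence if tok["upos"] == "VERB"]


if __name__ == "__main__":
    for sent in parse_sentences(TOY_CONLLU):
        print(verb_transitivity(sent))
```

This only sees overt objects, so a transitive verb with a dropped object would be labelled intransitive; that limit would need checking against real data.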