direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

output list of variant readings occurring in JDSW but not attested in SBGY #16

Open GDRom opened 2 years ago

GDRom commented 2 years ago

There are instances in which LDM's JDSW correctly notes a reading not included in the SBGY. Our current approach fails to take these instances into account. These instances are rare, however, and often tied to archaic texts (like the Shangshu).

Examples thus far encountered include:

Suggested approach:

thatbudakguy commented 2 years ago

Right now this is controlled by: https://github.com/direct-phonology/jdsw/blob/50b6e6f50673891f5f880a18ff95f49e04dca472/bin/lib/phonology.py#L70-L71

where readings_for() just checks our SBGY file, so I could easily have it output a warning when rejecting things.
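A rough sketch of what that warning could look like. This is not the code in lib/phonology.py: `check_reading`, the logger name, and the placeholder table are all made up for illustration, and the readings are dummy strings rather than real SBGY data.

```python
import logging

logger = logging.getLogger("jdsw.phonology")  # logger name is an assumption

# Placeholder table standing in for the SBGY data: character -> attested readings.
SBGY_READINGS = {"字": {"reading_a", "reading_b"}}


def check_reading(char, reading, table=SBGY_READINGS, rejected=None):
    """Return True if `reading` is attested for `char`; otherwise warn and record it."""
    if reading in table.get(char, set()):
        return True
    logger.warning("reading %r for %r found in JDSW but not in SBGY", reading, char)
    if rejected is not None:
        # Keep the rejected pair so a list can be written out after the run.
        rejected.append((char, reading))
    return False


rejected = []
check_reading("字", "reading_a", rejected=rejected)  # attested, no warning
check_reading("字", "reading_z", rejected=rejected)  # not attested, warns
```

Collecting the rejected pairs in a list, in addition to logging them, would make it easy to dump them to a file for review afterwards.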

Do you think it's appropriate in this case to just add the readings to your GDR-SBGY-full.csv? If not, probably not too difficult to expand the Reconstruction class in lib/phonology.py to allow augmenting the sound table with more information after it's constructed.
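For the second option, something along these lines might work. `SoundTable` and its `augment` and `source_of` methods are hypothetical stand-ins, not the actual Reconstruction class, and the `readings_for` signature here is a guess; the idea is only that each reading remembers where it came from, so added readings stay distinguishable from SBGY ones.

```python
class SoundTable:
    """Hypothetical stand-in for a sound table that can be extended after construction."""

    def __init__(self, readings):
        # Map each character to {reading: source}; everything loaded here counts as SBGY.
        self._readings = {
            char: {reading: "SBGY" for reading in char_readings}
            for char, char_readings in readings.items()
        }

    def augment(self, char, reading, source):
        """Add a reading from another source without overwriting an existing entry."""
        self._readings.setdefault(char, {}).setdefault(reading, source)

    def readings_for(self, char):
        """Return all known readings for a character, regardless of source."""
        return set(self._readings.get(char, {}))

    def source_of(self, char, reading):
        """Return the source label for a reading, or None if it is unknown."""
        return self._readings.get(char, {}).get(reading)


table = SoundTable({"字": ["reading_a"]})   # dummy data, not real SBGY readings
table.augment("字", "reading_z", source="JDSW")
```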

GDRom commented 2 years ago

Thanks for pointing to the right line of code for this. Yeah, adding such a warning output would be great so I can analyze where things clash between the SBGY and the JDSW.

As for whether to append these readings to GDR-SBGY-full.csv: I'd rather not, since I think that data should be left as is. The example I provided above should not occur in the vast majority of medieval texts, nor in "regular" Han texts, but only in the domain of decidedly archaic texts (Shangshu, Maoshi, Yijing, etc.). I'd estimate that the same holds for most, if not all, "correct" readings that are omitted from the SBGY. I'd hence say those are clear exceptions; perhaps a separate readings_exceptions.csv or similar would be a better place to store them?
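To illustrate the idea, here is a minimal sketch of keeping the two files apart and checking them in order. The column names (`char`, `reading`, `source_text`), the helper names, and the rows are all assumptions for the example; the real layout of GDR-SBGY-full.csv may differ, and in-memory strings stand in for the files.

```python
import csv
import io
from collections import defaultdict


def load_readings(fileobj):
    """Read a CSV with `char` and `reading` columns into {char: {readings}}."""
    table = defaultdict(set)
    for row in csv.DictReader(fileobj):
        table[row["char"]].add(row["reading"])
    return dict(table)


# In-memory stand-ins for the two files, with dummy rows.
base_csv = io.StringIO("char,reading\n字,reading_a\n")
exceptions_csv = io.StringIO("char,reading,source_text\n字,reading_z,Shangshu\n")

base = load_readings(base_csv)
exceptions = load_readings(exceptions_csv)


def classify(char, reading):
    """Say which table, if any, attests the reading; the base table is checked first."""
    if reading in base.get(char, ()):
        return "sbgy"
    if reading in exceptions.get(char, ()):
        return "exception"
    return "unattested"
```

This way the SBGY data stays untouched, and anything that only matches the exceptions file can be reported separately.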

thatbudakguy commented 2 years ago

cool, readings_exceptions or readings_archaic or something makes sense to me.