direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

automatically align jdsw with cleaned source texts #11

Open thatbudakguy opened 2 years ago

thatbudakguy commented 2 years ago

@GDRom has already done some work to do this manually; we want to see if we can automate it.

#10 is a prerequisite for getting the JDSW in shape to align.

#9 is a prerequisite for getting 正文 versions to align to.

this uses a modified version of the algorithm from #10:

  1. Look through the cleaned JDSW from #10 and break it up into a key: value store, where each key is the unbroken sequence of characters preceding an annotation and each value is that annotation
  2. For each key: value pair...
     a. Look through the source text (same chapter) and find the first instance of the key (unbroken) that occurs after the previous annotation (annotations must be sequential)
     b. If that key is found, take the JDSW annotation and insert it into the source text at that point
     c. If that key isn't found, just skip it, since we'll already know about it from #10
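A minimal sketch of the steps above, not the actual script. Assumptions: the key: value store is passed in as an ordered list of `(key, annotation)` tuples, the source chapter is one plain string, and "insert at that point" means right after the matched key, wrapped in square brackets. The name `align_annotations` and the bracket format are made up for illustration, and the annotations in the usage example are placeholders rather than real JDSW text.

```python
def align_annotations(pairs, source):
    """Insert each annotation after the first in-order match of its key.

    pairs  -- ordered (key, annotation) tuples taken from the cleaned JDSW
    source -- the source text of the same chapter, as a single string
    Returns the annotated text and the list of keys that were skipped.
    """
    pieces = []    # chunks of the annotated output
    skipped = []   # keys with no match after the previous annotation
    cursor = 0     # end of the previous match; every search starts here

    for key, annotation in pairs:
        idx = source.find(key, cursor)
        if idx == -1:
            # Step c: not found, so skip it and keep the cursor where it is.
            skipped.append(key)
            continue
        end = idx + len(key)
        # Step b: copy the text up to and including the key, then the note.
        pieces.append(source[cursor:end])
        pieces.append(f"[{annotation}]")
        cursor = end

    pieces.append(source[cursor:])  # whatever follows the last match
    return "".join(pieces), skipped


# Placeholder annotations "A", "B", "C"; the second 時習 comes too late to match.
text, missed = align_annotations(
    [("時習", "A"), ("說", "B"), ("時習", "C")],
    "學而時習之不亦說乎",
)
print(text)    # 學而時習[A]之不亦說[B]乎
print(missed)  # ['時習']
```

A list of tuples is used instead of a dict because the same key can occur more than once in a chapter, and a dict would collapse the repeats.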
GDRom commented 2 years ago

Following up on this, as I will soon approach 2.d) from #10, to look into what's going on with the keys that couldn't be assigned.

Ideally, I will identify an underlying logic to that issue. If I can't, but I do manage to locate individual keys within the source text, should I go ahead and insert those into the relevant source texts in /out as well?

thatbudakguy commented 2 years ago

last night I ran the algorithm from #10 on everything in out/jdsw (except the laozi, which we don't have an sbck edition of). if you take a peek at those files, you should see in the third column a note about whether the jdsw annotation matched the source text (in the sbck edition), the commentary (in the sbck edition), or wasn't found. taking a close look to see if the algorithm seems to be correct (and why things aren't found) would be super helpful at this point.

after that, depending on what you find, I'll implement the logic in this issue (which should be pretty similar to #10) in another script. when that script runs, it'll copy any of the annotations from out/jdsw that have the "source" note in the third column and paste them into the MISC column for the corresponding token in a new CoNLL-U file, which will be taken from out/zhengwen (i haven't generated all of these yet but a few tests are there). the output from this will go into out/aligned (similar to the manual alignment that you already did, but in CoNLL-U form).
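A rough sketch of the two small pieces that step would need, under stated assumptions: each row from out/jdsw is a `(key, annotation, note)` triple with the note in the third column, and the annotation is stored in MISC under a key called `JDSW`. Both helper names and the `JDSW` key are hypothetical. The only part taken from the CoNLL-U format itself is that a token line has 10 tab-separated columns, MISC is the last one, `_` marks an empty field, and multiple entries are joined with `|`.

```python
def source_rows(rows):
    """Keep only the (key, annotation) pairs whose note is 'source'."""
    return [(key, annotation) for key, annotation, note in rows if note == "source"]


def add_misc(token_line, key, value):
    """Return a CoNLL-U token line with key=value appended to its MISC column."""
    cols = token_line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # "_" is the CoNLL-U placeholder for an empty field.
    entries = [] if cols[9] == "_" else cols[9].split("|")
    entries.append(f"{key}={value}")
    cols[9] = "|".join(entries)
    return "\t".join(cols)


# Placeholder token line and annotation, for illustration only.
line = "1\t學\t學\tVERB\t_\t_\t0\troot\t_\t_"
print(add_misc(line, "JDSW", "placeholder"))
```

Finding which token a given annotation belongs to is the alignment problem itself and is left out here. An annotation that contains a tab or `|` would need escaping before it goes into MISC.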

GDRom commented 2 years ago

Sounds perfect. I'll take the time this weekend and/or early next week to take a deep dive into this, and will keep you posted on how well that algorithm does.

Just following up on this: "except the laozi, which we don't have an sbck edition of" -- might you have overlooked this SBCK edition of it?

thatbudakguy commented 2 years ago

oh — indeed I did! the script that converts it was looking for a file named something like 001.txt, so it skipped over the one we have called 1.txt. hence no cleaned version of the laozi. I'll fix that, thank you!
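One way to make that lookup tolerant of padding, as a sketch only: read the chapter number out of the file stem instead of matching a fixed pattern, so `001.txt` and `1.txt` are treated the same. The helper name `chapter_number` is hypothetical and the real conversion script may be organized differently.

```python
from pathlib import Path


def chapter_number(path):
    """Return the chapter number encoded in a file name, or None if there is none."""
    stem = Path(path).stem
    # int() ignores leading zeros, so "001" and "1" both give 1.
    return int(stem) if stem.isdecimal() else None


print(chapter_number("001.txt"))    # 1
print(chapter_number("1.txt"))      # 1
print(chapter_number("notes.txt"))  # None
```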

GDRom commented 2 years ago

That must have been my mistake -- sorry about the misnamed file there!

thatbudakguy commented 2 years ago

note to self: it's worth trying the Needleman-Wunsch global alignment algorithm here, just to see how it performs vs our homegrown one.
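For reference, a small textbook-style Needleman-Wunsch sketch that aligns two strings character by character. The scoring values (match +1, mismatch -1, gap -1) and the `-` gap symbol are arbitrary choices for illustration, not anything tuned for the JDSW.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("學而時習之", "學時習之"))
# ('學而時習之', '學-時習之', 3)
```

The table is O(len(a) × len(b)) in both time and memory, so running this over a whole chapter against the full JDSW text may need chunking or a library implementation.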