direct-phonology / jdsw

Parsing the "Jingdian Shiwen" with spaCy
MIT License

fix missing pages in the jdsw #14

Closed: thatbudakguy closed this issue 2 years ago

thatbudakguy commented 2 years ago

problem

in several cases, the text of the jingdian shiwen from kanripo (KR1g0003) is missing entire pages, which can cause adjacent chapters in the text to be "conjoined" (see the example below).

we probably need to fix this at the source (https://github.com/kanripo/KR1g0003) by opening a pull request that manually inserts the correct text. for reference, there are scans of the SBCK edition on Wikimedia, and CTEXT has a version in simplified characters.

example

(古猛/反)仲夏(戸嫁反/下同)謂食(音/嗣)齊(才細反/下皆同)頒爵(音/班)必¶ 當(丁浪/反)媒氏(音/梅)而取(音娶本/又作娶)稽士(古兮/反)之烖(音/災)妖¶ 孽(又作蠥魚列反妖又作祅說文云衣服歌謡/草木之怪謂之祅禽獸蟲蝗之怪謂之蠥)螟(亡丁/反)螽¶ ()¶

¶ (苦浪反又音/剛又户剛反)與茵(音/因)縮二(所六/反)以犢(音獨本/亦作特)相朝(直/遥)¶ (反下及/注同)灌用(古亂反/注同)鬱鬯(丑亮/反)脯醢(上音甫/下音海)繁纓¶

# checklist

### liji
- [x] 009/010
- [x] 028/029
- [x] 031/032
- [x] 036/037
- [x] 048/049

### maoshi
- [x] 008/009
- [x] 014/015
- [x] 016/017/018
- [x] 020/021

### shangshu
- [x] 020/021/022
- [x] 029/030/031
- [x] 038/039
- [x] 044/045
- [x] 057/058

### yili
- [x] 009/010
- [x] 014/015

### zhouyi
- [x] 043/044
- [x] 053/054/055

### zhuangzi
- [x] 001/002
- [x] 006/007
- [x] 017/018

### zuozhuan
- [x] 026/027
- [x] 028/029

### bugs internal to txt files
- [x] maoshi 003
- [x] maoshi 010
- [x] maoshi 012
- [x] maoshi 019
- [x] maoshi 020
- [x] maoshi 024
- [x] maoshi 025 (3)
- [x] maoshi 026
- [x] maoshi 030
- [x] zhouli 001
- [x] zhouli 002
- [x] zhouli 003 (2)
- [x] zhouli 004
- [x] zhouli 005
- [x] zhouli 006 (2)
- [x] yili (4)
- [x] liji (16)
- [x] zuozhuan (23)
- [x] gongyang (2)
- [x] guliang (5)
- [x] zhuangzi (10)

GDRom commented 2 years ago

Additionally: check for missing pages within individual JDSW text files, not only at the boundaries between files.
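A check like this could be partly automated. The sketch below is only a rough heuristic, not the method actually used in this issue: it assumes that an empty annotation such as `()` (as seen at the end of the first example excerpt) signals a spot where text was lost. That assumption comes from the example alone, and the function name `find_suspect_lines` is hypothetical. Anything it flags would still need to be compared against the SBCK scans by hand.

```python
import re

# Assumption: an annotation with nothing inside it, e.g. "()", marks a place
# where text is missing. This is inferred from the example in this issue,
# not from any kanripo documentation. Full-width parentheses are matched too.
EMPTY_ANNOTATION = re.compile(r"[(（]\s*[)）]")


def find_suspect_lines(text):
    """Return (line_number, line) pairs for lines containing an empty annotation."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if EMPTY_ANNOTATION.search(line):
            hits.append((number, line.strip()))
    return hits


# Small sample modelled on the excerpt in the issue description.
sample = "螟(亡丁/反)螽¶\n()¶\n(苦浪反又音/剛又户剛反)與茵(音/因)¶"

for number, line in find_suspect_lines(sample):
    print(number, line)
```

This flags candidates only; a gap that leaves no empty annotation behind would not be caught.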

GDRom commented 2 years ago

Note: the missing-pages issue is more substantial than initially assumed, especially within individual files. Will update the list as I go through each JDSW file.

GDRom commented 2 years ago

All missing pages in relevant sections added.