OCR: bug in certain layouts #136

Open eroux opened 2 years ago

eroux commented 2 years ago

This is something we can live with, and it is probably very difficult to fix automatically. When a pecha has an illustration in the middle (as in this image), the OCR treats the text to the left of the illustration as one paragraph and the text to the right as another. In fact it is a single long paragraph in which each line is split in two by the illustration. Here's the result for this specific page:

/oM swa sti/_skyabs 'gro yan lag drug pa 'di la don bzhi las/_dang po
yan lag drug pa/_gnyis pa 'gyur gyi phyag dkon mchog gsum la phyag 'tshal
dang |_gzhung don dngos bshad pa'o/_/dang po ni/_skyabs 'gro yan lag drug pa bshad//
yul dus dgos pa dang /_/phan yon bslab bya blang ba rnams/_/rtsa ba sdom
dang dkon mchog gsum gyi yon tan dran pas skyabs su 'gro/_/gnas gsum bstan
mtshan gyi don/_rgya gar skad du/_Sha TAng+ga sha ra NaM/_bod skad du/_skyabs 'gro
lo/_/gsum pa gzhung gi don la gnyis/_brtsam par dam bca' ba bsdus don dang bcas pa
gnyis pa la gnyis/_mdor bstan pa dang /_rgyas par bshad pa'o/_/dang po ni/_cis 'gro
du bshad pa yin/_/gnyis pa rgyas par bshad pa la rim par/_'khor ba la/_'jigs
pa'i yul de la/_/'di nas byang chub thob bar ro/_/mu stegs spyod las log
rtsa ba'i chos sde
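
For discussion, here is a minimal sketch of one possible post-processing step, assuming the OCR engine exposes a bounding box for each line fragment (not confirmed for the engine used here). The `OcrLine` type, the `merge_split_lines` function, and the `tolerance` parameter are all hypothetical names introduced for illustration. The idea is to group fragments whose vertical centers nearly coincide, then join each group from left to right, so that the two halves of a line split by the illustration are reunited.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical line fragment with a bounding box in page coordinates."""
    text: str
    x: float       # left edge of the fragment
    y: float       # top edge of the fragment
    height: float  # height of the fragment


def merge_split_lines(lines, tolerance=0.5):
    """Join fragments that sit on the same visual row, left to right.

    Two fragments share a row when their vertical centers differ by at
    most `tolerance` times the smaller of the two heights. This is an
    assumed heuristic, not something the OCR engine guarantees.
    """
    def center(line):
        return line.y + line.height / 2

    rows = []  # each row is a list of fragments on the same visual line
    for line in sorted(lines, key=center):
        if rows:
            ref = rows[-1][0]  # first fragment of the most recent row
            limit = tolerance * min(ref.height, line.height)
            if abs(center(line) - center(ref)) <= limit:
                rows[-1].append(line)
                continue
        rows.append([line])

    # Within each row, order fragments by horizontal position and join them.
    return [
        " ".join(frag.text for frag in sorted(row, key=lambda f: f.x))
        for row in rows
    ]


# Usage: fragments arrive column by column, as in the output above.
fragments = [
    OcrLine("left 1", x=0, y=0, height=10),
    OcrLine("left 2", x=0, y=20, height=10),
    OcrLine("right 1", x=100, y=1, height=10),
    OcrLine("right 2", x=100, y=21, height=10),
]
merged = merge_split_lines(fragments)
```

This only helps when the left and right halves are well aligned vertically. Skewed scans or lines of uneven height would need a more careful grouping, and joining with a space is itself an assumption about the transliteration.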