dhlab-epfl / dhSegment

Generic framework for historical document processing
https://dhlab-epfl.github.com/dhSegment
GNU General Public License v3.0

Need a short guide of layout detection and line detection #49

Closed. longwall closed this 4 years ago

longwall commented 4 years ago

Hello, I have a large collection of scans of written text in table forms with a complex layout structure and only the vertical borders printed. My plan is to segment the table rows cell by cell, run line detection inside each cell, and then attempt recognition. I went through the dhSegment demo and it's OK, but I ran into problems with the operations. Could you please provide examples of the use cases described in the overview https://dhlab-epfl.github.io/dhSegment/ ? I'm ready to label a training dataset from my collection but can't get started. Is there a notebook or video guide? One more question, about the READ-BAD dataset that was suggested in a couple of issue discussions: I see the article PDF on arXiv.org but couldn't find a link to download the image collection. What did I miss?

solivr commented 4 years ago

Hello,

Sorry for the long-delayed answer...

Here is a Jupyter notebook we've made to show how to go through page and text line extraction. This notebook on dropcap extraction may also help. Regarding the READ-BAD dataset, you can find it on Zenodo: https://zenodo.org/record/1491441

longwall commented 4 years ago

Hello, thanks for the reply, and thanks for the tutorial! I'll try it very soon.

As I understand it, it can vectorize lines of written text? Is it related to your adjacent project ARU-NET, which underlines written text?

longwall commented 4 years ago

I've just played with a copy of your notebook. It's really cool, thank you very much! I'll move on to line and word segmentation, finding their convex (maybe rectangular) envelope. Doesn't look so hard at first glance.. )
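
For anyone following the same path, here is a minimal sketch of the envelope step mentioned above: computing a convex hull and an axis-aligned bounding rectangle for one segmented region. It is not taken from dhSegment or its notebooks. It assumes the region is already available as a small binary mask (a list of rows of 0/1 values), and the helper names `mask_to_points`, `convex_hull` and `bounding_box` are hypothetical. The hull uses Andrew's monotone chain algorithm in pure Python.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates


def mask_to_points(mask: List[List[int]]) -> List[Point]:
    """Collect the (x, y) coordinates of all non-zero pixels in a binary mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def cross(o: Point, a: Point, b: Point) -> int:
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Convex hull via Andrew's monotone chain; collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        # Pop while the last two points and p do not make a strict left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def bounding_box(points: List[Point]) -> Tuple[int, int, int, int]:
    """Axis-aligned rectangular envelope as (x_min, y_min, x_max, y_max)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Toy mask standing in for one segmented word or line region.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
pts = mask_to_points(mask)
hull = convex_hull(pts)
box = bounding_box(pts)
```

On real scans the mask would first have to be split into connected components, one per line or word, and a library routine such as OpenCV's `cv2.convexHull` or `cv2.boundingRect` is likely a better fit than this pure-Python version.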