dhlab-epfl / dhSegment

Generic framework for historical document processing
https://dhlab-epfl.github.io/dhSegment
GNU General Public License v3.0

Original Training image with XML labels to extract data from documents #17

Closed Omua closed 5 years ago

Omua commented 5 years ago

Hi,

I'm working on a page layout analysis and information extraction system, and I found that dhSegment might work well for this task. However, I don't know whether dhSegment can be trained with XML-based annotations (TextRegion, SeparatorRegion, TableRegion, ImageRegion, points defining the bounds of each region...) in addition to the RGB-style region definitions. I see on the main page of the project that there is a Layout Analysis example under the Use Cases section. That is the case that most resembles the one I want to implement. Also, I want to extract text from the detected regions.

How can I do that? Can I still use dhSegment, or do I have to implement my own detector?

Thanks.

Regards.

solivr commented 5 years ago

Hi, dhSegment takes as input a pair of images: the original image and a labelled image where the regions you want to extract are annotated with different 'colors'. It is not restricted to any annotation format, as long as you are able to convert it to the above-mentioned labelled image. So to answer your question: no, you cannot feed XML files directly to dhSegment, but if you generate the corresponding labelled images, then you will be able to train a model. There are already some implemented functions to parse PAGE-XML files and generate the corresponding masks in the PAGE.py file. You can also have a look at the exps/diva/utils.py file, which may give you some hints on how to adapt it to your specific experiment (the Layout Analysis example is the DIVA experiment with DIVA-HisDB data).
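To illustrate the idea, here is a minimal, hypothetical sketch of the XML-to-mask conversion. It fills each region by its bounding box for simplicity (the real PAGE.py code fills the exact polygon, typically via cv2.fillPoly), and the class colours and XML snippet are arbitrary examples, not dhSegment defaults:

```python
import xml.etree.ElementTree as ET
import numpy as np

# Arbitrary colour per region type, chosen for this example only.
CLASS_COLOURS = {
    "TextRegion": (255, 0, 0),
    "TableRegion": (0, 255, 0),
    "ImageRegion": (0, 0, 255),
}

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def parse_points(points_attr):
    """Parse a PAGE-XML 'points' string like '10,10 50,10 50,40 10,40'."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]


def page_xml_to_mask(xml_string, height, width):
    """Build an RGB label mask where each region is painted in its class colour.

    Each polygon is filled by its bounding box here to stay dependency-free;
    a real conversion should rasterize the actual polygon."""
    root = ET.fromstring(xml_string)
    mask = np.zeros((height, width, 3), dtype=np.uint8)
    for tag, colour in CLASS_COLOURS.items():
        for region in root.iter("{%s}%s" % (PAGE_NS, tag)):
            coords = region.find("{%s}Coords" % PAGE_NS)
            pts = parse_points(coords.get("points"))
            xs, ys = zip(*pts)
            mask[min(ys):max(ys) + 1, min(xs):max(xs) + 1] = colour
    return mask


# Toy PAGE-XML snippet with a single text region.
example = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageWidth="100" imageHeight="60">
    <TextRegion id="r1"><Coords points="10,10 50,10 50,40 10,40"/></TextRegion>
  </Page>
</PcGts>"""

mask = page_xml_to_mask(example, height=60, width=100)
```

Saving such a mask alongside the original image gives you the (image, label) pair dhSegment expects for training.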

Omua commented 5 years ago

Ok, thanks! Right now I'm using the PAGE.py functions to parse the XML files I currently have and convert them to the labelled images that dhSegment takes as input. After that, I should be able to train the system to recognize the types of documents I need to analyze.
But what about extracting the text so I can post-process it and analyze what is written? Is that possible?

Omua commented 5 years ago

After thinking about my last question, I believe I have the solution.
After training, dhSegment's output will be the page regions classified by different colours. I then have to analyze that image. Knowing beforehand which colour corresponds to which element, I can take the coordinates of each region and extract it from the original image. Only then can I analyze it properly, because I know exactly what type of information is in that region (table, image, text...).

Aminfaraji commented 2 years ago

How can I train dhSegment using my own dataset?