OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

cut multiple line segments #262

Open alexander-winkler opened 3 years ago

alexander-winkler commented 3 years ago

Hello!

This is a small feature request originating from my work with OCR4all/LAREX:

Line segmentation isn't always perfect. For some reason (maybe this can be avoided by tweaking the preferences) several lines are often merged into a single line segment, for example:

[Image: cut_multiple_line_region]

As this happens rather often, drawing new rectangles and adding them to the reading order can become time-consuming. Could you add something like the cut line function from the Segments mode to the Lines mode as well?

Possible behaviour:

  1. Select cut tool
  2. Select multi-line TextLine-Element
  3. Draw one or multiple more or less horizontal lines that cut the entire TextLine-Element
  4. Add newly created elements to reading order
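The steps above can be sketched in a few lines. This is only an illustration of the requested behaviour, not LAREX code: it assumes perfectly horizontal cuts given as y values, image coordinates with y growing downwards, and a simple polygon per TextLine. The names `clip_half_plane` and `cut_line_polygon` are made up for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def clip_half_plane(polygon: List[Point], y_cut: float, keep_above: bool) -> List[Point]:
    """Keep the part of `polygon` with y <= y_cut (keep_above) or y >= y_cut.

    Single-edge Sutherland-Hodgman clipping; "above" means smaller y,
    as in image coordinates.
    """
    def inside(p: Point) -> bool:
        return p[1] <= y_cut if keep_above else p[1] >= y_cut

    result: List[Point] = []
    for i, current in enumerate(polygon):
        previous = polygon[i - 1]
        if inside(current) != inside(previous):
            # The edge crosses the cut, so add the intersection point.
            t = (y_cut - previous[1]) / (current[1] - previous[1])
            result.append((previous[0] + t * (current[0] - previous[0]), y_cut))
        if inside(current):
            result.append(current)
    return result


def cut_line_polygon(polygon: List[Point], cuts: List[float]) -> List[List[Point]]:
    """Split one multi-line polygon at each horizontal cut, top to bottom."""
    pieces: List[List[Point]] = []
    remainder = polygon
    for y_cut in sorted(cuts):
        upper = clip_half_plane(remainder, y_cut, keep_above=True)
        remainder = clip_half_plane(remainder, y_cut, keep_above=False)
        if len(upper) >= 3:
            pieces.append(upper)
    if len(remainder) >= 3:
        pieces.append(remainder)
    return pieces


# A TextLine element that wrongly covers three printed lines.
merged_line = [(0, 0), (100, 0), (100, 60), (0, 60)]
pieces = cut_line_polygon(merged_line, cuts=[20, 40])
```

Because the pieces come out from top to bottom, they could be added to the reading order in that sequence (step 4). Cuts that run exactly along a polygon edge, or slightly slanted cuts, are not handled in this sketch.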

A similar function for vertical segmentation would be useful as well, but reorganizing the reading order is definitely more difficult.

Thank you!

maxnth commented 3 years ago

Hi,

> Line segmentation isn't always perfect. For some reason (maybe this can be avoided by tweaking the preferences) several lines are often merged into a single line segment, for example:

Line segmentation in OCR4all isn't really implemented optimally at the moment. As you said, one can often improve the results by tweaking parameters, but this doesn't always work. The upcoming release of OCR4all will feature refactored line segmentation code, which will hopefully improve the results.

> As this happens rather often, drawing new rectangles and adding them to the reading order can become time-consuming, so I was wondering if you could add something like the cut line function in the Segments mode to the Lines mode as well.

Would the subtract rectangle / subtract polygon work for your use case (see video)?

https://user-images.githubusercontent.com/33344081/123290648-bc956600-d511-11eb-8516-8f300eda3a81.mp4

I just had a quick look at adding a cut-from-line feature (instead of rectangle / polygon) to LAREX, but Paper.js doesn't seem to like intersecting / dividing open paths like lines with closed paths like polygons. (The cut function in Edit and Segments doesn't work "on the fly" via Paper.js but goes through the backend.) I'm probably just missing something, so this might still get added as soon as I figure it out.

> Add newly created elements to reading order

Great idea, I guess adding a toggle for that would make a lot of sense for the current subtraction features as well.

> A similar function for vertical segmentation would be useful as well, but reorganizing the reading order is definitely more difficult.

Ordering the newly created segments (through subtraction or division) by lowest x or y coordinate (determined by the state of the added toggle) would probably work for most vertical / horizontal segmentation, wouldn't it?
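Roughly what I have in mind, as a sketch only. None of this exists in LAREX: the function names, the `vertical_cut` flag standing in for the toggle, and the segment ids are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def order_new_segments(segments: Dict[str, List[Point]], vertical_cut: bool) -> List[str]:
    """Sort segment ids by their lowest x (vertical cut) or lowest y (horizontal cut)."""
    axis = 0 if vertical_cut else 1
    return sorted(segments, key=lambda seg_id: min(p[axis] for p in segments[seg_id]))


def replace_in_reading_order(reading_order: List[str], original_id: str,
                             new_ids: List[str]) -> List[str]:
    """Put the new segments where the original element used to be."""
    index = reading_order.index(original_id)
    return reading_order[:index] + new_ids + reading_order[index + 1:]


# Two pieces created by a horizontal cut through the hypothetical line "l2".
new_segments = {
    "l2_b": [(0, 20), (100, 20), (100, 40), (0, 40)],
    "l2_a": [(0, 0), (100, 0), (100, 20), (0, 20)],
}
ordered = order_new_segments(new_segments, vertical_cut=False)
updated = replace_in_reading_order(["l1", "l2", "l3"], "l2", ordered)
```

Sorting by a single coordinate assumes the pieces do not overlap along that axis, which should hold for pieces produced by one straight cut.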

alexander-winkler commented 3 years ago

Hello!

> Would the subtract rectangle / subtract polygon work for your use case (see video)?

This could work for a number of use cases, I guess. Thanks for the idea! One will probably have to adjust the two resulting polygons, but that is not terribly cumbersome. Maybe one could add a polygon reduce function. If the drawn path is not closed, you could automatically add, for each point (x, y), a second point (x, y - 1px). This turns the line into a very thin closed polygon that can be subtracted, thus mimicking the cut function that Paper.js doesn't offer for open paths.
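To illustrate the x/y-1px idea, here is a minimal sketch. It is plain Python rather than Paper.js code, and `thicken_polyline` is a name I made up. The result is an ordinary closed polygon that could then be passed to the existing subtract polygon feature.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def thicken_polyline(points: List[Point], thickness: float = 1.0) -> List[Point]:
    """Turn an open polyline into a thin closed polygon.

    Walk along the drawn points, then back along copies shifted up by
    `thickness` pixels (y decreases upwards in image coordinates).
    """
    shifted_back = [(x, y - thickness) for x, y in reversed(points)]
    return list(points) + shifted_back


# A roughly horizontal cut drawn by the user with three clicks.
cut_polygon = thicken_polyline([(0, 10), (50, 12), (100, 10)])
```

Subtracting such a sliver removes a 1px strip from the TextLine, so the two resulting polygons would be separated by a hair-thin gap rather than sharing an edge.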

> > Add newly created elements to reading order

> Great idea, I guess adding a toggle for that would make a lot of sense for the current subtraction features as well.

Very much in favour of this idea!

> > A similar function for vertical segmentation would be useful as well, but reorganizing the reading order is definitely more difficult.

> Ordering the newly created segments (through subtraction or division) by lowest x or y coordinate (determined by the state of the added toggle) would probably work for most vertical / horizontal segmentation, wouldn't it?

I'm not sure how this would work out on a skewed page with a two-column layout. In any case, one could also think of a way to move multiple lines in the reading order as a batch (select a group of lines, move them to a specific position in the reading order). More generally, however, I would advocate for a "redo reading order" function. When I add several new lines, it would be easier to have the reading order detected again than to add the new lines manually.
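The batch move could behave roughly like this. It is just a sketch of the behaviour I am picturing; `move_batch` and the line ids are invented for the example and do not refer to anything in LAREX.

```python
from typing import List


def move_batch(reading_order: List[str], selected: List[str], position: int) -> List[str]:
    """Move the selected lines, as one block, to `position`.

    `position` is an index into the reading order with the selected
    lines already taken out; the block keeps its internal order.
    """
    chosen = set(selected)
    moved = [line_id for line_id in reading_order if line_id in chosen]
    rest = [line_id for line_id in reading_order if line_id not in chosen]
    position = max(0, min(position, len(rest)))  # clamp to a valid index
    return rest[:position] + moved + rest[position:]


# Two newly drawn lines were appended at the end but belong after "l1".
new_order = move_batch(["l1", "l2", "l3", "l4", "l5"], ["l4", "l5"], position=1)
```

A "redo reading order" function would make this unnecessary in many cases, but a batch move would still help where the automatic order is wrong.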