4lex4 / scantailor-advanced

ScanTailor Advanced is the version that merges the features of the ScanTailor Featured and ScanTailor Enhanced versions, brings new ones and fixes.
GNU General Public License v3.0

Content anchoring #79

Open beefeater7 opened 5 years ago

beefeater7 commented 5 years ago

It occurred to me that we have little in the way of content alignment as it is. The contents of scanned and processed books shift around between page turns. Probably not hard to fix in Photoshop, but an automated solution is always nice.

One useful property of the printed page is the static element: be it the book title, the chapter name, or, most commonly, the page number, these landmarks serve as ideal reference coordinates for absolute content positioning on each page. If we are aiming for consistency, this is a step in the right direction.

Another way of anchoring the content may relate to the dewarping functionality. In documents containing justified blocks of text, the margins provide a clue to positioning as well as proportions throughout. We can assume these paragraphs share the same width from page to page.

So, to recap, these are potential points to "nail down", as constant coordinates for interpage processing:

The more anchors, the more precise the transformation. I think a simple height-width resize would suffice, given a preceding dewarp.
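To make the "height-width resize from anchors" idea concrete, here is a minimal Python sketch. It assumes each page already has a few anchor points matched to the same anchors on a reference page, and it fits an independent scale and offset per axis by least squares. The names `fit_axis`, `fit_resize` and `apply_resize` are hypothetical and are not part of ScanTailor.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def fit_axis(src: Sequence[float], dst: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of dst ~= scale * src + offset along one axis."""
    n = len(src)
    if n == 0 or n != len(dst):
        raise ValueError("need the same non-zero number of anchors on both pages")
    mean_s = sum(src) / n
    mean_d = sum(dst) / n
    var = sum((s - mean_s) ** 2 for s in src)
    if var == 0.0:
        # All anchors share one coordinate: only a shift can be estimated.
        return 1.0, mean_d - mean_s
    cov = sum((s - mean_s) * (d - mean_d) for s, d in zip(src, dst))
    scale = cov / var
    return scale, mean_d - scale * mean_s


def fit_resize(page: List[Point], reference: List[Point]) -> Tuple[float, float, float, float]:
    """Return (sx, tx, sy, ty) mapping page anchors onto reference anchors."""
    sx, tx = fit_axis([p[0] for p in page], [r[0] for r in reference])
    sy, ty = fit_axis([p[1] for p in page], [r[1] for r in reference])
    return sx, tx, sy, ty


def apply_resize(point: Point, params: Tuple[float, float, float, float]) -> Point:
    """Apply the fitted per-axis scale and offset to one point."""
    sx, tx, sy, ty = params
    return sx * point[0] + tx, sy * point[1] + ty
```

With more anchors than unknowns, the fit averages out small detection errors, which is one way to read "the more anchors, the more precise the transformation".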

[Image: Anchoring]

Just shooting this out there.

Piolie commented 5 years ago

I agree with you that ST is lacking an automatic way of aligning the content in the final page. For my use case, the best would be to fix the page number using the output.

At the moment it is possible to semi-automatically align the pages by creating guides and pressing Ctrl+Shift+double LMB around the page number. However, the algorithm works on the unprocessed page image, whose artifacts can cause inconsistencies that require manual adjustment.

Since all page numbers end up in the upper (lower) left (right) corner of each page, I would really appreciate a way of aligning the content in that area between all pages. Don't know how hard it would be to implement.
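As a rough illustration of aligning on the page number in the output, the Python sketch below takes the detected centre of the page number on each output page and computes the shift that moves it onto a common target, here the median position. Detection itself is assumed to have happened elsewhere, and `page_number_shifts` is a hypothetical name, not a ScanTailor function. Odd and even pages would presumably be handled as two separate groups.

```python
from statistics import median
from typing import List, Tuple

Point = Tuple[float, float]


def page_number_shifts(centers: List[Point]) -> List[Point]:
    """Return the (dx, dy) shift per page that moves its page number
    onto the median page-number position of the group."""
    if not centers:
        raise ValueError("no page-number positions given")
    target_x = median(x for x, _ in centers)
    target_y = median(y for _, y in centers)
    return [(target_x - x, target_y - y) for x, y in centers]
```

The median is used rather than the mean so that a few badly detected pages do not drag the target position.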

Mister-Teatime commented 4 years ago

Oh yes, this would be nice. And I very much agree with your statement that dewarping would need to happen first (see issue #85). If that is done, then there could be a relatively quick way to find the left and right edges of the content and align to them (either left or right, depending on odd/even page numbers, or both, scaling the content/resolution accordingly). Up/down might be more tricky with some layouts, as there are books where the first page of a chapter has a different layout (the heading might not be at the top border), and the last page of a chapter may not go all the way to the bottom -- but those could be handled by the user.
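A minimal Python sketch of the left/right edge idea, assuming the page is already dewarped and binarized into rows of 0/1 values (1 = ink). It finds the content edges from a column ink profile and derives the horizontal scale and offset that map them onto target edges. The function names and the `min_ink` threshold are hypothetical.

```python
from typing import List, Optional, Tuple


def content_edges(bitmap: List[List[int]], min_ink: int = 1) -> Optional[Tuple[int, int]]:
    """Return (left, right) content columns, right exclusive, or None if blank."""
    if not bitmap or not bitmap[0]:
        return None
    width = len(bitmap[0])
    # Amount of ink in each column of the page.
    column_ink = [sum(row[x] for row in bitmap) for x in range(width)]
    inked = [x for x, ink in enumerate(column_ink) if ink >= min_ink]
    if not inked:
        return None
    return inked[0], inked[-1] + 1


def horizontal_fit(edges: Tuple[int, int], target_left: float, target_right: float) -> Tuple[float, float]:
    """Return (scale, offset) so that x_out = scale * x_in + offset maps
    the detected edges onto the target edges."""
    left, right = edges
    scale = (target_right - target_left) / (right - left)
    return scale, target_left - scale * left
```

Aligning to only one side (left on one page parity, right on the other) would keep the scale at 1 and use just the offset. Raising `min_ink` is one way to ignore isolated specks near the margins.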

One alternative implementation would be to assume that the dewarp box spans the whole page and marks its corners -- that provides one rectangle per page. The location of content within that rectangle could then be kept as is, and the sizes of the various rectangles adjusted to be equal in the final output. This would have the advantage that no additional work is necessary after dewarping, but the disadvantage that dewarping (or at least marking/finding the page corners) would be required.
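The rectangle-per-page variant could be sketched in Python as follows, assuming each page's dewarp box has already been reduced to a width and height. Every page gets the scale factors that bring its rectangle to one common size, here the median of all pages. `uniform_rect_scales` is a hypothetical name, and the choice of the median as target is an assumption.

```python
from statistics import median
from typing import List, Tuple

Size = Tuple[float, float]


def uniform_rect_scales(sizes: List[Size]) -> Tuple[Size, List[Size]]:
    """Return the common target size and, per page, the (sx, sy) factors
    that scale its rectangle to that size."""
    if not sizes:
        raise ValueError("no page rectangles given")
    target_w = median(w for w, _ in sizes)
    target_h = median(h for _, h in sizes)
    scales = [(target_w / w, target_h / h) for w, h in sizes]
    return (target_w, target_h), scales
```

Because content keeps its relative position inside the rectangle, scaling the rectangle also scales the content with it and no separate content detection is needed.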

Fancy way to unify this: Dewarping could leave "markers" on the page corners, and so could content selection etc. The user could then select whether they want to use the existing markers (and which ones) to align pages, or to create new ones (e.g. on page numbers etc.) manually. This would be similar to the concept used by Hugin to align photos in a panorama, except with fewer markers, and the markers being "cheap" to produce.

zvezdochiot commented 4 years ago

@beefeater7 said:

One useful property of the printed page is the static element: be it the book title, chapter name, or the most common: page number

An interesting thought: use the position of the page number to calculate the page margins. That is, fix the position of the page number on all pages and calculate the margins from this.
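One possible reading of this in Python, as a sketch under stated assumptions: the content box and the page-number centre are known in the same source coordinates, and the output page has a fixed size and a fixed target position for the page number. The margins then follow from the shift that puts the number on its target. `margins_from_page_number` is a hypothetical name.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Point = Tuple[float, float]


def margins_from_page_number(page_size: Point, content_box: Box,
                             number_center: Point, target_center: Point) -> Box:
    """Return (left, top, right, bottom) margins that place the page
    number at target_center on an output page of page_size."""
    page_w, page_h = page_size
    x0, y0, x1, y1 = content_box
    # Shift that moves the detected page number onto the fixed target.
    dx = target_center[0] - number_center[0]
    dy = target_center[1] - number_center[1]
    return x0 + dx, y0 + dy, page_w - (x1 + dx), page_h - (y1 + dy)
```

A negative margin in the result would mean the content does not fit on the output page at that position and needs scaling or a larger page.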