blindpandas / bookworm

The Universally Accessible document Reader
https://getbookworm.com

Remove page IDs when saving image to text or scanning to text using OCR #128

Open DraganRatkovich opened 2 years ago

DraganRatkovich commented 2 years ago

Is your feature request related to a problem? Please describe.

When saving an image to a text file, or when choosing the Scan to Text File option to extract text from a scanned book using OCR, Bookworm adds "Page 1", "Page 2" identifiers to the text file. These are useless in this case: they do not help when the text is pasted into a Word document, because Word paginates the text on its own and applies whatever font, paragraph style, and line spacing the user requires. Writing "Page 1", "Page 2" and an extra page break character into a text file therefore serves no purpose. No text exporters, at least not the popular ones such as MS Word and Adobe PDF, do this.

Describe the solution you'd like

Simply extract pure text from a PDF file or image without adding the word "Page", the page numbers, or a page break symbol. @mush42 It would be very useful if this were fixed soon, because the usefulness of saving a PDF or Word document as a text file would increase many times over, and the text would be clean and smooth.

mush42 commented 2 years ago

Hello @DraganRatkovich

I may agree with removing the page numbering, but the page break character is semantically important, especially for OCR results.

Anyhow, I'll make text exporting customizable. A dialog box will be shown when exporting to plain text or scanning to a text file.

Best, Musharraf

DraganRatkovich commented 2 years ago

@mush42 Yes, it would be nice if checkboxes appeared during the save process to choose whether to remove or keep page break symbols, etc.
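
For illustration only, the checkbox idea discussed above could be sketched roughly as follows. This is not Bookworm's actual code: `TextExportOptions`, `export_pages`, and both option names are hypothetical, and the defaults simply reflect the positions stated in this thread (page labels off, page break character kept).

```python
from dataclasses import dataclass
from typing import Iterable

FORM_FEED = "\f"  # ASCII form feed, commonly used as a page break in plain text


@dataclass
class TextExportOptions:
    # Hypothetical option names; each would map to one checkbox in the dialog.
    include_page_labels: bool = False  # write "Page N" before each page
    include_page_breaks: bool = True   # separate pages with a form feed


def export_pages(pages: Iterable[str], options: TextExportOptions) -> str:
    """Join per-page text into one string according to the chosen options."""
    chunks = []
    for number, text in enumerate(pages, start=1):
        body = text.strip()
        if options.include_page_labels:
            body = f"Page {number}\n{body}"
        chunks.append(body)
    # Without page breaks, pages are separated by a single blank line.
    separator = f"\n{FORM_FEED}\n" if options.include_page_breaks else "\n\n"
    return separator.join(chunks)
```

With both options switched off, the result is the plain text of all pages separated by blank lines, which is the output requested in this issue.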

DraganRatkovich commented 2 years ago

Hello @mush42 do you have any news on this issue?

mush42 commented 2 years ago

@DraganRatkovich Yes, the fix is coming.

DraganRatkovich commented 2 years ago

@mush42 Also, I didn't change the title, but please consider offering these options when saving any document in .txt format, for example from .pdf, .docx, etc., not only when saving an image or scanning to text using OCR.