flxzt / rnote

Sketch and take handwritten notes.
https://rnote.flxzt.net
GNU General Public License v3.0
7.88k stars 267 forks

Retain text when importing PDFs #668

Open chron-isch opened 1 year ago

chron-isch commented 1 year ago

Hey, I just stumbled upon rnote a couple of weeks ago and it's an amazing project. Thank you for the work!

Is your feature request related to a problem? Please describe.
I went through great pain to scan and OCR almost every document/script/lecture note I own to make them searchable. I usually take notes or highlight parts of those documents for later, but the moment I import them, they lose all OCR information/text and become nothing more than a fancy picture. This is especially annoying with long lecture notes, since I can't search them anymore.

Describe the solution you'd like
Do not rasterize PDFs, or maybe use the PDF as a background (or just reference the file, like xournalpp does) and somehow merge both notes and PDF on export? Any way that keeps the original information contained within the file alive is fine with me.

Describe alternatives you've considered
I considered just sending the exported PDFs through OCR again, but my handwriting/highlighting/doodling makes OCR more difficult and error-prone.

Thank you!

LeSnake04 commented 1 year ago

related to #153

I think it makes sense to discuss the feature here instead, since #153 is already pretty bloated.

bamonroe commented 1 year ago

This just bit me today. I presumed that importing a PDF would keep it as a vector graphic. However, the PDF was rasterized and a 2MB document became a 50MB document after exporting. Until this is fixed, I have to go back to Xournalpp; no one wants to email 50MB PDFs. I think there are lots of good suggestions in the other issue, such as using the imported PDF as a background. It would be great to see some traction on this.

flxzt commented 1 year ago

#761 restores exporting Pdf pages in a vectorized format (for Pdf and Svg export). Retaining Pdf text is a bit more complicated, but I'll look into whether it can be done somehow.

A dedicated Pdf annotation mode is something that I would like as well, but I will track the progress for that feature in the other issue.

flxzt commented 9 months ago

The reason why it currently is not supported is that the pdf page content is converted to Svg and simplified when it is imported as a vector image.

There are two main reasons for this. Simplifying reduces the render workload in some cases and, more importantly, converts the glyphs to Svg paths. If the glyphs were retained, their IDs would commonly clash when the page images are combined on document export, which results in nonsensical text.

Another solution could be: parse the Svg but, instead of simplifying it, only prefix all matching IDs with a random string. This way there wouldn't be any clashes, but the original glyphs/text would still be retained. We'd need to test if the image rendering workload would still be acceptable, of course.
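
A minimal sketch of that ID-prefixing idea, written in Python for brevity rather than in rnote's own code, and using plain regular expressions instead of a real Svg parser. The function name `prefix_svg_ids` and the sample ID `glyph0-1` are hypothetical, and the sketch assumes double-quoted attributes and references of the form `href="#…"`, `xlink:href="#…"` or `url(#…)`. It is only meant to illustrate how per-page prefixes would keep glyph definitions apart:

```python
import re
import uuid
from typing import Optional


def prefix_svg_ids(svg: str, prefix: Optional[str] = None) -> str:
    """Prefix every id defined in an Svg string, plus references to those ids."""
    if prefix is None:
        # Random per-page prefix; starts with a letter so the id stays a valid XML name.
        prefix = "p" + uuid.uuid4().hex[:8] + "-"

    # Collect the ids that are actually defined in this document.
    ids = set(re.findall(r'\sid="([^"]+)"', svg))
    if not ids:
        return svg

    def rename(match: "re.Match[str]") -> str:
        name = match.group(2)
        # Only rewrite names defined in this document; leave other references alone.
        new_name = prefix + name if name in ids else name
        return match.group(1) + new_name + match.group(3)

    # Definitions: id="X"
    svg = re.sub(r'(\sid=")([^"]+)(")', rename, svg)
    # References: href="#X" and xlink:href="#X"
    svg = re.sub(r'(href="#)([^"]+)(")', rename, svg)
    # References: url(#X), as used by fill, clip-path, mask, ...
    svg = re.sub(r'(url\(#)([^)]+)(\))', rename, svg)
    return svg


# Two pages that both define "glyph0-1" but with different outlines.
page_a = '<svg><defs><path id="glyph0-1" d="M0 0L1 1"/></defs><use xlink:href="#glyph0-1"/></svg>'
page_b = '<svg><defs><path id="glyph0-1" d="M0 0L2 2"/></defs><use xlink:href="#glyph0-1"/></svg>'

# Each page gets its own random prefix, so the ids no longer collide when combined.
combined = prefix_svg_ids(page_a) + prefix_svg_ids(page_b)
```

A real implementation would presumably do this on the parsed tree rather than on the raw string, since a regex pass misses single-quoted attributes and references inside `<style>` blocks.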

EDIT: it looks like there is progress towards writing text back with usvg (resvg #682), so that would be a major step towards being able to simplify the Svg, resolve clashing element IDs, and retain text.

flxzt commented 6 months ago

With usvg v0.40 text is now retained. However, poppler still draws the glyphs as paths when rendering pdf pages to cairo. That's the only blocker left for retaining text in the export.

lokman2k5 commented 1 month ago

With usvg v0.40 text is now retained. However, poppler still draws the glyphs as paths when rendering pdf pages to cairo. That's the only blocker left for retaining text in the export.

Are there any changes regarding this matter? I'd like to be able to select text in a PDF, like in xournalpp.