Closed ElectricRCAircraftGuy closed 3 years ago
@michaelsjackson, I don't see the problem at all on any converted PDF, and I've been using them for a long time (months? years?) regularly. I've never seen the problem you show once. I think it's an Evince problem. Try some other PDF readers on Linux and see what happens. My preferred PDF reader, hands-down, by far, is Foxit Reader. It works on Windows, Mac and Linux, but is NOT free software. It is no cost, however, for the basic version, which is the best PDF reader for Linux I've ever seen. Here's a screenshot:
In the screenshot above, I've converted this PDF (https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf) to be searchable, using my tool. I then used Foxit Reader to underline in red and highlight in yellow on the left, and I saved the PDF. The highlighted text on the right is just text I have selected. As you can see, it works and looks fine.
Note: the screenshot was taken with Shutter, which is the best screenshot tool by far in Linux, in my opinion. Install in Ubuntu 18.04 with sudo apt install shutter, then do this. Or, for Ubuntu 20.04.
I guess using the hOCR option in tesseract might help.
I think this only allows you to save a metafile with text outside the PDF, so I don't think it applies in this situation. I think the problem is just with your PDF reader. Try something else, like Foxit Reader.
Update: in the "Document Viewer" (Evince), which comes with Ubuntu 18.04, I see what you're seeing now:
I still think this is a bug in their software. Please file a bug report with them and post a link to it here and I can go upvote it or whatever too. Linux PDF viewers seem to be obsolete and don't work well and I don't like them much, with the exception of Foxit Reader.
@michaelsjackson, I just filed a bug report on Evince's gitlab page, here: https://gitlab.gnome.org/GNOME/evince/-/issues/1478. Please go there and upvote it to get it some attention. :+1:
Thanks for posting. I hit the up button there; I guess this is what you meant by upvoting. I do not normally care much about Evince, as I do not work with it; it only got my attention in this situation. My working tool for marking up PDF documents is Xournal, and sometimes also Xournal++, which has a few more features, in case I want those. The file formats are compatible anyway, so nothing is lost.
What do you like mostly in foxit reader?
What do you like mostly in foxit reader?
It allows marking up the PDF: underlining, crossing out, highlighting, making notes, etc. This was one of my holdups in moving to Linux for years (I only made the permanent switch from Windows 2 yrs ago), and it is what allowed me to fiiiiinally quit printing hundreds of pages of paper just so I could take notes, as now I can do it digitally, which is soooo much better!
If Xournal can do that too I'll take a look, but Foxit Reader is the only PDF tool I've found thus far that can do PDF markup in Linux, and it also happens to be no-cost.
I am using shutter as well, cool example with Dijkstra.
Yeah, you should definitely check out Xournal and Xournal++, but start with Xournal, as it is the original, more stable, and faster one. Then you will throw Foxit out of the window. :+1:
Xournal is developed by a maths professor, I guess, so he uses it himself as well. It uses a subset of SVG for its format. There are many interesting tools for it as well; from the command line you can, for example, generate new PDFs with your marks. Of course, you can later edit the source file again and re-export. It just loads the PDF as a background, and you start painting on it, like in Photoshop or so, only optimized for handwriting and for saving in an SVG-like vector format. For lecture scenarios it is just perfect.
The original Xournal format, .xoj, is a gzip-compressed XML file. To inspect one, rename it to .gz and run gunzip name.gz.
.xopp files (from Xournal++) work with the same technique: rename to .gz and run gunzip name.gz.
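As a rough sketch of the same idea without renaming anything, the file can also be read directly from Python's standard library. This assumes only what is stated above (that .xoj and .xopp files are gzip-compressed XML); the helper names and the example tag name "stroke" are illustrative assumptions, not something taken from the Xournal documentation.

```python
import gzip
import sys
import xml.etree.ElementTree as ET


def read_journal_xml(path):
    """Parse a gzip-compressed XML file (e.g. .xoj or .xopp) and return its root element."""
    # gzip.open decompresses on the fly, so no rename to .gz is needed.
    with gzip.open(path, "rb") as f:
        return ET.parse(f).getroot()


def count_elements(root, tag):
    """Count how many elements with the given tag name appear anywhere under root."""
    return sum(1 for _ in root.iter(tag))


if __name__ == "__main__" and len(sys.argv) > 1:
    root = read_journal_xml(sys.argv[1])
    print("root element:", root.tag)
    # "stroke" is an assumed tag name; print the tree to see what your file really contains.
    print("stroke elements:", count_elements(root, "stroke"))
```

Run it as, for example, python3 read_journal.py name.xoj (the script name is hypothetical). To get the raw XML text instead of a parsed tree, gzip.open(path, "rt", encoding="utf-8").read() should work the same way.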
Which hardware are you using for digital writing? If you use a tablet, my example with the graphics tablet driver fits here quite well. I am currently using an XP-Pen Deco 03, but its Linux driver does not yet allow free setup of the rotary wheel at the top left, which is actually why I bought this device. I have already contacted its developer and am still waiting for the Linux driver update, but I am not sure when such a feature will appear. I also bought it for its wireless operation, which is interesting for lecture scenarios too.
Which hardware are you using for digital writing?
A keyboard and mouse. :) I type into the PDF to take notes. Thanks for the info on Xournal and Xournal++. I'll check them out.
This is off-topic, but side note: I don't want to misrepresent the value of goto here by citing Dijkstra's paper out of context. I use goto all the time, under certain, well-defined error-handling cases. See my answer here, including my links at the end: https://stackoverflow.com/a/54488289/4561887.
In case you want to switch over to handwriting, which is more powerful because you can draw whatever you want, add sketches, and so on, you at least know my choice. At first I thought a tablet with a screen would be better, but later I realized that just the opposite is true. Devices without a screen are cheaper, for one, but forget the price: I see such a device simply as a replacement for a regular mouse, and it just works, needing only a USB plug, like a mouse. You will never have any problems with beamers and multiple different resolutions. It is a cheap, problem-free solution that just does its job.
Here is Xournal developers website, if you want to check: http://people.math.harvard.edu/~auroux/ and here you can see how he is using it, just one example, there are more under lecture notes: http://people.math.harvard.edu/~auroux/papers/slides-curvemirrors-zoominar-may2020.pdf
@michaelsjackson, update: the problem lies even further upstream than Evince, in a package called Poppler. Please upvote the upstream Poppler issue to get it some attention from the Poppler team: https://gitlab.freedesktop.org/poppler/poppler/-/issues/157. Thanks.
closing this issue since the problem lies with upstream dependencies, not with pdf2searchablepdf
Continued from here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF/issues/7#issuecomment-673742291.