FilipePS / Traduzir-paginas-web

Translate your page in real time using Google or Yandex
https://addons.mozilla.org/pt-BR/firefox/addon/traduzir-paginas-web/
Mozilla Public License 2.0

local pdf to html conversion #277

Open tino32 opened 3 years ago

tino32 commented 3 years ago

I just tested the new pdf translation feature. It's a great effort, but not super convenient yet. For a while I have been using this command on Linux to convert to html, with better results:
pdftohtml -enc UTF-8 -nodrm -noframes -dataurls -c myfile.pdf
(pdftohtml is from poppler. These options keep figures in place, produce a single html file, and ignore adobe DRM crap.) The TWP add-on then works flawlessly on the local html file.

Currently, with the web service, the pdf needs to be downloaded anyway, and then re-uploaded for conversion. Wouldn't it be cleaner and more secure to do the html conversion offline like this?

I have no idea how/if it is possible to run a terminal command from Firefox. I would guess it is not possible, for security reasons. Maybe the add-on could incorporate the poppler source code. As an easy intermediate solution, I would suggest that when a website gives me a pdf, the add-on generates a one-liner command that I can copy and paste into my terminal, which 1) downloads the file to a temp folder, 2) converts it to html, 3) re-opens it in a new tab, and 4) turns TWP translation on immediately, with the same language pair as the origin tab. That would save quite a few clicks.
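A minimal sketch of what such a generated one-liner could look like, written as a shell function that prints it. The function name pdf_oneliner, the file name doc.pdf, and the use of curl and a firefox binary on PATH are all assumptions for illustration, not something TWP provides. The pdftohtml options are the ones from the command above.

```shell
#!/bin/sh
# Sketch only: pdf_oneliner is a hypothetical helper, not part of TWP.
# It prints a one-liner for a pdf URL; nothing is downloaded until the
# printed line is pasted into a terminal.
pdf_oneliner() {
  # Single-quote the URL so characters like & and ? survive pasting;
  # embedded single quotes become '\''.
  quoted_url=$(printf '%s' "$1" | sed "s/'/'\\\\''/g")
  # 1) download to a temp folder, 2) convert with the options from this
  # thread, 3) open the result in a new Firefox tab.
  # Assumption: with -noframes and the output base "$d/doc", pdftohtml
  # writes "$d/doc.html"; check this against your poppler version.
  printf '%s\n' "d=\$(mktemp -d) && curl -fsSL -o \"\$d/doc.pdf\" '$quoted_url' && pdftohtml -enc UTF-8 -nodrm -noframes -dataurls -c \"\$d/doc.pdf\" \"\$d/doc\" && firefox --new-tab \"\$d/doc.html\""
}

# Example: print the one-liner for a placeholder URL.
pdf_oneliner 'https://example.com/paper.pdf'
```

This only covers steps 1 to 3. Step 4, switching translation on with the right language pair, would still have to happen inside the add-on once the local html tab opens.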

Not sure if this is the right approach to submit an issue. Just a suggestion, I hope it helps.

FilipePS commented 2 years ago

The command currently used looks like this: -c -i -hidden -zoom "1.3". It was the best configuration I found. Note that the -i option ignores images; without it, very large pdfs full of images could take hours to be converted to HTML. My server converts between 300 and 400 pdfs per day, so it would be unfeasible to extract images from these pdfs.
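To compare the two configurations mentioned in this thread on your own machine, something like the following could be used. The helper name pdftohtml_opts and the output names out-server and out-local are made up for this sketch; the option strings are copied from the comments above.

```shell
#!/bin/sh
# Sketch only: pdftohtml_opts is a hypothetical helper that prints the
# pdftohtml options for one of the two setups discussed in this thread.
pdftohtml_opts() {
  case "$1" in
    # Server configuration: complex output, images ignored (-i),
    # hidden text extracted, zoom 1.3.
    server) echo '-c -i -hidden -zoom 1.3' ;;
    # Configuration from the original post: single html file with
    # images embedded as data URLs.
    local)  echo '-enc UTF-8 -nodrm -noframes -dataurls -c' ;;
    *) echo 'usage: pdftohtml_opts server|local' >&2; return 1 ;;
  esac
}

# Example (requires poppler): time both on the same file to compare.
# time pdftohtml $(pdftohtml_opts server) myfile.pdf out-server
# time pdftohtml $(pdftohtml_opts local) myfile.pdf out-local
```

On a pdf with many images, the difference in run time between the two should show why the server skips images.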

My extension could just send the pdf link to my server, and the server would then download and convert the file. However, I preferred to do it the current way, because some pdfs may require authentication to be downloaded.

Here is the server source code. I'm thinking of adding an option to the extension to use a local (self-hosted) server. https://github.com/FilipePS/Poppler-Server