dawbarton / pdf2svg

A simple PDF to SVG converter using the Poppler and Cairo libraries
GNU General Public License v2.0

Is it possible to convert to SVG but keep text as text? #17

Open Dingo64 opened 7 years ago

Dingo64 commented 7 years ago

Is it possible to convert to SVG but keep text as text?

RonanKER commented 5 years ago

I think "pdf2svg" cannot do anything about that; it depends on the Poppler or Cairo library.

yuweiming2016 commented 5 years ago

@RonanKER, do you have any code or configuration that shows this? I have been searching Google for a week for a way to make pdf2svg keep text as text, but found nothing useful. Can you help me?

dawbarton commented 5 years ago

If you want to keep text in the SVG then your best bet is to use Inkscape. I'm fairly sure it can be used from the command line to automate the conversion with text (though I've never used it for automated PDF -> SVG, only manually). Be aware that text often moves around a bit (the kerning is often a little off) when converting from a PDF.

dawbarton commented 5 years ago

See https://inkscape.org/doc/inkscape-man.html for details on the Inkscape command line.

yuweiming2016 commented 5 years ago

I have been learning Inkscape for a week. As far as I know, Inkscape can only convert the first page of a PDF to SVG. Is this true? That is bad news for me. @dawbarton

dawbarton commented 5 years ago

It can open any page when opening with the GUI. If you want everything via the command line, you can simply use qpdf or pdftk to extract the page you want from the PDF as a single page and then use Inkscape. (Inkscape might be able to do page selection from the command line, I just don't know how.)
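A minimal sketch of that two-step approach, for anyone who wants to script it. It assumes qpdf and Inkscape 1.x are on the PATH; the function names and output file names are made up for illustration, and the Inkscape options shown are the 1.x ones (0.9x releases use different options). Whether the text survives as text still depends on Inkscape's PDF import and on the fonts in the PDF.

```python
import subprocess
from pathlib import Path


def extract_page_cmd(pdf: str, page: int, single_page_pdf: str) -> list[str]:
    # qpdf page selection: start from an empty document, take page `page`
    # (1-based) from `pdf`, and write it out as a one-page PDF.
    return ["qpdf", "--empty", "--pages", pdf, str(page), "--", single_page_pdf]


def to_svg_cmd(single_page_pdf: str, svg: str) -> list[str]:
    # Assumption: Inkscape 1.x command-line options.
    return ["inkscape", single_page_pdf, "--export-type=svg", f"--export-filename={svg}"]


def convert_page(pdf: str, page: int, out_dir: str = ".") -> Path:
    # Hypothetical helper: extract one page, then convert that page to SVG.
    out = Path(out_dir)
    single = out / f"page-{page}.pdf"
    svg = out / f"page-{page}.svg"
    subprocess.run(extract_page_cmd(pdf, page, str(single)), check=True)
    subprocess.run(to_svg_cmd(str(single), str(svg)), check=True)
    return svg

# Example call (requires both tools to be installed):
#   convert_page("input.pdf", 3)
```

Looping `convert_page` over a range of page numbers gives one SVG per page.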

yuweiming2016 commented 5 years ago

I googled for a long time, but found nothing useful. So sad.


RonanKER commented 5 years ago

I have an old batch script from 2015, from when I tried this (with pdftk and Inkscape): test_inkscape.txt

In the folder 'in' I put several example/test PDF files, then launched several similar batch files to try different solutions (Inkscape, pdf2svg, PDFTron, Poppler, ...) and compared the results.

If you can afford it, I think PDFTron was the best, but I'm not sure it would preserve text as you wish.

danielk892374 commented 5 months ago

Could anyone point me in the right direction to understand why neither Cairo nor Poppler preserves text during PDF to SVG conversion (so I can find some workaround to force them to keep it)? Does this procedure have a name? Is it "text vectorization" by any chance?

By the way, I've tried Inkscape as well, but no luck. LibreOffice seemed to work, but it was extremely slow and created a large .svg file, which is very hard to open.

dawbarton commented 5 months ago

I'm not sure what the name is ("preserve text" would have been my guess). Inkscape is usually the best in recent years - I've not had any problems with the PDFs that I've given it recently. It might be worth running pdftotext on your PDF to see if it does actually contain any text.
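A rough sketch of that check, assuming pdftotext (from poppler-utils) is installed. The function names are made up for illustration; passing `-` as the output file makes pdftotext write to stdout.

```python
import subprocess


def has_real_text(extracted: str, min_chars: int = 1) -> bool:
    # Ignore whitespace, including the form feeds pdftotext puts between pages.
    return len("".join(extracted.split())) >= min_chars


def pdf_text(pdf: str) -> str:
    # "-" as the output file name sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf, "-"], capture_output=True, text=True, check=True
    )
    return result.stdout

# Example call (requires pdftotext to be installed):
#   has_real_text(pdf_text("input.pdf"))
```

If this returns False, the page content probably has no text for any converter to preserve (for example a scanned page, or text already drawn as outlines).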

danielk892374 commented 4 months ago

After some research on PDFs in general, I've realized that the problem was that the text was not "regular text" but part of "annotation/comment" objects. These often get ignored on import, and I believe Inkscape excluded them as well.
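For anyone hitting the same thing, here is a crude way to spot annotations before converting. This is only a heuristic sketch with made-up names: it scans the raw file bytes for annotation-related PDF name tokens, so it will miss annotations stored in compressed object streams (rewriting the file first with `qpdf --qdf --object-streams=disable` should make them visible). A proper check would use a real PDF parser.

```python
import re

# PDF name tokens that hint at annotations. /FreeText annotations draw text
# directly on the page; /Text annotations are the "sticky note" kind.
MARKERS = {
    "Annots": rb"/Annots\b",
    "FreeText": rb"/Subtype\s*/FreeText\b",
    "Text": rb"/Subtype\s*/Text\b",
}


def annotation_hints(pdf_bytes: bytes) -> dict[str, int]:
    # Heuristic only: counts token occurrences in the uncompressed bytes.
    return {
        name: len(re.findall(pattern, pdf_bytes))
        for name, pattern in MARKERS.items()
    }

# Example call:
#   with open("input.pdf", "rb") as f:
#       print(annotation_hints(f.read()))
```

Non-zero counts suggest that some of the visible text may live in annotations rather than in the page content, which would explain why it is missing from the SVG.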