coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Some ligatures don't seem to work #7

Closed raphink closed 12 years ago

raphink commented 12 years ago

This is a very nice project. I've been dreaming of something like that for some time (see http://tex.stackexchange.com/questions/18139/converting-latex-to-html5).

I've just tried converting a few documents and had some issues. For example, if you take Crocodoc's example document at http://personal.crocodoc.com/KhoD84, some characters are not rendered properly, and I get warnings during the conversion (using the master branch):

$ pdf2htmlEX TechCrunch\ -\ Font\ Magazine\ Issue\ 007\ -\ For\ embedding.pdf
Working: Warning: encoding confliction detected in font: f1
Warning: encoding confliction detected in font: f2
Warning: encoding confliction detected in font: f4
Warning: encoding confliction detected in font: f7
.Warning: encoding confliction detected in font: f8
Warning: encoding confliction detected in font: f9
Warning: encoding confliction detected in font: fb
Warning: encoding confliction detected in font: fc
Warning: encoding confliction detected in font: fd
Warning: encoding confliction detected in font: fe
Warning: encoding confliction detected in font: ff
Warning: encoding confliction detected in font: f10
Warning: encoding confliction detected in font: f11
Warning: encoding confliction detected in font: f12
Warning: encoding confliction detected in font: f13
Warning: encoding confliction detected in font: f14
Warning: encoding confliction detected in font: f15
.Warning: encoding confliction detected in font: f16
Warning: encoding confliction detected in font: f17
Warning: encoding confliction detected in font: f18
Warning: encoding confliction detected in font: f19
Warning: encoding confliction detected in font: f1a
Warning: encoding confliction detected in font: f1b
Warning: encoding confliction detected in font: f1c
Warning: encoding confliction detected in font: f1d
Warning: encoding confliction detected in font: f1e
Warning: encoding confliction detected in font: f1f
Warning: encoding confliction detected in font: f20
Warning: encoding confliction detected in font: f21
Warning: encoding confliction detected in font: f22
Warning: encoding confliction detected in font: f23
.Warning: encoding confliction detected in font: f24
Warning: encoding confliction detected in font: f25
Warning: encoding confliction detected in font: f26
Warning: encoding confliction detected in font: f27
.Warning: encoding confliction detected in font: f28
Warning: encoding confliction detected in font: f29
Warning: encoding confliction detected in font: f2a
Warning: encoding confliction detected in font: f2b
Warning: encoding confliction detected in font: f2c
Warning: encoding confliction detected in font: f2d
Warning: encoding confliction detected in font: f2e
Warning: encoding confliction detected in font: f2f
Warning: encoding confliction detected in font: f30
Warning: encoding confliction detected in font: f31
Warning: encoding confliction detected in font: f32
Warning: encoding confliction detected in font: f33
Warning: encoding confliction detected in font: f35
Warning: encoding confliction detected in font: f36
Warning: encoding confliction detected in font: f37
Warning: encoding confliction detected in font: f38
Warning: encoding confliction detected in font: f39
Warning: encoding confliction detected in font: f3a
Warning: encoding confliction detected in font: f3b
.Warning: encoding confliction detected in font: f3c
Warning: encoding confliction detected in font: f3d
Warning: encoding confliction detected in font: f3e
Warning: encoding confliction detected in font: f3f
Warning: encoding confliction detected in font: f40
Warning: encoding confliction detected in font: f41
Warning: encoding confliction detected in font: f42
Warning: encoding confliction detected in font: f43
Warning: encoding confliction detected in font: f44
.Warning: encoding confliction detected in font: f45
Warning: encoding confliction detected in font: f46
Warning: encoding confliction detected in font: f47
Warning: encoding confliction detected in font: f48
Warning: encoding confliction detected in font: f49
Warning: encoding confliction detected in font: f4a
Warning: encoding confliction detected in font: f4b
Warning: encoding confliction detected in font: f4c
Warning: encoding confliction detected in font: f4d
Warning: encoding confliction detected in font: f4e
Warning: encoding confliction detected in font: f4f
Warning: encoding confliction detected in font: f50
Warning: encoding confliction detected in font: f51
Warning: encoding confliction detected in font: f52
Warning: encoding confliction detected in font: f53
Warning: encoding confliction detected in font: f54
Warning: encoding confliction detected in font: f55
Warning: encoding confliction detected in font: f56
Warning: encoding confliction detected in font: f57
Warning: encoding confliction detected in font: f58
Warning: encoding confliction detected in font: f59
.Warning: encoding confliction detected in font: f5d
Warning: encoding confliction detected in font: f5e
.Warning: encoding confliction detected in font: f5f
Warning: encoding confliction detected in font: f60
..Warning: encoding confliction detected in font: f67
Warning: encoding confliction detected in font: f68
..Warning: encoding confliction detected in font: f69
Warning: encoding confliction detected in font: f6a
Warning: encoding confliction detected in font: f6c
Warning: encoding confliction detected in font: f6e
.

coolwanglu commented 12 years ago

Hello, thanks for reporting. The problem is confirmed; it is about font encoding. I'll dig deeper.

coolwanglu commented 12 years ago

Problem identified. Please try the 'exp' branch, see if it works for you.

So it's not about ligatures; it's about conflicts in font encodings. Actually, these are not real conflicts, since some of the codes involved are never used. So I added a preprocessor to resolve it.

I'd decided not to implement this feature until someone ran into this issue :P
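
To illustrate the idea above, here is a minimal Python sketch. It is not pdf2htmlEX's actual preprocessor; it assumes that a "conflict" means two character codes of one font mapping to the same Unicode value, and the function name `find_conflicts`, the `used_codes` parameter, and the sample encoding are all hypothetical.

```python
from collections import defaultdict

def find_conflicts(code_to_unicode, used_codes=None):
    """Return {unicode_value: [codes]} for values claimed by more than one code.

    If used_codes is given, codes that never appear in the document are
    ignored, so conflicts that involve only unused codes disappear.
    """
    by_unicode = defaultdict(list)
    for code, uni in sorted(code_to_unicode.items()):
        if used_codes is not None and code not in used_codes:
            continue  # this code is never drawn, so it cannot really conflict
        by_unicode[uni].append(code)
    return {uni: codes for uni, codes in by_unicode.items() if len(codes) > 1}

# Hypothetical font: codes 1 and 7 both claim U+0066, but only 1 and 2 are drawn.
encoding = {1: 0x66, 2: 0x69, 7: 0x66}
naive = find_conflicts(encoding)                        # sees a conflict
filtered = find_conflicts(encoding, used_codes={1, 2})  # sees none
```

Collecting the set of used codes requires a pass over the document before the fonts are processed, which would be one reason to do this in a separate preprocessing step.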

raphink commented 12 years ago

src/FontPreprocessor.h seems to be missing in the exp branch.

coolwanglu commented 12 years ago

Oops, my bad.

Just updated.

raphink commented 12 years ago

Wonderful, thanks!

coolwanglu commented 12 years ago

Did it work for you?

raphink commented 12 years ago

Yes, thank you.

nmm commented 11 years ago

Hi, thank you for the great project! Just to ask: has this fix been added to the main repository?

For some documents I get warnings like:

encoding confliction detected in font: 41
encoding confliction detected in font: 67

This is with --tounicode 1. With the default of 0, the message is instead:

ToUnicode CMap is not valid and got dropped
ToUnicode CMap is not valid and got dropped

I am using it on Cygwin with the latest poppler and FontForge.

Thanks

coolwanglu commented 11 years ago

@nmm Yes it has been. Please provide the affected PDF file if possible.

nmm commented 11 years ago

Yep, here it is: https://www.dropbox.com/s/5kxg1oc41vxd06g/doc4.pdf

coolwanglu commented 11 years ago

Yes, I tried the file and got the message. But on which page(s) are there problems?

nmm commented 11 years ago

On page 6, where the first warning occurs, "LEANING POWER OF TISA" appears without spaces. But this seems to be a browser (Chrome) issue, as in FF and IE the spaces between the words are there. So maybe everything is fine, thank you!

coolwanglu commented 11 years ago

I see it; it's a bug related to the space character in a font. Could you please file a new bug?

coolwanglu commented 11 years ago

@nmm This should be fixed now; please try the latest master branch. I hope it doesn't cause new issues.

nmm commented 11 years ago

Hi, I rebuilt and now it is better; only the last space is missing (Chrome only): https://www.dropbox.com/s/1amfxbadimqx54a/Capture.PNG

coolwanglu commented 11 years ago

Indeed, but I've got no idea about the cause now.

coolwanglu commented 11 years ago

I've set the CSS property "white-space: pre" and the proper fonts there, so the space should occupy some width, but it does not. I guess it may be a Chrome bug.

nmm commented 11 years ago

Hi again, Happy New Year :) I ran into another Chrome issue: the word "Arbeitsmarkt" shows a visible gap between "Arbeits" and "markt", even though it shouldn't. https://www.dropbox.com/s/hi7k3crwgbu0t8a/Arbeits_markt.PNG This is probably a WebKit problem, as Safari shows it as well. The issue appears to a greater or lesser degree throughout the whole document, and the gap changes when zooming. It makes the document look worse. Here is the HTML source of the conversion result:

<span class="f1 s4 c0 l5 w9 r0">
    Arb
    <span class="_ _1"></span>
    eits
    <span class="_ _1"></span>
    mar
    <span class="_ _4"></span>
    kt 
    <span class="l0 w0"> </span>
</span>

Is it possible to avoid having the spans break up a whole word when that is not necessary, i.e.:

<span class="f1 s4 c0 l5 w9 r0">
    Arbeitsmarkt 
    <span class="l0 w0"> </span>
</span>

Here is the pdf

If you convert it, there is also another issue on page 10: a strange "g" below "Abbildung 2".

coolwanglu commented 11 years ago

Thanks for reporting, but please open a new issue instead of simply leaving a message here.

pdf2htmlEX is a converter; it always follows the original content of the PDF. So if you see spaces between letters, that means there ARE spaces in the PDF. And it's not possible for pdf2htmlEX to recognize words.

About zooming: there are indeed problems in different browsers, but the problem should be with font sizes, not with the spaces. The browser may round the font size, so that letters come out larger or smaller than they should be. To check whether that's what you saw, just zoom in; once the letters are large enough, there should no longer be any problem with font sizes (and thus spaces).

Currently I don't have a solution for this, but you may want to check the manpage for the --font-size-multiplier parameter.

nmm commented 11 years ago

OK, I will open a new issue for the last problem with the "g" character. About the spaces, please note that I don't mean a literal space, just a visible gap where there is no space character. Here is a zoomed and selected screenshot of the word: https://www.dropbox.com/s/qehunz8qq9jc4ll/Arbeits_marktZoomedSelected.PNG I am wondering whether we can remove these empty span tags, since in the PDF there is nothing between the characters of the word. I think this would resolve the issue. Shall I open another issue for this too?

coolwanglu commented 11 years ago

Can you point out the location of the word 'Arbeitsmarkt' in the PDF, or show me a PDF with only that word? I want to compare the visuals.

I didn't mean ' ' either, but consider this:

In PDF, almost everything is absolutely positioned. Take the word 'apple' and suppose every letter is 10px wide. In PDF, the following two operations produce the same result:

(1) write 'apple' at (0,0) with font size 10px
(2) write 'app' at (0,0) and 'le' at (30,0), both with font size 10px

and there is no ' '.

(2) actually appears often in PDFs; I don't know why, but it's the truth.

pdf2htmlEX cannot and will not recognize the word 'apple' in (2), and will generate the HTML code you saw.

The problems come when you zoom out to, say, 33%. Ideally (2) should become: write 'app' at (0,0) and 'le' at (10,0), both with font size 3.33px. But the browser may not be happy with '3.33px'; it may use '3px' or '3.5px' or something else, so that the letters overlap or get separated.

Also notice that there is no ' ' inside the tag except for the last one, which is correct.
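
The rounding effect described above can be checked with a small Python sketch. It only illustrates the arithmetic of the 'apple' example and says nothing about how any particular browser actually rounds font sizes; the function name and the two rounding rules are assumptions, as is the simplification that every letter is exactly as wide as the font size.

```python
def gap_after_zoom(first_text, second_x, font_size, zoom, round_font):
    """Gap in px between the end of the first run and the start of the second.

    Assumes every letter is exactly as wide as the font size, as in the
    'apple' example; real fonts have per-glyph widths.
    A positive result is a visible gap, a negative one is an overlap.
    """
    rendered_size = round_font(font_size * zoom)  # size actually used for glyphs
    first_end = len(first_text) * rendered_size   # where 'app' really ends
    second_start = second_x * zoom                # positions scale exactly
    return second_start - first_end

zoom = 1 / 3
ideal = gap_after_zoom("app", 30, 10, zoom, lambda s: s)                 # no rounding
to_int = gap_after_zoom("app", 30, 10, zoom, round)                      # 3.33 -> 3
to_half = gap_after_zoom("app", 30, 10, zoom, lambda s: round(s * 2) / 2)  # 3.33 -> 3.5
```

Under these assumptions the unrounded case leaves no gap, rounding down to 3px leaves 'le' about 1px too far to the right, and rounding up to 3.5px makes the two runs overlap by about half a pixel.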

coolwanglu commented 11 years ago

More explanation about the span tag: basically it's an optimization in the HTML for when case (2) is found. I try to combine 'le' with 'app'.

However, I have to follow the metrics specified in the PDF. If you inspect the PDF file, I'm sure you will see separate operations within the word; correct me if I'm wrong.

On the other hand, there are a few options you may be able to tweak, such as --heps or the ones related to space widths.
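
As a rough illustration of combining runs as in case (2), here is a minimal Python sketch. It is not pdf2htmlEX's code: the `Run` type, `merge_runs`, and the `eps` threshold are hypothetical names, and it again assumes a fixed letter width. Two runs are joined when the second starts within `eps` of where the first ends; otherwise the offset is kept, standing in for a spacer element.

```python
from dataclasses import dataclass

@dataclass
class Run:
    text: str
    x: float  # left edge of the run, in px

def merge_runs(runs, letter_width, eps):
    """Join runs whose start lies within eps of where the previous run ends.

    Returns a list of str and float items: strings are text, floats are
    horizontal offsets that would still need a spacer element.
    """
    out = []
    cursor = None  # x position where the previous run ended
    for run in runs:
        if cursor is not None:
            offset = run.x - cursor
            if abs(offset) > eps:
                out.append(offset)  # too far apart: keep an explicit offset
        if out and isinstance(out[-1], str):
            out[-1] += run.text     # close enough: extend the previous text
        else:
            out.append(run.text)
        cursor = run.x + len(run.text) * letter_width
    return out

joined = merge_runs([Run("app", 0.0), Run("le", 30.0)], letter_width=10.0, eps=1.0)
split = merge_runs([Run("app", 0.0), Run("le", 32.0)], letter_width=10.0, eps=1.0)
```

With these inputs the first call yields a single 'apple' string, while the second keeps 'app', an offset of 2px, and 'le' as separate items. A larger threshold merges more runs at the cost of small positioning errors.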
