coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Printing #40

Closed stanbellcom closed 11 years ago

stanbellcom commented 11 years ago

HTML pages produced by pdf2htmlEX cannot be printed correctly (retaining the formatting or correct paging).

This page lists the reasons and difficulties.

This issue is left open for discussing possible solutions. Please read the wiki page above before leaving messages.

coolwanglu commented 11 years ago

It's an old known issue, and it has been on the TODO list for quite a while. Currently printing is not a high priority, as you can also provide the original PDF for printing, as Wikipedia does.

stanbellcom commented 11 years ago

That is right, however it is not quite applicable in my situation - I have integrated the output of pdf2htmlEX with an HTML annotator, and by printing I would like to get the rendered HTML plus the created annotations.

Can you please give an approximate date for when the printing feature will be available, if ever?

Thanks

coolwanglu commented 11 years ago

Hmm, now I'm working on improving the background images, and there are quite a number of things to do after that. No guarantee on the speed, as I'm doing it in my spare time.

You may consider filing a commission if you want it to be done soon.

stanbellcom commented 11 years ago

that might be a possibility,

now that you know the issue, could you perhaps tell me what an estimated price would be to speed up this feature?

coolwanglu commented 11 years ago

could you please send me an email (which can be found in the project home page)?

lalith-b commented 11 years ago

@stanbellcom for annotating you could use Ghostscript and add the annotations to your PDF. This way your PDFs are modified, not the HTML. An annotations view for pdf2htmlEX has already been requested and is being worked on :+1:

iapain commented 11 years ago

Wouldn't adding CSS for the "print" media type solve this?

coolwanglu commented 11 years ago

Yes, it should be about CSS. Here are the things I'll need to do:

  • Calculate (separately) the metrics for printing, like height/width and font sizes
  • Tweak the CSS for several elements and hide the visual effects

These should not be hard, but I'm investigating why the fonts are not working; they should be.

stanbellcom commented 11 years ago

@deathlord87 thanks for the information, I'll look into it. However, at first sight it does not seem right for my application - I have multiple users on a website, which means I would need a copy of the PDF for every single user. With HTML annotation it's just easier - I serve the same HTML, and then add the notes onto it. By the way, I'm using a great tool: annotateit

but you're certainly right, this will help me solve the printing problem (if ghostscript can print with notes).

coolwanglu commented 11 years ago

@stanbellcom I guess you can store only the original PDF, and generate a temporary annotated PDF when a user requests printing. But that would probably make your system more complicated.

lalith-b commented 11 years ago

@stanbellcom

Ghostscript only takes text files as input for the annotation text and the location/rect of each annotation. You can create individual text files for each user accessing your app (try out HTML5 LocalStorage). When the user clicks "download PDF", Ghostscript can be executed with that text to annotate his/her own PDFs.

coolwanglu commented 11 years ago

According to my research on this, there are two major issues preventing it from being done.

So I guess this will have to wait for quite a while.

If anyone finds a solution for this, please kindly tell me.

coolwanglu commented 11 years ago

I have created a Wiki page for this issue and updated the issue description.

jahewson commented 11 years ago

Paging cannot be controlled - the latest CSS standard is not supported by browsers right now, SVG might work for this.

SVG won't work for paging.

Most browsers support SVG 1.1, which does not include multiple pages. Multiple pages are part of SVG 1.2 Full, however SVG 1.2 Full was abandoned, and replaced by SVG 1.2 Tiny which some browsers support, but it does not include multiple pages.

The next version of SVG will be SVG 2.0, at some point in the distant future.

coolwanglu commented 11 years ago

@jahewson Thanks for the info.

So before this can be solved with some future version of CSS or SVG, providing the original PDF is the best solution.

purem commented 11 years ago

This has a workaround for Firefox and Chrome not printing @font-face fonts: https://getsatisfaction.com/fontdeck/topics/_font_face_embedded_fonts_do_not_show_up_in_a_print_preview

I'm not sure if it's the exact issue, but I thought I'd post it anyway on the off chance it may help.

coolwanglu commented 11 years ago

@purem, thanks for the info, but as also mentioned in that link, Firefox currently does not support this.

jahewson commented 11 years ago

That Firefox bug has "Status: RESOLVED FIXED" since October 2012.

I'm using Firefox 18, and web fonts now print correctly :smile:

coolwanglu commented 11 years ago

Great news. So it is worth a new try.

jahewson commented 11 years ago

Printing pdf2htmlEX output in Chrome (23) is broken for me - I always get a single blank page. However, Chrome can print webfonts, because http://www.google.com/webfonts/specimen/Kavoon prints fine.

coolwanglu commented 11 years ago

It must be about CSS.

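So now here are the things to do:

  • Set proper CSS for the UI
  • Set a separate CSS file for printing, basically to change units (px->pt)
  • Set proper page-breaks (CSS property page-break-before/after)
  • (possible?) Disable header/footer/margin for printing

I'll check them out later.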
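As a rough illustration of the page-break and page-margin ideas raised in this thread, a print rule set might look like the sketch below. The class name `.pdf-page` is hypothetical (it stands in for whatever element pdf2htmlEX emits per PDF page), and browsers may or may not honor the `@page` margin request.

```css
/* Sketch only: ".pdf-page" is a hypothetical class name, not taken
   from pdf2htmlEX's actual output. */
@media print {
  /* Put each converted PDF page on its own printed sheet. */
  .pdf-page {
    page-break-after: always;
    page-break-inside: avoid;
  }

  /* Avoid an extra blank sheet after the final page. */
  .pdf-page:last-child {
    page-break-after: auto;
  }
}

/* Request zero page margins from the browser. Support for @page
   differs between browsers, so treat this as a request only. */
@page {
  margin: 0;
}
```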

purem commented 11 years ago

I have been hacking around with the @font-face generation. I am creating a branch where you can pass a comma delimited array of types rather than one type. I then format them in a way which is cross browser. Maybe this will help?

On 4 February 2013 13:20, Lu Wang notifications@github.com wrote:

It must be about CSS.

So now here are the things to do

  • Set proper CSS for the UI
  • Set a separate CSS file for printing, basically to change units (px->pt)
  • Set proper page-breaks (CSS property page-break-before/after)
  • (possible?) Disable header/footer/margin for printing

I'll check them out later.

coolwanglu commented 11 years ago

The reason I chose TrueType as the default is that this format is supported by all decent browsers. Otherwise the generated HTML files cannot even be displayed correctly on screen. I guess that font format is not a problem right now.

jahewson commented 11 years ago

@purem font files are big, you don't really want to be serving up multiple formats if you can avoid it - especially as pdf2htmlEX embeds the fonts as base64 data URIs.

TrueType works cross-browser everywhere except oldIE, which isn't supported by pdf2htmlEX. FontForge can't generate EOT files anyway. WOFF is now a viable option too, it's essentially a gzipped ttf file.

@coolwanglu maybe WOFF should be the default font format?

coolwanglu commented 11 years ago

I remember that WOFF did not work everywhere...

Embedding is optional, so it's ok if you generate multiple files, and load one of them according to the browser.

jahewson commented 11 years ago

I remember that WOFF did not work everywhere...

Embedding is optional, so it's ok if you generate multiple files, and load one of them according to the browser.

There's no point - all browsers that currently support WOFF already support TrueType.

coolwanglu commented 11 years ago

maybe old IE... I will test again. But what's wrong with ttf?

jahewson commented 11 years ago

So now here are the things to do ...

  • Set a separate CSS file for printing, basically to change units (px->pt)

No need, a CSS pixel is 1/96 of an inch.
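To spell out the arithmetic behind that remark: CSS defines 1in = 96px = 72pt, so 1px = 0.75pt, and lengths written in px already have a fixed physical size. The selectors in this sketch are hypothetical and exist only to show equivalent pairs.

```css
/* CSS absolute units: 1in = 96px = 72pt, hence 1px = 0.75pt.
   The class names below are made up for illustration. */

/* These two rules give the same font size: 16 * 0.75 = 12. */
.size-in-px { font-size: 16px; }
.size-in-pt { font-size: 12pt; }

/* These three rules give the same width (US Letter, 8.5in):
   8.5 * 96 = 816 and 8.5 * 72 = 612. */
.width-in-px { width: 816px; }
.width-in-pt { width: 612pt; }
.width-in-in { width: 8.5in; }
```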

jahewson commented 11 years ago

But what's wrong with ttf?

Nothing really. WOFF files are compressed OTF files, so there's a size advantage. WOFF is supported almost everywhere worth bothering with now.

EDIT: turns out WOFF is not supported on Android.

coolwanglu commented 11 years ago

So how can you do it in one rule?

jahewson commented 11 years ago

TrueType is a good default, keep it as it is. Serving up gzipped TrueType with content-encoding: gzip is going to work everywhere except oldIE.

coolwanglu commented 11 years ago

I meant the px and pt thing, I will have to create two sets of rules, for different media.

purem commented 11 years ago

Let's look at this statistically.

Total percent of users using a browser with some support for @font-face: 93.09%

  • Support for WOFF: 76.61% (17% of users with some support excluded)
  • Support for TTF: 66.47% full, 16.03% partial -> 82.5% total (11% of users with some support excluded totally, 27% with support)

Summary stats

  • Best for IE -> EOT
  • Best for firefox -> TTF/OTF
  • Best for Chrome -> SVG/TTF/OTF
  • Best for Safari -> TTF/OTF
  • Best for Opera -> SVG
  • Best for iOS Safari -> SVG
  • Best for android -> TTF
  • Best for blackberry -> TTF/OTF/SVG
  • Best for opera mobile -> TTF/OTF/SVG
  • Best for chrome for android -> TTF/OTF/SVG
  • Best for firefox -> TTF/OTF/WOFF

Comments

  • To support anything before IE 9 you need to use EOT. IE only has partial support of TTF with @font-face.
  • To support old versions of iOS Safari and Opera you need SVG.
  • TTF/OTF seems to trump WOFF in terms of compatibility in all respects; however, the smaller file size of WOFF is obviously preferred.

Stats from:

  • http://caniuse.com/fontface
  • http://caniuse.com/woff
  • http://caniuse.com/ttf
  • http://caniuse.com/svg-fonts
  • http://caniuse.com/eot

To remedy this we can write fonts in all formats and then either:

A.) Use multiple font-face formats in the declaration. Font Squirrel http://www.fontsquirrel.com/fontface/generator recommends the following as its "optimal" layout ("Recommended settings for performance and speed."). It's recommended by Paul Irish, who knows his stuff and has written a long blog post on the ins and outs of @font-face declarations: http://paulirish.com/2009/bulletproof-font-face-implementation-syntax/

@font-face {
    font-family: 'pragmataproregular';
    src: url('pragmatapro-webfont.eot');
    src: url('pragmatapro-webfont.eot?#iefix') format('embedded-opentype'),
         url('pragmatapro-webfont.woff') format('woff'),
         url('pragmatapro-webfont.ttf') format('truetype'),
         url('pragmatapro-webfont.svg#pragmataproregular') format('svg');
    font-weight: normal;
    font-style: normal;
}

Quick Explanation

The first declaration is for IE9 / IE10, as EOT is the preferred font (it's IE's format, so supposedly it does a good job with it).

The second is for IE8 and below. It will take the last src: statement as the one to use. IE8 and below have a bug in their parsers which results in 404s when more than one font is included in a src declaration. The ?#iefix tricks it into thinking everything after the ? is a query string, so it skips the other font formats. http://stackoverflow.com/questions/8050640/how-does-iefix-solve-web-fonts-loading-in-ie6-ie8

The woff, ttf and svg fonts are read by the rest of the browsers if they can read them (see above for the importance of all types if maximising compatibility). All browsers should only load one font, so bandwidth really isn't an issue.

B.) If they do, we could take the approach that Scribd takes and detect browsers before injecting a CSS file containing the fonts in the most relevant format. This would require writing a separate CSS file for each font type, each containing all required fonts.

Neither is too hard to implement, and both will result in increased support. I propose that we make A. the default, as this version doesn't require JS.

jahewson commented 11 years ago

I meant the px and pt thing, I will have to create two sets of rules, for different media.

Ha ok. To get different rules for different media you need a CSS media query http://www.w3.org/TR/css3-mediaqueries/ for print which overrides your usual styles. This is similar to the WebKit stroked text hack I used.
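A minimal sketch of that override pattern, assuming hypothetical class names (`.pdf-page` for a page box and `.viewer-toolbar` for on-screen controls; neither is taken from pdf2htmlEX's output): the base rule applies to every medium, and the print block changes only what differs on paper.

```css
/* Base rule: applies on screen and, unless overridden, in print. */
.pdf-page {
  margin: 12px auto;
  box-shadow: 1px 1px 3px 1px #333;
}

/* Print-only overrides: drop the decorative effects and hide
   on-screen controls. Everything else is inherited from above. */
@media print {
  .pdf-page {
    margin: 0;
    box-shadow: none;
  }

  .viewer-toolbar {
    display: none;
  }
}
```

Only the properties that differ need to be repeated in the print block, so the "two sets of rules" can stay small.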

coolwanglu commented 11 years ago

That is what I meant, they are still two rules.

jahewson commented 11 years ago

Best for iOS Safari -> SVG

@purem, that's very out of date. TTF has been supported since iOS 4.2, and WOFF since iOS 5.0.

purem commented 11 years ago

SVG has best compatibility, it's good to have it there just in case. It's minimal effort anyway.

jahewson commented 11 years ago

IE only has partial support of TTF with @font-face.

Not really. Just turn off the font's "installable" flag in the TTF header.

@coolwanglu, does pdf2htmlEX do this? It needs to if not...

coolwanglu commented 11 years ago

Thanks for the info. Not sure if this will fix for IE <9, due to the html 5 tags used.

An option for multiple fonts sounds good, but I think ttf is good enough as the default.

coolwanglu commented 11 years ago

I am not sure. fontforge does all the font jobs.

But I guess so.

jahewson commented 11 years ago

SVG has best compatibility, it's good to have it there just in case. It's minimal effort anyway.

@purem - no, they're not supported by Firefox or IE. That's hardly "best compatibility". They don't contain hints so look like rubbish on Windows. A few years ago they were the only way to get web fonts on mobile, but not anymore.

jahewson commented 11 years ago

Not sure if this will fix for IE <9, due to the html 5 tags used.

IE <9 only supports EOT fonts, which FontForge can't produce.

purem commented 11 years ago

I was responding to your post. SVG has best compatibility for iOS. It doesn't matter that SVG isn't supported in Firefox and IE as they won't be used there!

Supporting more browsers will not be that much work and is a good thing. I'm implementing it now. There aren't any downsides. What's the problem?

jahewson commented 11 years ago

@purem, ok that's more like it, though the original post said "best", not "best compatibility". From a compatibility point of view it's a waste of time serving SVG fonts to iOS < 4.2 users, because that's only the first generation iPod Touch and iPhone. You're talking <1% of iPhone users, stuck on 5 year old technology.

Supporting more browsers will not be that much work and is a good thing.

It's a huge amount of work to actually test on these browsers, and you need to use an ancient iPhone which hasn't been updated to iOS 4.2. That's the wasted time, not the time it takes to code it. I guess you could just output the SVG font, stick it in the @font-face and hope for the best, but you have no idea if it actually works or not. Does the html and CSS for pdf2htmlEX content even render on iOS < 4.2 anyway, given that it's HTML5? Fonts are only one aspect of browser support.

jahewson commented 11 years ago

That is what I meant, they are still two rules.

@coolwanglu what exactly is the problem with that?

coolwanglu commented 11 years ago

Not a problem, just things to do.

jahewson commented 11 years ago

Yes, lots to do!

jahewson commented 11 years ago

Ok, I found out why printing is broken in Chrome: it ignores absolutely positioned content when deciding the page size. There's a simple fix:

1. delete the following CSS attributes from #pdf-main:

position: absolute;
top: 0; 
left: 0;
bottom: 0;
right: 0;

2. add the declaration:

body {
   margin: 0;
}

Now printing in Chrome will work. It still needs some improvement though...

jahewson commented 11 years ago

Looks like CSS has page-breaks http://davidwalsh.name/css-page-breaks

coolwanglu commented 11 years ago

I already mentioned page-break-before/after in the "things to do" list

About CSS: for screen, the default UI needs a fixed-height container to handle scrolling etc.; let me see if ":visible" from jQuery works. But anyway, separate media queries should work.

Does body {margin:0} remove the page margin when printing?
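One way the screen container and the Chrome fix above could coexist is to keep the absolute positioning for screen and undo it only for print. This is a sketch under assumptions: `#pdf-main` and the `body` rule come from the fix described earlier in the thread, while the `overflow` values are guesses at what a scrolling container would use.

```css
/* Screen: keep the fixed-size scrolling container for the UI.
   "overflow: auto" is an assumption, not taken from pdf2htmlEX. */
@media screen {
  #pdf-main {
    position: absolute;
    top: 0;
    left: 0;
    bottom: 0;
    right: 0;
    overflow: auto;
  }
}

/* Print: let the container flow normally so the browser can
   measure the content when it decides the page size. */
@media print {
  #pdf-main {
    position: static;
    overflow: visible;
  }

  body {
    margin: 0;
  }
}
```

Note that `body { margin: 0 }` removes only the body element's own margin; the sheet margins added by the browser's print dialog are a separate matter.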
