coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Text is not selectable in resulting HTML #423

Open cdeblois opened 10 years ago

cdeblois commented 10 years ago

I'm using this pdf: http://journal.frontiersin.org/Journal/10.3389/fphy.2013.00021/pdf.

When converted with the --correct-text-visibility flag, it converts nearly perfectly. However, the text on the first page is not selectable at all. All later pages seem to be fine.

I'm using the 'incoming' branch version 0.13 for this.

duanyao commented 10 years ago

The text selection issue in your file is not related to --correct-text-visibility; it also occurs without any parameters.

The problem is that there is a div.c near the end of the first page of the generated HTML that covers the whole page, which makes almost all text unselectable except the text inside the div.c itself.

A div.c corresponds to a clip path in the PDF file and uses overflow:hidden to achieve the clipping effect. Unfortunately, it has the side effect of blocking user interactions (including selection) with the elements below it.

In fact, --correct-text-visibility already handles clipping quite well and should not have such a side effect. So can we suppress div.c when --correct-text-visibility is on? Can we use CSS clipping when --correct-text-visibility is off? Or just get rid of div.c completely?

@cdeblois, your issue title and description do not seem quite accurate. Can you modify them?

cdeblois commented 10 years ago

@duanyao - Is that title better for you?

duanyao commented 10 years ago

Yes, that's ok.

cdeblois commented 10 years ago

@duanyao, please let me know if you think you can make the suggested changes for me to try out.

Thanks

duanyao commented 10 years ago

@cdeblois I think this isn't a trivial task to fix in pdf2htmlEX, and it may take time. If you can afford to modify the output HTML by hand, you can locate the problematic div.c and assign the style z-index: -1 to it. However, this can sometimes make the stacking order of the text incorrect.

cdeblois commented 10 years ago

Not sure I understand what you mean by div.c. My understanding is that it's not always the same div causing the glass effect.

duanyao commented 10 years ago

div.c means div elements that have class c. In pdf2htmlEX-generated HTML, class c means "clip". Yes, this issue depends on how a PDF uses clips. You can use the devtools in your browser (usually the F12 key) to locate the problematic div.c, and then modify it with a text editor.

Introductions to browser devtools: http://www.labnol.org/software/chrome-dev-tools-tutorial/28131/ https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector
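
For cases where opening devtools by hand is not an option, here is a minimal Python sketch that lists the div elements carrying class c in a generated page. It relies only on what is said above, namely that clip elements are divs with class c. The function name find_clip_divs and the sample markup are hypothetical, not something pdf2htmlEX provides.

```python
from html.parser import HTMLParser


class ClipDivFinder(HTMLParser):
    """Collect (line, class list) for every <div> whose class list contains "c"."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Split the class attribute so that "pc" or "bf" is never mistaken for "c".
        classes = (dict(attrs).get("class") or "").split()
        if "c" in classes:
            line, _column = self.getpos()  # position of the start tag, 1-based line
            self.found.append((line, classes))


def find_clip_divs(html_text):
    """Return a list of (line number, class list) for each div.c in html_text."""
    finder = ClipDivFinder()
    finder.feed(html_text)
    return finder.found


# Hypothetical fragment shaped like the markup quoted in this thread.
sample = (
    '<div class="pc w0 h0">\n'
    '<div class="c x0 y0 w2 h0"><div class="t">clipped text</div></div>\n'
    '<div class="t x1">other text</div>\n'
    '</div>'
)

for line, classes in find_clip_divs(sample):
    print(line, " ".join(classes))
```

This only reports where the div.c elements are. Deciding which one covers the page still needs the position and size classes (x0, y0, w2, h0 and so on) or a look at the rendered page.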

cdeblois commented 10 years ago

Thanks. Ideally, we are looking for a programmatic solution here, as I don't have the option to do this manually. However, a programmatic solution that gives options to the users may be workable.

So, do you believe that if this class 'c' is removed that it will resolve this issue of non-selectable text?
Is it only this class 'c' that would be causing these types of issues?

I've tried manually removing the class 'c' in articles where this issue exists. It seems to mess up the formatting unless you remove the class from the right div tag, and even then there is still some formatting loss in some cases.

I'm just thinking that we could offer the user the option to remove one class 'c' at a time and visually inspect the result. If it doesn't help, or it messes up the formatting, we would give the user the option to try removing another class 'c'.

duanyao commented 10 years ago

No, don't remove div.c (or some text will be removed as well). You should add the attribute style="z-index:-1" to it. If you are not familiar with HTML and CSS and don't understand what I'm saying, you will have to wait until this issue gets fixed. However, there is no schedule, sorry.

cdeblois commented 10 years ago

My observation is that when doing this it actually causes text to be removed versus just removing "c" from the class list. Most of the manual tests I've tried with a variety of documents seem to allow text to be selected after just removing the "c" class from the div.

No changes (image): the text is not selectable.
With only style="z-index: -1" added to the div (<div class="c x0 y0 w2 h0" style="z-index: -1">) or to the class rule (.c {border: 0 none; display: block; margin: 0; overflow: hidden; padding: 0; position: absolute; z-index: -1;}) (image): there is missing content.
With only the "c" class removed from the list (<div class="x0 y0 w2 h0">) (image): the text is selectable.

Seems that the selective removal of the "c" class is solving most of our problems here. Not sure I'm seeing much apparent value when the "c" class is being used. Does this make any sense to you?

Also, I guess I didn't understand who your questions above were directed to:
"So can we suppress div.c when --correct-text-visibility is on? Can we use CSS clipping when --correct-text-visibility is off? Or just get rid of div.c completely?" Seems like I decided to run with one of your ideas and it is working for me.

duanyao commented 10 years ago

It seems you need to add a rule .pc { z-index: 0; } to make the z-index: -1 trick work in Chrome. The complete rules could be:

.c { z-index: -1; }
.bf { z-index: -1; } /* bring background image below .c */
.pc { z-index: 0; } /* ensure a new stacking context on chrome*/

Removing class c is more problematic, because its children are positioned relative to it, and clipped by it.

Those questions were directed at the other developers of pdf2htmlEX.

cdeblois commented 10 years ago

@duanyao, are you suggesting that I apply the above selectively? I ask because if I apply it globally, as you have shown, all content is removed in all browsers.

duanyao commented 10 years ago

Add all 3 rules to the CSS. It works in Firefox and Chrome for me. Which browsers did you test?

Oh, you are probably using a png/jpg background image (I was using svg). In this case .bf becomes .bi, so the rules to be added are:

.c, .bf, .bi { z-index: -1; }
.pc { z-index: 0; }
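
Since a programmatic fix was asked for, here is a minimal Python sketch of appending these two rules to an already generated page. It assumes the page has a closing </head> tag and that a style block placed just before it comes after the stylesheet pdf2htmlEX wrote, so the overrides win at equal specificity. The function name inject_override is hypothetical; this is post-processing, not a pdf2htmlEX option.

```python
import re

# The two rules quoted above, covering both svg (.bf) and png/jpg (.bi) backgrounds.
OVERRIDE_CSS = (
    ".c, .bf, .bi { z-index: -1; }\n"
    ".pc { z-index: 0; }\n"
)

HEAD_CLOSE_RE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_override(html_text, css=OVERRIDE_CSS):
    """Return html_text with a <style> block holding css placed before </head>."""
    match = HEAD_CLOSE_RE.search(html_text)
    if match is None:
        raise ValueError("no </head> found; cannot place the override rules")
    style_block = '<style type="text/css">\n' + css + "</style>\n"
    return html_text[:match.start()] + style_block + html_text[match.start():]


# Hypothetical stand-in for a generated page.
page = (
    "<html><head><style>.c{overflow:hidden;position:absolute}</style>"
    "</head><body></body></html>"
)
print(inject_override(page))
```

In practice one would read the generated file, pass its text through inject_override, and write the result to a new file so the original output stays available for comparison.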

cdeblois commented 10 years ago

@duanyao, that seems better, Thx!....I will retest with the 4 document examples I've been looking at and see how it all works out. We support all the major browsers.

cdeblois commented 10 years ago

@duanyao, everything looks fine now. I think this is a reasonable solution for us here. Please let me know if it warrants consideration as a pdf2htmlEX flag/option, and thanks again!

duanyao commented 10 years ago

This CSS trick should be considered a hack, not a universal solution. If a page contains multiple .c elements that overlap each other, some of the text will still be unselectable. Additionally, the z-order of the text is altered, which may sometimes result in an undesired appearance.

So I'd like to keep this bug open.

cdeblois commented 10 years ago

Yes, for now we are exploring the "hack" results through some exhaustive testing. Yes, I'm ok with leaving it open. What I do believe is possible though in our case is an interactive solution where "c" class is the focus. Remove one, consider output, repeat as necessary.
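
A minimal Python sketch of that remove-one-and-inspect loop is shown below. It produces one candidate page per div.c, each with "c" dropped from exactly one div's class list, so a user can review them in turn. The function name variants_without_one_clip is hypothetical, and the regular expression is a simplification that assumes double-quoted class attributes, as in the snippets quoted earlier in this thread. As noted above, children of a div.c are positioned relative to it and clipped by it, so some candidates may shift or reveal content.

```python
import re

# Opening <div ...> tag with a double-quoted class attribute (simplified matching).
DIV_CLASS_RE = re.compile(r'(<div\b[^>]*?\bclass=")([^"]*)(")', re.IGNORECASE)


def variants_without_one_clip(html_text):
    """Yield (index, html) pairs; each html has "c" removed from one div.c only."""
    clip_matches = [
        m for m in DIV_CLASS_RE.finditer(html_text)
        if "c" in m.group(2).split()
    ]
    for index, m in enumerate(clip_matches):
        # Keep every class except the exact name "c" (so "pc", "bf" are untouched).
        kept = " ".join(name for name in m.group(2).split() if name != "c")
        yield index, html_text[:m.start(2)] + kept + html_text[m.end(2):]


# Hypothetical fragment with two clip divs.
page = (
    '<div class="c x0"><div class="t">a</div></div>'
    '<div class="c x1">b</div>'
)
for index, candidate in variants_without_one_clip(page):
    print(index, candidate)
```

Each candidate could be written to its own file and shown to the user, who keeps the first one where the text is selectable and the layout still looks right.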