Open cdeblois opened 10 years ago
The text selection issue in your file is not related to --correct-text-visibility; it also occurs without any parameters.
The problem is that there is a div.c near the end of the first page of the generated HTML that covers the whole page, which makes almost all text unselectable, except for the div.c itself.
A div.c corresponds to a clip path in the PDF file and uses overflow:hidden to achieve a clipping effect. Unfortunately, it has the side effect of blocking user interaction (including selection) with elements below the div.c.
In fact, --correct-text-visibility already handles clipping quite well and should not have such a side effect. So can we suppress div.c when --correct-text-visibility is on? Can we use CSS clipping when --correct-text-visibility is off? Or just get rid of div.c completely?
@cdeblois, your issue title and content do not seem quite accurate. Can you modify them?
@duanyao - Is that title better for you?
Yes, that's ok.
@duanyao, please let me know if you think you can make the suggested changes for me to try out.
Thanks
@cdeblois I think it isn't a trivial task to fix this in pdf2htmlEX, and it may take time.
If you can afford to modify the output HTML by hand, you can locate the problematic div.c and assign the style z-index: -1 to it. However, this can sometimes make the stacking order of text incorrect.
Not sure I understand what you mean by div.c. My understanding is that it's not always the same div causing the glass effect.
div.c means div elements that have class c. In pdf2htmlEX-generated HTML, class c means "clip".
Yes, this issue depends on how a PDF uses clips. You can use the devtools in browsers (usually the F12 key) to locate the problematic div.c, and then modify it with a text editor.
Introduction to devtools in browsers: http://www.labnol.org/software/chrome-dev-tools-tutorial/28131/ https://developer.mozilla.org/en-US/docs/Tools/Page_Inspector
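For anyone who cannot inspect pages by hand, the candidates can also be listed with a script. The following is a minimal sketch using only the Python standard library; the names ClipDivFinder and find_clip_divs and the sample markup are made up for illustration, and it assumes the clip elements are div tags whose class list contains the token c, as described above. It only reports where they are; it does not decide which one is blocking selection.

```python
from html.parser import HTMLParser


class ClipDivFinder(HTMLParser):
    """Collects every <div> whose class list contains the token "c"."""

    def __init__(self):
        super().__init__()
        self.matches = []  # list of (line, column, full class string)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        class_value = dict(attrs).get("class") or ""
        # Compare whole tokens so that classes such as "pc" are not matched.
        if "c" in class_value.split():
            line, column = self.getpos()
            self.matches.append((line, column, class_value))


def find_clip_divs(html_text):
    """Return the position and class string of each div.c in html_text."""
    finder = ClipDivFinder()
    finder.feed(html_text)
    finder.close()
    return finder.matches


# Hypothetical sample markup, loosely modelled on the class lists quoted in this thread.
sample = (
    '<div class="pc">\n'
    '<div class="c x0 y0 w2 h0"><div>some text</div></div>\n'
    '</div>\n'
)
clip_divs = find_clip_divs(sample)
for line, column, classes in clip_divs:
    print(f'line {line}, column {column}: class="{classes}"')
```

The reported line numbers can then be used to find the element in a text editor or to drive further automated edits.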
Thanks. Optimally, we are looking for a programmatic solution here as I don't have the option to do this manually, however, a programmatic solution to give options to the users may be workable.
So, do you believe that if this class 'c' is removed that it will resolve this issue of non-selectable text?
Is it only this class 'c' that would be causing these types of issues?
I've tried manually removing the class 'c' in articles where this issue exists, and it seems to mess up the formatting unless you remove the class from the right div tag. Even then, there is still formatting loss in some cases.
I'm just thinking that we could offer the user the option to remove one class 'c' at a time and visually inspect the result. If that doesn't help or messes up the formatting, we would give the user the option to try removing another class 'c'.
No, don't remove div.c (or some text will be removed as well); you should add the attribute style="z-index:-1" to it. If you are not familiar with HTML and CSS and don't understand what I'm saying, you will have to wait until this issue gets fixed. However, there is no schedule, sorry.
My observation is that when doing this it actually causes text to be removed versus just removing "c" from the class list. Most of the manual tests I've tried with a variety of documents seem to allow text to be selected after just removing the "c" class from the div.
No changes
this text is not selectable
With only style="z-index:-1" added to the div (<div class="c x0 y0 w2 h0" style="z-index: -1">) or added to the class (.c {border: 0 none; display: block; margin: 0; overflow: hidden; padding: 0; position: absolute; z-index: -1;}):
there is missing content
With only the "c" class removed from the class list (<div class="x0 y0 w2 h0">):
the text is selectable
Seems that the selective removal of the "c" class is solving most of our problems here. Not sure I'm seeing much apparent value when the "c" class is being used. Does this make any sense to you?
Also, I guess I didn't understand who your questions above were directed to:
"So can we suppress div.c when --correct-text-visibility is on? Can we use CSS clipping when --correct-text-visibility is off? Or just get rid of div.c completely?" Seems like I decided to run with one of your ideas and it is working for me.
It seems you need to add a rule .pc { z-index: 0; } to make the z-index: -1 trick work in Chrome. The complete rules could be:
.c { z-index: -1; }
.bf { z-index: -1; } /* bring background image below .c */
.pc { z-index: 0; } /* ensure a new stacking context on chrome*/
Removing class c is more problematic, because its children are positioned relative to it and clipped by it.
Those questions were asked to other developers of pdf2htmlEX.
@duanyao, are you suggesting to selectively apply the above? The reason I ask is that if you apply it globally as you have shown, all content is removed in all browsers.
Add all 3 rules to the CSS. It works in Firefox and Chrome for me. What browsers did you test?
Oh, you are probably using a png/jpg background image (I was using svg). In this case .bf becomes .bi, so the rules to be added are:
.c, .bf, .bi { z-index: -1; }
.pc { z-index: 0; }
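Since a programmatic approach was asked for earlier in the thread, here is a minimal sketch of applying these rules to a generated file with a script. It is not part of pdf2htmlEX; the function name inject_workaround is made up, and it assumes the output contains a closing head tag and that a style block placed last in the head is enough to override the generated rules.

```python
import re

# The rules quoted in the comment above, covering both svg (.bf) and png/jpg (.bi) backgrounds.
WORKAROUND_CSS = (
    ".c, .bf, .bi { z-index: -1; }\n"
    ".pc { z-index: 0; }\n"
)


def inject_workaround(html_text, css=WORKAROUND_CSS):
    """Return html_text with a <style> block inserted just before </head>."""
    match = re.search(r"</head\s*>", html_text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no </head> tag found; cannot place the style block")
    style_block = "<style>\n" + css + "</style>\n"
    # Inserting at the end of <head> puts these rules after any earlier stylesheet.
    return html_text[:match.start()] + style_block + html_text[match.start():]


# Hypothetical minimal page used only to show the effect.
page = "<html><head><title>t</title></head><body></body></html>"
patched = inject_workaround(page)
print(patched)
```

The same caveats apply as for the hand-written rules: this changes stacking order and is a workaround, so the result should still be checked visually.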
@duanyao, that seems better, Thx!....I will retest with the 4 document examples I've been looking at and see how it all works out. We support all the major browsers.
@duanyao, everything looks fine now. I think this is a reasonable solution for us here. Please let me know if it warrants consideration as a pdf2htmlEX flag/option, and thanks again!
This CSS trick should be considered a hack, not a universal solution. If there are multiple .c elements in a page and they overlap each other, some of the text will still be unselectable. Additionally, the z-order of text is altered, which may sometimes result in an undesired appearance.
So I'd like to keep this bug open.
Yes, for now we are exploring the "hack" results through some exhaustive testing. Yes, I'm ok with leaving it open. What I do believe is possible though in our case is an interactive solution where "c" class is the focus. Remove one, consider output, repeat as necessary.
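A rough sketch of the "remove one, consider output, repeat" idea follows. The function drop_clip_class is hypothetical, works on the raw HTML text with a regular expression, and assumes double-quoted class attributes like the ones quoted in this thread. As noted earlier, children of a div.c are positioned relative to it and clipped by it, so each result still needs a visual check.

```python
import re

# Matches the start of a <div ...> tag up to and including its class attribute value.
DIV_CLASS_RE = re.compile(r'(<div\b[^>]*?\sclass=")([^"]*)(")', re.IGNORECASE)


def drop_clip_class(html_text, index):
    """Return html_text with class "c" removed from the index-th div.c (0-based)."""
    seen = -1

    def replace(match):
        nonlocal seen
        tokens = match.group(2).split()
        if "c" not in tokens:
            return match.group(0)  # not a clip div, leave untouched
        seen += 1
        if seen != index:
            return match.group(0)  # a clip div, but not the one requested
        kept = " ".join(token for token in tokens if token != "c")
        return match.group(1) + kept + match.group(3)

    result = DIV_CLASS_RE.sub(replace, html_text)
    if seen < index:
        raise IndexError("the document has fewer div.c elements than requested")
    return result


# Hypothetical sample with two clip divs; only the second one is changed.
sample = (
    '<div class="pc"><div class="c x0 y0 w2 h0">a</div>'
    '<div class="c x1 y1 w3 h1">b</div></div>'
)
changed = drop_clip_class(sample, 1)
print(changed)
```

An interactive tool could call this with index 0, 1, 2, and so on, always starting from the original file, and let the user keep whichever variant looks right.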
I'm using this pdf: http://journal.frontiersin.org/Journal/10.3389/fphy.2013.00021/pdf.
When transformed using --correct-text-visibility flag it transforms nearly perfectly. However, the first page of text is not selectable at all. All pages after seem to be fine.
I'm using the 'incoming' branch version 0.13 for this.