coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Visibility test for text #64

Closed razamobin closed 9 years ago

razamobin commented 11 years ago

Text that is covered by subsequently drawn images or other objects should not be visible in the final HTML. For each piece of text, we should test its visibility and display only the visible or partially visible parts.

Relevant material:

[Update 2013.10.05] Possible solution:

///////////////////////////////
// Original report

I have a sample PDF that appears to consist of scanned pages. The HTML produced has both images and text for each page; it should contain either text or images, not both.

pdf: https://dl.dropbox.com/u/31309918/dd/F3Znx0Qodh.pdf

html output: https://dl.dropbox.com/u/31309918/dd/F3Znx0Qodh.html

I checked the FAQ and looked through the command line options but didn't discover anything. I'm not sure if there's something I missed. Thanks for reading.

-Raza

coolwanglu commented 11 years ago

Confirmed and working on it.

coolwanglu commented 11 years ago

This PDF first draws the text, then draws the scanned image on top of it, so the real text is covered and invisible but still selectable.

pdf2htmlEX currently cannot detect this; it always grabs all the text and puts it on the top text layer.

I'll try to find a workaround for this.
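To make the paint-order problem concrete, here is a minimal sketch of a visibility test. It is not pdf2htmlEX code: the `Box` and `PaintedObject` types and the `hidden_text` function are hypothetical, and it assumes images are opaque, unclipped, axis-aligned rectangles, which real PDFs do not guarantee.

```cpp
#include <cstddef>
#include <vector>

// Axis-aligned bounding box in page coordinates (hypothetical type).
struct Box {
    double x_min, y_min, x_max, y_max;
};

// True if `outer` fully contains `inner`.
bool contains(const Box& outer, const Box& inner) {
    return outer.x_min <= inner.x_min && outer.y_min <= inner.y_min &&
           outer.x_max >= inner.x_max && outer.y_max >= inner.y_max;
}

// One painted object, listed in the order the page draws it.
struct PaintedObject {
    Box box;
    bool is_text;  // false means an image (assumed opaque here)
};

// For each object, report whether it is text that a later image fully covers.
std::vector<bool> hidden_text(const std::vector<PaintedObject>& objs) {
    std::vector<bool> hidden(objs.size(), false);
    for (std::size_t i = 0; i < objs.size(); ++i) {
        if (!objs[i].is_text) continue;
        // Only objects painted after the text can cover it.
        for (std::size_t j = i + 1; j < objs.size(); ++j) {
            if (!objs[j].is_text && contains(objs[j].box, objs[i].box)) {
                hidden[i] = true;
                break;
            }
        }
    }
    return hidden;
}
```

In the scanned-PDF case above, every text run is painted before the page-sized image, so every run would be reported as hidden. Text painted after the image is left alone, which is the case a blanket hide-everything option would get wrong.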

coolwanglu commented 11 years ago

Is it true (or very common) that, for scanned PDF files, all the text is hidden or covered by the scanned image? If so, I may add an option like 'hide-text' as a workaround.

EDIT: I mean you can actually add more text there, above the images, with any PDF manipulation tool. So --hide-text may break the PDF again. But it's OK if this is not common.

razamobin commented 11 years ago

I'm not sure how common it is. I believe this kind of PDF is created when you OCR a scanned document, so that when using a PDF viewer, you can search on text and it will highlight as expected because the text is almost exactly behind the scanned version of the same text.

jahewson commented 11 years ago

It's relatively common, but was a poor decision on the part of the OCR program - it should have used hidden text rather than just placing an image over the text. Obviously it's too late to do anything about that.

Poppler's pdftohtml has some code to handle this specific problem at line 522 of HtmlOutputDev.cc (http://fossies.org/dox/poppler-0.22.0/HtmlOutputDev_8cc_source.html#l00522), which is probably a good starting point.

522   //----- discard duplicated text (fake boldface, drop shadows)
523   if( !complexMode )
524   { /* if not in complex mode get rid of duplicate strings */
525     HtmlString *str3;
526     GBool found;
527     while (str1)
528     {
529         double size = str1->yMax - str1->yMin;
530         double xLimit = str1->xMin + size * 0.2;
531         found = gFalse;
532         for (str2 = str1, str3 = str1->yxNext;
533             str3 && str3->xMin < xLimit;
534             str2 = str3, str3 = str2->yxNext)
535         {
536             if (str3->len == str1->len &&
537                 !memcmp(str3->text, str1->text, str1->len * sizeof(Unicode)) &&
538                 fabs(str3->yMin - str1->yMin) < size * 0.2 &&
539                 fabs(str3->yMax - str1->yMax) < size * 0.2 &&
540                 fabs(str3->xMax - str1->xMax) < size * 0.2)
541             {
542                 found = gTrue;
543                 //printf("found duplicate!\n");
544                 break;
545             }
546         }
547         if (found)
548         {
549             str2->xyNext = str3->xyNext;
550             str2->yxNext = str3->yxNext;
551             delete str3;
552         }
553         else
554         {
555             str1 = str1->yxNext;
556         }
557     }       
558   } /*- !complexMode */

coolwanglu commented 11 years ago

No, in our case one is text and the other is an image, so they are not duplicates.


jahewson commented 11 years ago

Oh I see... that's annoying. I thought the image was a Type 3 font, but it's not - it really is an image.

jmbowman commented 11 years ago

A PDF file can have multiple layers, and layers containing images and text can be intermixed. I've attached a screenshot of a PDF and the output pdf2htmlEX currently generates that shows a more general example of the problem (look at the stack of receipts).

PDF: https://dl.dropbox.com/u/4804331/Layers_PDF.png

HTML: https://dl.dropbox.com/u/4804331/Layers_HTML.png

Other than that, it did a remarkably good job of replicating that page.

coolwanglu commented 11 years ago

@jmbowman, thanks for the info. Yes, the current design of pdf2htmlEX may be too naive; maybe these will fix it somewhat

but it might be slow and ugly.

btw, can I have that PDF for debugging?

jmbowman commented 11 years ago

Here's that one page of the PDF for testing: https://dl.dropbox.com/u/4804331/layers_bug.pdf

I think that collapsing all of the image layers into a single image is usually a good optimization, except when it breaks like this. I guess one solution would be to have options for always collapsing (smallest), always preserving layers (most correct), or only preserving layers for specific pages which you know you'll need it for (best results with extra effort). Automatically figuring out which pages those are would be nice, but could be a separate improvement.

coolwanglu commented 11 years ago

Actually pdf2htmlEX is pdf-to-image with text extracted.

Besides layers, clipping paths are the biggest problem: a rectangular image may be displayed as a circle because of its clipping path, which cannot be done easily in HTML.

Still looking for a solution.


coolwanglu commented 11 years ago

@razamobin A new option --fallback is now available, which renders PDF files as an image plus hidden text. Usually this increases the output size, but not for scanned PDFs. So please give it a try.

duanyao commented 10 years ago

@coolwanglu I have implemented an initial version of "covered text handling". Characters covered by images are detected and (1) made transparent in the text layer and (2) drawn in the background layer.

There are still things to do:

coolwanglu commented 10 years ago

@duanyao Cool! I'm a little worried about performance; you might want to take a look at rtree in Boost. Or you can leave the interface flexible so that I can fix it later. Also, it might not be a good idea to create a separate chars_covered array, as it would make this more difficult to optimize in the future. Currently it's similar to PDF, where we record text and state changes. But that's not elegant either. I probably need to make it an array that stores the state for each character.

duanyao commented 10 years ago

Thanks for recommending rtree, I'll take a look at it later.

coolwanglu commented 10 years ago

I think you can keep chars_covered for now, as I probably don't have time to rework the data structure. I wonder if std::vector<char> would be better, because I remember that std::vector<bool> packs its elements into bits to save memory, but is rather slow.
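For reference, the difference between the two containers can be checked at compile time. The sketch below is illustrative only: `count_covered` is a hypothetical helper, and whether std::vector<bool> is actually slower depends on the standard library and the access pattern, so it would need measuring.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// std::vector<bool> is a space-optimized specialization: operator[] returns a
// proxy object instead of a real bool&, because single bits are not addressable.
static_assert(!std::is_same<std::vector<bool>::reference, bool&>::value,
              "vector<bool>::reference is a proxy type");

// std::vector<char> stores one byte per flag and hands out ordinary references.
static_assert(std::is_same<std::vector<char>::reference, char&>::value,
              "vector<char>::reference is a plain char&");

// Count the covered characters when the flags are stored as 0/1 bytes.
std::size_t count_covered(const std::vector<char>& chars_covered) {
    std::size_t n = 0;
    for (char c : chars_covered) n += (c != 0);
    return n;
}
```

The trade-off is memory against simplicity: the byte-per-flag version uses more space per character but avoids the bit masking and proxy objects of the packed version.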

Can you create a separate class for the hittest? I don't want everything inside HTMLRenderer. Besides, it will be easier to adapt to other data structures.

An R-tree should give an average time of O(n log m), and O(n sqrt(m)) at least. 0.2s is not fast, as I see sites using pdf2htmlEX to convert thousands of PDFs, or a single file containing thousands of pages. But on the other hand, font and image processing may be even slower, so this may not be the bottleneck.

Can you create an option for this? This is experimental right now. Somebody might prefer performance.
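A separate hit-test class with a swappable implementation could look roughly like the sketch below. All names here (`BBox`, `HitTester`, `BruteForceHitTester`) are hypothetical and are not taken from the pdf2htmlEX source. The brute-force version scans every recorded character for each image, so it costs O(n·m) for n characters and m images; an R-tree-backed class could implement the same interface later.

```cpp
#include <cstddef>
#include <vector>

// Axis-aligned bounding box (hypothetical type).
struct BBox {
    double x_min, y_min, x_max, y_max;
};

// Two boxes overlap if their interiors intersect on both axes.
inline bool intersects(const BBox& a, const BBox& b) {
    return a.x_min < b.x_max && b.x_min < a.x_max &&
           a.y_min < b.y_max && b.y_min < a.y_max;
}

// Interface the renderer would talk to, so the data structure can change.
class HitTester {
public:
    virtual ~HitTester() = default;
    // Record a character, in drawing order.
    virtual void add_char(const BBox& box) = 0;
    // Record an image; it may cover characters recorded before it.
    virtual void add_image(const BBox& box) = 0;
    // One flag per recorded character: nonzero means covered.
    virtual const std::vector<char>& chars_covered() const = 0;
};

// Simplest possible implementation: linear scan per image.
class BruteForceHitTester : public HitTester {
public:
    void add_char(const BBox& box) override {
        boxes_.push_back(box);
        covered_.push_back(0);
    }
    void add_image(const BBox& box) override {
        for (std::size_t i = 0; i < boxes_.size(); ++i) {
            if (intersects(box, boxes_[i])) covered_[i] = 1;
        }
    }
    const std::vector<char>& chars_covered() const override { return covered_; }

private:
    std::vector<BBox> boxes_;
    std::vector<char> covered_;
};
```

Because characters are only tested against images added after them, a character drawn on top of an image stays visible. This sketch marks partially overlapped characters as covered, which is a simplification.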

coolwanglu commented 10 years ago

@duanyao Probably you could create a PR when you think it's ready, and it'll be a better place to discuss. Thanks!

duanyao commented 10 years ago

Sure. First I want to fix the broken SplashBackgroundRenderer, and this may introduce conflicts with pending PR #360, so I want to do it after #360 is merged.

zogwarg commented 10 years ago

I don't know if this is a related problem: text which is completely transparent in the PDF (but selectable) appears in the HTML result (most telling on page 15): http://zogwarg.free.fr/pdftohtml/1_NOR.pdf http://zogwarg.free.fr/pdftohtml/1_NOR.html

EDIT:

I understand it better now: pdf.js works like fallback mode, by rendering the page as an image and making all the text transparent.

Is the non rendered text hidden behind images in the actual pdf?

zogwarg commented 10 years ago

I tested the covered text, and it didn't work

duanyao commented 10 years ago

@zogwarg, how did you test the "covered text"? Did you build the covered_text_handling branch with cmake -DENABLE_SVG=ON, and pass --correct-text-visibility 1 at runtime? It is off by default.

I tested your PDF's p15, --correct-text-visibility 1 worked as expected.

zogwarg commented 10 years ago

I didn't have -DENABLE_SVG=ON. Thanks, I'll test it again.

bilalmughal commented 9 years ago

@duanyao Did you try converting this document? All the text becomes invisible when it is converted with --correct-text-visibility 1 and --bg-format svg.

Output html can be downloaded from here

coolwanglu commented 9 years ago

I'm closing this issue as there's already an implementation for this. Please create new issue with sample files if it's not working well.

duanyao commented 9 years ago

@bilalmughal I can reproduce your issue. However, it is not related to --correct-text-visibility; it seems to be a problem in poppler's (or cairo's) SVG renderer. When I convert your file with poppler's pdftocairo -svg command, the output SVG also looks blank in Chrome, Firefox, and Inkscape, though it shows some text in the GNOME image viewer. I suggest you report the pdftocairo -svg bug to poppler (https://bugs.freedesktop.org/buglist.cgi?quicksearch=poppler&list_id=457322) if you can.

bilalmughal commented 9 years ago

@duanyao Thanks for looking into it. I have reported it to poppler.

duanyao commented 9 years ago

@bilalmughal could you post the link to the poppler's bug so that we can track the progress?

bilalmughal commented 9 years ago

@duanyao sure here it is https://bugs.freedesktop.org/show_bug.cgi?id=86093