coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Google Custom Searchability #520

Open GarrettHartley opened 9 years ago

GarrettHartley commented 9 years ago

Is there a way to optimize the conversion so that the resulting HTML is (google)searchable?

The resulting HTML code is not searchable by google. I realized this when trying to create a custom search engine for the sites I have converted using this tool.

This seems like a fatal flaw because one of the main benefits for html over pdf is SEO.

This is the google custom search tool I am talking about: https://cse.google.com/cse/

Converted with pdf2htmlEX and NOT searchable with google custom search:

http://education.byu.edu/sites/default/shared/code/SEEL/Library/File_Structure/pre_k/letter-knowledge/a/aachoo-andy/html/pre_k--letter-knowledge--a--aachoo-andy--lesson-plan.html

Converted using DreamWeaver and works with google custom search:

http://education.byu.edu/seel/LessonPlans/Pre-K/Alliteration/A/a_aachoo_andy_alliteration_activity_plans_and_resources.html?iframe=true&width=100%&height=100%;

duanyao commented 9 years ago

I know very little about SEO. Do you think it will be helpful to just change <div> used by pdf2htmlEX to more semantic <hN> and <p>?

GarrettHartley commented 9 years ago

I don't think that would change anything. I'm not familiar with SEO either.

I've been told that the main reasons HTML is preferred to PDF are that HTML supports dynamic content, such as links, and that it is more easily searchable and recognizable by web crawlers.

Is there a setting for this converter that preserves links? I noticed that the converter didn't preserve the functionality of my links.

On the bright side, this PDF to HTML converter looks exactly the same as a pdf!

But it also seems to function the same as a PDF. If so, what's the point?

duanyao commented 9 years ago

pdf2htmlEX should be able to convert links in PDF to HTML links. If not, you can file an issue.

PDFs are not supported natively by all browsers (and maybe never will be), so if you want your PDFs to be reliably accessible on the web, converting to HTML is a good idea.

If you want to add more dynamic content, you can always edit the converted HTML/JS.

GarrettHartley commented 9 years ago

Ok, yeah. That makes sense.

Will you be looking into this SEO issue?

I will let you know if I find anything.

duanyao commented 9 years ago

I'm afraid I don't have the necessary environment to do trial and error on SEO. If you can figure out why the output of pdf2htmlEX is not searchable by Google, maybe I can improve it.

coolwanglu commented 9 years ago

If --split-pages is not enabled, text should be static in the HTML.

KrishnaPG commented 8 years ago

For SEO, the basic need is keyword identification, which is difficult when words get split into individual characters. For example, consider the generated HTML below:

<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">Techno<span class="_ _9"></span>logy Stack </div>

The word Technology is split in the middle by a span tag, which makes it impossible for search engines to classify the document as a 'technology' document. The main problem here is that the inserted span tag accounts for just 1.09 px, which is not really worth the effort in HTML.

For example, here is the rendered HTML in the browser (after turning off the span tag): [image]

In PDF, the 1.09 px could make a large difference across devices, but in HTML (which is essentially responsive, meaning different output for different devices), perhaps these intermediate span elements below a certain threshold should be ignored and not output (especially when they break words).

One possible approach is:

*  eliminating or minimizing tag insertion in the middle of text (where there is no white-space)
*  not generating `span` margins below a certain (configurable?) threshold (e.g. 5 px)
*  while retaining the current pixel-level accuracy for non-textual content (images, control sequences etc.)
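
As a rough illustration of the first point, the following post-processing sketch drops spacer spans that sit between two word characters. It assumes that spacer spans are always empty and carry a class list starting with `_`, as in the example above; the function name `drop_midword_spacers` is made up for this sketch. It does not look up the span's width in the generated CSS, so it is not a real implementation of the pixel threshold.

```python
import re

# Assumption: spacer spans are empty and their class list starts with "_",
# as in <span class="_ _9"></span> from the example above.
# The lookbehind/lookahead require a word character on both sides, so spans
# next to white-space or next to another tag are left untouched.
SPACER = re.compile(r'(?<=\w)<span class="_[^"]*"></span>(?=\w)')


def drop_midword_spacers(html: str) -> str:
    """Remove empty spacer spans that split a word in two."""
    return SPACER.sub("", html)


line = ('<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">'
        'Techno<span class="_ _9"></span>logy Stack </div>')

print(drop_midword_spacers(line))
# <div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">Technology Stack </div>
```

A real threshold would need the width that each `_N` class maps to in the stylesheet, which this sketch ignores.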

The second requirement for SEO is using contextual tags, such as h1, h2, h3...

Presently the generated output uses div classes with varying font-size heights specified (such as <div class="... h1 t m0 x1...">).

Instead, using <h1> tags with the same classes, such as <h1 class='...h1 t m0 x1 ...'>, in place of the div tag is one good option to consider here (after sorting the font sizes and assigning heading levels in decreasing order).
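
A minimal sketch of that sorting step, under stated assumptions: the font sizes per class are already known (in practice they would have to be read from the generated stylesheet), the class names and sizes below are invented for the example, and anything at or below an assumed body-text size stays a `div`. The function `assign_heading_tags` is hypothetical and not part of pdf2htmlEX.

```python
def assign_heading_tags(font_sizes: dict, body_size: float) -> dict:
    """Map font-size classes to h1..h6 in decreasing size order.

    font_sizes: class name -> font size (hypothetical values).
    body_size:  assumed size of running text; sizes at or below it stay "div".
    """
    # Distinct sizes larger than body text, largest first.
    larger = sorted({s for s in font_sizes.values() if s > body_size},
                    reverse=True)
    # Only six heading levels exist, so smaller sizes fall back to "div".
    rank = {size: level for level, size in enumerate(larger[:6], start=1)}
    return {cls: (f"h{rank[size]}" if size in rank else "div")
            for cls, size in font_sizes.items()}


sizes = {"fs0": 48.0, "fs1": 32.0, "fs2": 32.0, "fsa": 12.0}
print(assign_heading_tags(sizes, body_size=12.0))
# {'fs0': 'h1', 'fs1': 'h2', 'fs2': 'h2', 'fsa': 'div'}
```

Classes that share a size share a heading level, so two visually identical headings do not end up at different levels.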

The next important SEO features are title and alt attributes for links and images. But I'm not sure whether that would be easy without some external help.

There are other SEO requirements, such as responsiveness and page-loading speed, which I think can be tackled by the users.

One good way would be to let users choose between pixel-perfect (less SEO) and text-perfect (good SEO, but possibly not pixel-perfect compared to the PDF) when generating the output.

As with zip compression levels and H.264 profiles, a scale of 0 to 9 mapped from pixel-perfect to text-perfect is one good way to go.

pdf2htmlEX might already implement most of these in one form or another; it's just a matter of fine-tuning and figuring out which ones work for SEO.

duanyao commented 8 years ago

@KrishnaPG Thanks for the detailed suggestion!

However, I would be surprised if search engines couldn't handle the noise of span tags: just removing these tags while keeping the text nodes should produce correct text. Do you have any references on this?

Using <hN> and <p> instead of <div> and adding <title> are doable; however, I'm not sure how to test the effect. If anyone can test, I suggest manually editing (or scripting if you can) the output of pdf2htmlEX and seeing what happens.
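
The point about span noise can be checked locally with a small script. The sketch below uses Python's standard `html.parser` to drop all tags and keep only the text nodes of the line quoted earlier in the thread. It shows what a simple tag-stripping extractor produces; whether Google's crawler behaves the same way is not established here. The class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join without a separator so text on both sides of an empty span
    # is glued back together.
    return "".join(parser.chunks)


line = ('<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">'
        'Techno<span class="_ _9"></span>logy Stack </div>')

print(repr(visible_text(line)))
# 'Technology Stack '
```

An extractor that inserted a space at every tag boundary would instead yield "Techno logy Stack", which is the failure mode described above.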

fmalina commented 8 years ago

You might want to look at https://github.com/fmalina/transcript, a post-processing tool for pdf2htmlEX output that provides semantic HTML based on visual design conventions.

duanyao commented 8 years ago

@fmalina Interesting, thanks! @GarrettHartley can you try https://github.com/fmalina/transcript and see whether it makes a difference?