GarrettHartley opened this issue 9 years ago
I know very little about SEO. Do you think it would be helpful to just change the `<div>` tags used by pdf2htmlEX to more semantic `<hN>` and `<p>` tags?
I don't think that would change anything. I'm not familiar with SEO either.
I've been told that the main reasons HTML is preferred to PDF are that HTML supports dynamic content, such as links, and that it is more easily searched and recognized by web crawlers.
Is there a setting for this converter that preserves working links? I noticed that this converter didn't preserve the functionality of my links.
On the bright side, the output of this PDF-to-HTML converter looks exactly the same as the PDF! But it also seems to function the same as a PDF. If so, what's the point?
pdf2htmlEX should be able to convert links in a PDF into HTML links; if not, you can file an issue.
PDFs are not natively supported by all browsers (and maybe never will be), so if you want your PDFs to be reliably accessible on the web, converting to HTML is a good idea.
If you want to add more dynamic content, you can always edit the converted HTML/JS.
Ok, yeah. That makes sense.
Will you be looking into this SEO issue?
I will let you know if I find anything.
I'm afraid I don't have the necessary environment to do trial and error on SEO. If you can figure out why the output of pdf2htmlEX is not searchable by Google, maybe I can improve it.
If `--split-pages` is not enabled, the text should be static in the HTML.
For SEO, the basic need is keyword identification, which is difficult when words get split into individual fragments. For example, consider the generated HTML below:
<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">Techno<span class="_ _9"></span>logy Stack </div>
The word `Technology` is split in the middle by a `span` tag, which makes it impossible for search engines to classify the document as a 'technology' document. The main problem here is that the injected `span` tag accounts for just 1.09 px of offset, which is not really worth the effort in HTML.
For example, here is the rendered HTML in a browser (after turning off the `span` tag):
In a PDF that 1.09 px could make a large difference across devices, but in HTML (which is essentially responsive, meaning different output for different devices), these intermediate `span` elements below a certain threshold should perhaps be ignored and not output, especially when they break words.
One possible approach is:
* eliminating/minimizing tag insertion in the middle of text (where there is no whitespace)
* not generating `span` margins below a certain (configurable?) threshold (e.g. 5 px)
* retaining the current pixel-level accuracy for non-textual content (images, control sequences, etc.)
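As a rough sketch of the first two bullets (a hypothetical post-processing step over the generated HTML, not an existing pdf2htmlEX option), the small empty `span` offsets injected mid-word could be stripped like this:

```python
import re

# The example line from the generated output above.
html = ('<div class="t m0 x5 hb y14 ff1 fsa fc7 sc0 ls0 ws0">'
        'Techno<span class="_ _9"></span>logy Stack </div>')

# Drop empty offset-only <span> elements (in the example output their
# class names start with "_"); removing them rejoins the split word
# while leaving every other tag and attribute untouched.
cleaned = re.sub(r'<span class="_[^"]*"></span>', '', html)

print(cleaned)
# prints the same div with the text "Technology Stack " intact
```

A real implementation would of course read the offset from the stylesheet and only drop spans below the configured threshold; the class-name pattern here is just what the example above happens to use.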
The second requirement for SEO is using contextual tags such as `h1`, `h2`, `h3`, and so on. Presently the generated output uses `div` elements with classes that specify varying font-size heights (such as `<div class="... h1 t m0 x1...">`). Using `<h1>` tags with the same classes, such as `<h1 class='...h1 t m0 x1 ...'>`, in place of the `div` tag is one good option to consider here (after sorting the font sizes and assigning heading levels in decreasing order).
The next important SEO features are the `title` and `alt` attributes for links and images, but I'm not sure that would be easy without some external help.
There are other SEO requirements, such as responsiveness and page-loading speed, which I think can be tackled by the users.
One good way would be to let users choose, when generating the output, between pixel-perfect (worse for SEO) vs. text-perfect (good for SEO, but possibly not pixel-identical to the PDF).
Like the compression levels in zip or the encoder presets in H.264, profiles on a scale of 0 to 9, mapping from pixel-perfect to text-perfect, would be one good way to go.
pdf2htmlEX might already implement most of these in one form or another; it's just a matter of fine-tuning and figuring out which options work for SEO.
@KrishnaPG Thanks for the detailed suggestion!
However, I would be surprised if search engines couldn't handle the noise of `span` tags; just removing these tags while keeping the text nodes should produce the correct text. Do you have any references on this?
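For what it's worth, a quick check with Python's built-in `html.parser` supports this: keeping only the text nodes of the example line from earlier in the thread yields the word intact:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes, ignoring all tags and attributes."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<div class="t m0">Techno<span class="_ _9"></span>logy Stack </div>')

print(''.join(extractor.parts))
# prints "Technology Stack " -- the empty span contributes no text,
# so the word comes out joined
```

Whether crawlers actually normalize text this way is the open question; this only shows that the information is recoverable.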
Using `<hN>` and `<p>` instead of `<div>`, and adding `<title>`, are doable; however, I'm not sure how to test the effect. If anyone can test, I suggest manually editing (or scripting, if you can) the output of pdf2htmlEX and seeing what happens.
You might want to look at https://github.com/fmalina/transcript, a post-processing tool for pdf2htmlEX output that produces semantic HTML based on visual design conventions.
@fmalina Interesting, thanks! @GarrettHartley can you try https://github.com/fmalina/transcript and see whether it makes a difference?
Is there a way to optimize the conversion so that the resulting HTML is (Google-)searchable?
The resulting HTML is not searchable by Google. I realized this when trying to create a custom search engine for the sites I converted using this tool.
This seems like a fatal flaw, because one of the main benefits of HTML over PDF is SEO.
This is the google custom search tool I am talking about: https://cse.google.com/cse/
Converted with pdf2htmlEX and NOT searchable with google custom search:
http://education.byu.edu/sites/default/shared/code/SEEL/Library/File_Structure/pre_k/letter-knowledge/a/aachoo-andy/html/pre_k--letter-knowledge--a--aachoo-andy--lesson-plan.html
Converted using DreamWeaver and works with google custom search:
http://education.byu.edu/seel/LessonPlans/Pre-K/Alliteration/A/a_aachoo_andy_alliteration_activity_plans_and_resources.html?iframe=true&width=100%&height=100%;