coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Feature Request: Limit scope of CSS styles #261

Open sathomas opened 10 years ago

sathomas commented 10 years ago

Although it doesn't matter for standalone web pages, in those cases where pdf2htmlEX content is being embedded in another page, it would be nice to have an option to limit the scope of CSS styles. For example, the current rule in base.css:

span {
    display: inline-block;
}

affects all <span> elements on the parent page, even those outside of the pdf2htmlEX content. It would be helpful if it were possible (e.g. as a run-time option) to enclose all the pdf2htmlEX styles in a parent selector, e.g.:

.pdf2htmlex span {
    display: inline-block;
}

It wouldn't even have to be an option if, e.g., the pdf2htmlex class were simply added to the <body> tag.

coolwanglu commented 10 years ago

Actually, it was added at one point, but it increases the specificity, so I'd have to add the prefix to all the generated CSS rules, which might increase the size by a few KB.

I guess the only problematic rule is span. How about changing span to .t span? Does that work?

sathomas commented 10 years ago

@coolwanglu, you're definitely right on both counts:

  1. The <span> rule in base.css is the most problematic, as that will almost certainly conflict with other elements on a page. The suggestion to add a t class will definitely help.
  2. Adding a unique class to all of the CSS selectors will definitely increase the size of the styles. As a check, I looked at the Scientific Paper example as something of a worst case. The total number of selectors in all three <style> blocks is 2292, so adding a 12-byte class selector to all of them (e.g. '.pdf2htmlex ' including the trailing space) will increase the file size by about 27,500 bytes.

OTOH, there are some additional factors to consider:

  1. The total size of the example (without the unique class) is about 1.8 Mbytes, so adding the class only increases the total size by 1.5%. (For smaller and less complex documents, the increase would be a greater percentage.)
  2. Any good web server will be using gzip compression which reduces the absolute size of the files (including the additional class selector) by at least a factor of 2. (Even with compression, though, there would still be a significant increase, e.g. 14KB in the example.)
  3. There are some other rules in the base.css file that are fairly common in web pages (e.g. #sidebar or .loading-indicator) and could well conflict with a larger web page.
  4. Even for the abbreviated classes that pdf2htmlex uses, it's impossible to know for certain that there won't be conflicts. I suppose the worst-case possibility would be a web page that wanted to display two translated PDF documents side-by-side. In that case nearly all the style rules would conflict.
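
The size estimate above can be checked with a few lines of Python. The figures are the ones quoted in the comment; treating 1.8 Mbytes as 1.8 × 10^6 bytes and gzip as exactly a factor-of-2 reduction are assumptions made here for illustration only.

```python
# Figures quoted in the comment above.
selectors = 2292                      # selectors across the three <style> blocks
prefix_bytes = len(".pdf2htmlex ")    # 12 bytes, including the trailing space
total_bytes = 1.8e6                   # assumption: 1 Mbyte taken as 10**6 bytes

added = selectors * prefix_bytes      # extra bytes before compression
percent = 100 * added / total_bytes   # relative growth of the whole file
added_gzipped = added / 2             # assumption: exactly a factor-of-2 reduction

print(added)               # 27504
print(round(percent, 2))   # 1.53
print(added_gzipped)       # 13752.0
```

This matches the rounded values in the comment (about 27,500 bytes, about 1.5%, and roughly 14KB after compression).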

I can't read C++ well enough to assess how much extra effort it would take to add a command-line option to include (or not) an additional CSS selector, but if that's feasible, it might be the most straightforward way to deal with all concerns. (Well, it would at least leave the decision up to the implementation).

coolwanglu commented 10 years ago

@sathomas For now it's easier to specify different class names. You can just change them in css_class_names.cmakelists.txt and rebuild the project.

The top-level ids/classes (e.g. the loading-indicator you mentioned) are considered demos and can be changed in the manifest file. I also left a few interfaces in the pdf2htmlEX js object for specifying different names.

But there are indeed likely to be issues when putting two html files together, in which case even adding a prefix selector won't help: we would need two different prefixes. If the prefix is determined by the user, or taken from the name of the PDF file, the difficulty would be converting the existing rules in base.css.

coolwanglu commented 10 years ago

@sathomas Another option is to add a new macro in the manifest file and prepend it to all generated CSS classes, but the prefix would have to be determined at compile time, which doesn't sound useful.

I've not come up with any good way of processing the .css styles, though.

Btw, did .t span work for you?

sathomas commented 10 years ago

The change .t span is definitely a big help; it resolves the most obvious issues. I'm not sure I'd recommend a compile time option since, as you note, that wouldn't completely address the problem. The simplest approach might be to just post-process the css files by running them through LESS.
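
As a rough illustration of the post-processing idea (not LESS itself, and not part of pdf2htmlEX), the sketch below prepends a parent selector to every rule of a stylesheet. The function name `prefix_selectors` is hypothetical, and the parsing is deliberately simplified: it assumes flat CSS with no comments and no nested blocks such as @media, and it splits selector lists on every comma, so selectors like `:not(a, b)` would be mishandled.

```python
import re

# Matches "selectors { declarations }" for flat rules (no nested braces).
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")

def prefix_selectors(css: str, scope: str) -> str:
    """Return `css` with `scope` prepended to each selector of each rule.

    Hypothetical helper: at-rules (e.g. @font-face) are copied unchanged.
    """
    out = []
    for selectors, body in RULE.findall(css):
        selectors = selectors.strip()
        if selectors.startswith("@"):
            # Leave at-rules alone; they are not element selectors.
            out.append(f"{selectors} {{{body}}}")
            continue
        scoped = ", ".join(f"{scope} {s.strip()}" for s in selectors.split(","))
        out.append(f"{scoped} {{{body}}}")
    return "\n".join(out)

example = "span { display: inline-block; }\n.a, .b { color: red; }"
print(prefix_selectors(example, ".pdf2htmlex"))
```

For this to have any effect, the converted content would also need to sit inside an element carrying the chosen class, as suggested in the original request.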

nicolaasmatthijs commented 10 years ago

@coolwanglu : Is the .t span fix something that can be included in a future release? It is the most obvious rule that can conflict with the page the document is embedded in and would avoid a lot of potential issues.

coolwanglu commented 10 years ago

@sathomas @nicolaasmatthijs OK I just added it to base.css

trnelson commented 10 years ago

Sort of ran into this issue myself and was curious if you could clarify. I'm running 0.11 on an Ubuntu server, but the output generates a global span selector in the CSS. It looks like the release notes specify that this was fixed in 0.11. Just curious if I've missed something. Thanks!

coolwanglu commented 10 years ago

@trnelson One possible reason is that you are using a development version that was built before 0.11 was officially released. Did you get it from the v0.11 tag?

trnelson commented 10 years ago

Thanks @coolwanglu. I don't have it in front of me at the moment, but I definitely installed from the latest Ubuntu PPA for Saucy per the wiki. I know there are later versions available as well; when I do a --version it shows me 0.11, but it's possible it could have been a dev version (sorry, it was on a VM which I've currently backed up and stashed for the time being, but I'll be revisiting very soon.)

coolwanglu commented 10 years ago

@trnelson Oh, the PPA has not been updated for a while; see #310.

trnelson commented 10 years ago

@coolwanglu ooooh, okay I see. It looked to me like it was still version 0.11 which is noted in the release notes as fixed, but maybe I misunderstood. If I'm using that particular version, am I missing out on a lot? The software is pretty amazing and does exactly what I need it to, but unfortunately I'm on Ubuntu and not quite a Linux guru. Is there a way to run a more recent version on Ubuntu? Thanks again for your help!

coolwanglu commented 10 years ago

@trnelson For now I'd recommend compiling from source, although it might not be a comfortable process. Also, please leave further comments on #310 instead of here. Thanks.

duanyao commented 10 years ago

I believe scoped CSS is a perfect solution for this type of issue; however, only Firefox supports it for now (http://caniuse.com/#search=scoped). There is a polyfill for other browsers, but I haven't tried it.

Another solution is the good old iframe.
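
A minimal sketch of the iframe approach, assuming each PDF has already been converted to its own standalone HTML file. The function name `iframe_page`, the file names, and the fixed frame height are all hypothetical choices for illustration; each frame gets its own document, so the styles of one converted file cannot reach the host page or the other frames.

```python
from html import escape

def iframe_page(sources, height_px=800):
    """Build a host page that embeds each converted file in its own iframe.

    Hypothetical helper; `sources` are URLs or relative paths to the
    standalone HTML files produced by separate conversions.
    """
    frames = "\n".join(
        f'<iframe src="{escape(src, quote=True)}" '
        f'style="width:100%;height:{height_px}px;border:0"></iframe>'
        for src in sources
    )
    return f"<!DOCTYPE html>\n<html>\n<body>\n{frames}\n</body>\n</html>\n"

# Hypothetical file names for two converted documents.
print(iframe_page(["first.html", "second.html"]))
```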

coolwanglu commented 10 years ago

@duanyao I think it'd be enough to prefix the css rules as @sathomas originally proposed. At first I worried about the performance, but maybe it's fine now.

duanyao commented 10 years ago

@coolwanglu The conflict introduced by putting two generated html files together is still not resolved, as you discussed above. However, the ability to add a prefix to the css is better than nothing.

coolwanglu commented 10 years ago

@duanyao How about a random prefix for each document? Not sure if this would be overkill.

duanyao commented 10 years ago

I think a random prefix is a good idea. We could let users specify the length of the prefix.
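
One way the random-prefix idea could look, sketched in Python rather than in pdf2htmlEX's own C++ (nothing here reflects existing pdf2htmlEX code or options). The function name `random_prefix` and the choice of alphabet are assumptions; the first character is forced to be a letter because a CSS class name cannot start with an unescaped digit.

```python
import secrets
import string

# Assumption: lowercase letters and digits keep the class names short and valid.
ALPHABET = string.ascii_lowercase + string.digits

def random_prefix(length: int = 8) -> str:
    """Return a random class-name prefix of exactly `length` characters."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # A leading letter keeps the result usable as a CSS class name.
    first = secrets.choice(string.ascii_lowercase)
    rest = "".join(secrets.choice(ALPHABET) for _ in range(length - 1))
    return first + rest

# Each converted document would get its own prefix, e.g. ".<prefix> span { ... }".
print(random_prefix(8))
```

With this alphabet, a length of 8 gives 26 × 36^7 (roughly 2 × 10^12) possible prefixes, so a clash between a handful of documents on one page is very unlikely, though a shorter prefix trades that margin for smaller output.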

Helidium commented 9 years ago

@coolwanglu Is there any progress on this issue? I have a situation where I need to display several pdf files on the same page.

duanyao commented 9 years ago

@Helidium No progress yet. Is iframe feasible in your case?

Helidium commented 9 years ago

@duanyao Not really, as the individual documents represent parts of one whole document that must be displayed all together. If you can tell me where the css file gets generated, I can implement such functionality driven by an input parameter.

duanyao commented 9 years ago

@Helidium iframes can be considered as parts of their parent document. You may also consider merging those PDFs before converting to html, if they are actually parts of one document.

For the css generation code, look at all_manager.xxx.install(). Maybe you can add a random number to those classes for each conversion to avoid collisions.