coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

Add option to specify path to external resources #162

Open ChrisCinelli opened 11 years ago

ChrisCinelli commented 11 years ago

Same scenario discussed previously. We host on Amazon S3, load the HTML through Ajax, and write it into an iframe with document.write. Everything works, except that some converted documents are a few megabytes and take ages to load.

The solution would be to not embed images and fonts. The problem is that the iframe's URL is on my server while the images are on S3, so there is a path mismatch and they are not loaded properly. I could parse the HTML client side and do the substitution, but it would be cleaner to have a parameter (resource_path maybe?) that can be specified on the command line.
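For reference, the client-side substitution could look roughly like the sketch below. It is only an illustration: prefixResources and isRelative are hypothetical names, and it assumes the converted HTML references its external files through src="...", href="..." or CSS url(...) with relative paths. Absolute URLs, data URIs and in-page anchors are left alone.

```typescript
// Hypothetical helper, not part of pdf2htmlEX.
// True for non-empty URLs that have no scheme, no leading "//" and are not "#anchors".
function isRelative(url: string): boolean {
  return url !== "" && !/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(url);
}

// Prefix every relative resource URL in the HTML string with `prefix`.
function prefixResources(html: string, prefix: string): string {
  // Normalize the prefix so it ends with exactly one slash.
  const base = prefix.replace(/\/+$/, "") + "/";
  const fix = (url: string): string =>
    isRelative(url) ? base + url.replace(/^\.?\/+/, "") : url;

  return html
    // src="..." and href="..." attributes, double or single quoted
    .replace(/\b(src|href)=(["'])(.*?)\2/gi,
      (_m: string, attr: string, q: string, url: string) =>
        `${attr}=${q}${fix(url)}${q}`)
    // url(...) references in inline CSS, e.g. inside @font-face rules
    .replace(/url\(\s*(["']?)(.*?)\1\s*\)/gi,
      (_m: string, q: string, url: string) => `url(${q}${fix(url)}${q})`);
}
```

The rewritten string would then be passed to document.write in place of the original. A regex pass like this is crude: it also prefixes relative href values on ordinary links, so a real implementation may need to be more selective.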

So, for example, I could use --resource_path="http://mybucket.s3.amazonaws.com/thisdocumentfolder" and every external resource referenced in the document would have this path prefixed.
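To make the intended behaviour concrete, here is a minimal sketch of how such a prefix might be combined with a generated file name. withResourcePrefix is a hypothetical name, and the rule that an empty prefix leaves the name untouched is an assumption, not something pdf2htmlEX is known to do.

```typescript
// Hypothetical: combine the proposed option value with a resource file name.
function withResourcePrefix(prefix: string, fileName: string): string {
  // Assumption: with no prefix given, the name stays relative.
  if (prefix === "") return fileName;
  // Drop trailing slashes so the result contains exactly one separator.
  return prefix.replace(/\/+$/, "") + "/" + fileName;
}

const fontUrl = withResourcePrefix(
  "http://mybucket.s3.amazonaws.com/thisdocumentfolder",
  "f1.woff"
);
// fontUrl is "http://mybucket.s3.amazonaws.com/thisdocumentfolder/f1.woff"
```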

I hope it is clear what I am asking for. If you do not have a lot of time and you can point me to the files that need to be modified, I can do the fix.

coolwanglu commented 11 years ago

Yes, I see what you mean. You may want to take a look at the embed_file function in src/HTMLRenderer/general.cc. resource_path looks a little confusing to me; at first glance I thought it was about the resources of pdf2htmlEX itself. Maybe resource_prefix?