coolwanglu / pdf2htmlEX

Convert PDF to HTML without losing text or format.
http://coolwanglu.github.com/pdf2htmlEX/

pdf2htmlEX - output html source code #761

Open MBhat6 opened 6 years ago

MBhat6 commented 6 years ago

I have an issue with pdf2htmlEX output. I created HTML output for my PDF document, and it renders nicely. But in the source code I see that the words are broken up and separated by $, !, spaces, and spans. At times there are a lot of ! and $ signs.

In my program I generate the HTML file, search the text for keywords, and insert tags to highlight them. Because of the broken words, I can't find my keywords in this output. The browser search works great, however.

Any suggestions or workarounds are appreciated.
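
A likely reason the browser search works while a search over the source fails is that the browser matches against the rendered text, with the tags dropped. The sketch below illustrates that difference using only the Python standard library. The sample markup and the names `TextExtractor` and `visible_text` are made up for illustration and are not taken from real pdf2htmlEX output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(html):
    """Return the concatenated text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Illustrative markup only: a word split in two by an empty span.
sample = '<div class="t">high<span class="_ _0"></span>light this key<span class="_ _1"></span>word</div>'

print("keyword" in sample)                # False: the span splits the word in the source
print("keyword" in visible_text(sample))  # True: the text itself is intact
```

Searching the extracted text finds the keyword, but the match still has to be mapped back to the markup before a highlight tag can be inserted.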

ebbandari commented 6 years ago

I face the same issue. I'm not sure why there are so many $ and ! signs, and some words have a space added in the middle. The text seems to be all on one line, too. Is there a way to create cleaner files, or to convert this file to cleaner HTML?

mortenmoulder commented 6 years ago

I have the same issue here. I need to replace a bunch of words in the HTML file, but because of these <span>-tags everywhere, I can't search and replace.

I wonder if it's possible to stop that from happening in the source. So it won't break up words.

ebbandari commented 6 years ago

You can remove the spans intelligently, but be careful. The licence is non-commercial :-( We are going to move on. E

-- Esfandiar Bandari, PhD, MBA CEO Textnomics Inc. e@resumesort.com e@textnomics.com, ebbandari@alumni.stanford.edu ebbandari@stanford.alumni.edu, e.bandari@cantab.net, e.bandari@gmail.com Cell: (650) 862-8351 skype: ebbandari & gtalk: e.bandari http://www.linkedin.com/in/ebandari

mortenmoulder commented 6 years ago

@ebbandari Non-commercial? As far as I know, GPLv3 licensed software can be used for commercial use as much as you want.

ebbandari commented 6 years ago

Ehhhh... I would make sure to check that with a lawyer. And I would love to hear more, when you find out. Thanks.

mortenmoulder commented 6 years ago

@ebbandari https://gist.github.com/kn9ts/cbe95340d29fc1aaeaa5dd5c059d2e60

ebbandari commented 6 years ago

What you supplied was someone's personal opinion and not a legal one. Even then, from your own link:

  1. If you dare build your business solely from this code, you risk open-sourcing the whole code base.
  6. If you modify it, you have to indicate changes made to the code.
  7. Any modifications of this code base MUST be distributed with the same license, GPLv3.

Here is a better link, from the GNU FAQ itself: https://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySystem

Finally, and I do mean FINALLY, proceed as you see fit. Elsevier is using pdf2htmlEX to publish their journal articles online. Caveat emptor: they do not have software to release or make public. Ergo, your mileage may vary. Look, the law is the law and you cannot wish it away. We decided not to risk it.

Respectfully, Esfandiar

mortenmoulder commented 6 years ago

@ebbandari Exactly. I can use it for whatever I want, but if I go out and make a "PDF to HTML converter" and use pdf2htmlEX as my tool, my business is built solely from the code of pdf2htmlEX (if pdf2htmlEX did not exist, neither would my product).

As long as we use them as tools, we can use them as much as we want.

subodhkalika commented 6 years ago

@MBhat6 Yes, there is a solution. I had the same issue. I had to highlight keywords in an HTML document that rendered correctly but had its words broken up by spans.

I used BeautifulSoup, a Python package, to parse the HTML and mark (highlight) the keywords.
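
The comment does not include the code, so the following is only a rough sketch of how such an approach might look, not the commenter's actual solution. It assumes each line of text sits in an element matched by the CSS selector `div.t` (an assumption about the output markup; adjust it to what your files contain), and the function name `highlight_keywords` is hypothetical. It requires the `beautifulsoup4` package.

```python
import re

from bs4 import BeautifulSoup


def highlight_keywords(html, keywords, line_selector="div.t"):
    """Wrap keyword matches in <mark>, even if spans split them in the source.

    line_selector is an assumed selector for the elements that hold one line
    of text each; it is not guaranteed to match every pdf2htmlEX file.
    """
    if not keywords:
        return html
    soup = BeautifulSoup(html, "html.parser")
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

    for line in soup.select(line_selector):
        text = line.get_text()  # text of the line with inner tags dropped
        if not pattern.search(text):
            continue  # leave lines without a match untouched
        line.clear()  # discard the inner spans of this line
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                line.append(text[pos:match.start()])
            mark = soup.new_tag("mark")
            mark.string = match.group(0)
            line.append(mark)
            pos = match.end()
        if pos < len(text):
            line.append(text[pos:])
    return str(soup)


# Illustrative markup only: "keyword" is split by an empty span.
sample = '<div class="t">key<span class="_"></span>word here</div>'
print(highlight_keywords(sample, ["keyword"]))
```

The trade-off in this sketch is that the inner spans of every matching line are discarded, so any spacing they carried is lost on that line; lines without a match are left as they were. Keywords that straddle two lines are not found, because each line is searched on its own.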