alexadam / save-as-ebook

Save a web page/selection as an eBook (.epub format) - a Chrome/Firefox/Opera Web Extension
MIT License
1.1k stars, 70 forks

Parse error - o:p tag I think #60

Open malcb opened 2 years ago

malcb commented 2 years ago

I tried to convert a web page and got a parse error. I ran the site through an HTML validator, which suggested that the problem might be the </o:p> tags. These are not standard HTML tags; they are added by MS Word (typical!). I saved the web page, stripped the </o:p> tags, and tried again with the local file. This time there was no parse error, so it looks like the problem is MS, as usual. Perhaps the fix would be to ignore unknown tags rather than throwing an error.
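To illustrate why a prefixed tag like this can trip a parser, here is a minimal sketch using Python's strict XML parser. It is an assumption that save-as-ebook fails for a similar reason; the extension's own parsing code is not shown in this thread, and `parses_as_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup.

    Hypothetical helper for illustration; not part of save-as-ebook.
    """
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Word-style markup: the "o:" prefix is used without a namespace declaration,
# which a strict XML parser rejects.
word_markup = "<p>Some text<o:p></o:p></p>"

# The same markup with the o:p tags removed by hand.
cleaned_markup = "<p>Some text</p>"

print(parses_as_xml(word_markup))
print(parses_as_xml(cleaned_markup))
```

A browser's HTML parser treats the unknown tag as an ordinary element and carries on, which would explain why the page itself renders fine.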

malcb commented 2 years ago

The same parse error also occurs when the web page itself contains errors. These can be invisible errors, such as missing closing tags or corrupt tags, that the browser recovers from so the page still renders OK. I think the browser just ignores the error, so the text still displays and the error goes unnoticed, but the parser in save-as-ebook throws out the text, so the ebook doesn't match the web page.
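The difference between a forgiving parser and a strict one can be sketched as follows. This is only an analogy in Python, assuming the extension behaves roughly like the strict case; `lenient_text`, `strict_text`, and `TextCollector` are hypothetical names, not anything from save-as-ebook.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextCollector(HTMLParser):
    """Collects text content and ignores tag mismatches, like a browser would."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def lenient_text(markup: str) -> str:
    """Extract text with a forgiving HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


def strict_text(markup: str):
    """Extract text with a strict XML parser; return None on a parse error."""
    try:
        return "".join(ET.fromstring(markup).itertext())
    except ET.ParseError:
        return None


# The <p> elements are never closed, which is a common "invisible" error.
broken = "<div><p>First paragraph<p>Second paragraph</div>"

print(repr(lenient_text(broken)))
print(strict_text(broken))
```

The lenient parser still recovers both paragraphs, while the strict one gives up on the whole fragment, which matches the symptom of text going missing from the ebook.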

I have a workaround for anyone having similar problems. The Rewriter extension lets you set up rules for rewriting a page, and these rules apply to the HTML too. Rewriter seems to affect the whole page, not just the visible text. Hence Rewriter can be set to remove all <o:p> and </o:p> tags so that save-as-ebook works OK (unless the page has other errors, which is how I found out that this was another problem). Rewriter can be restricted to specific URLs, so you can limit its effects to just where you need them. The matching and replacing use regex, so it is very powerful if need be, but replacing just the o:p tags does not need anything complicated.
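For reference, a pattern along these lines should cover both the opening and closing o:p tags. This is a sketch in Python rather than Rewriter's own rule syntax, which is not shown here; the pattern and the `strip_o_p_tags` name are assumptions for illustration.

```python
import re

# Matches <o:p>, </o:p>, and opening tags with attributes such as <o:p class="x">.
# The \b keeps other Office tags that merely start with "o:p" (e.g. <o:pict>) intact.
O_P_TAG = re.compile(r"</?o:p\b[^>]*>", re.IGNORECASE)


def strip_o_p_tags(html: str) -> str:
    """Remove o:p tags but keep whatever text sits between them."""
    return O_P_TAG.sub("", html)


sample = '<p class="MsoNormal">Hello<o:p></o:p></p>'
print(strip_o_p_tags(sample))
```

Replacing the match with an empty string removes only the tags themselves, so any text inside an o:p element stays in the page.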