JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.

Python CLI Beautiful Soup 4 Warning #894

Closed duplaja closed 1 year ago

duplaja commented 1 year ago

Hello! I recently started getting the following bs4 warning, when using the CLI tool to update existing stories:

/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(

It does seem to be working, but figured I'd pass this along.

JimmXinu commented 1 year ago

Yeah, in theory EPUBs (v2 anyway) are XHTML, which is supposed to be XML. In practice, it's HTML, which is significantly less rigid.

However, the files are declared to be XHTML still to make epub checkers and some readers happy. Actually using an XML parser fails laughably hard on virtually all HTML pages.
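
To illustrate the point about strict XML parsing, here is a minimal sketch using only the Python standard library rather than bs4 or lxml. The sample markup, `TagCollector`, and `parses_as_xml` are made up for illustration; they are not part of FanFicFare.

```python
from html.parser import HTMLParser
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Typical real-world HTML: an unclosed <br> and an undeclared &nbsp; entity.
SLOPPY_HTML = "<html><body><p>one<br>two&nbsp;three</p></body></html>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        parseString(text)
        return True
    except ExpatError:
        return False


# The HTML parser walks the whole document without complaint.
collector = TagCollector()
collector.feed(SLOPPY_HTML)
collector.close()
```

The HTML parser reads the whole snippet, while `parses_as_xml(SLOPPY_HTML)` reports failure. Markup that looks perfectly ordinary in a browser is simply not well-formed XML.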

So I guess you're left with ignoring it, because I don't think it's a good idea for FFF to start filtering warnings.

mcepl commented 1 year ago

Well, you can filter out this one particular warning. I will send a patch.
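
For readers wondering what filtering "one particular warning" could look like, here is a minimal sketch using the standard `warnings` module. To keep it runnable without bs4 installed, it defines a stand-in class with the same name as the bs4 warning; in real use the category would be imported with `from bs4 import XMLParsedAsHTMLWarning`. `noisy_parse` and `parse_quietly` are hypothetical helpers, not FanFicFare functions.

```python
import warnings


class XMLParsedAsHTMLWarning(UserWarning):
    """Stand-in for bs4's warning class, so the sketch runs without bs4."""


def noisy_parse(data):
    """Hypothetical parser call that emits two different warnings."""
    warnings.warn("looks like XML parsed as HTML", XMLParsedAsHTMLWarning)
    warnings.warn("something else worth seeing", RuntimeWarning)
    return data.upper()


def parse_quietly(data):
    """Run noisy_parse, hiding only XMLParsedAsHTMLWarning."""
    with warnings.catch_warnings():
        # Filter by category, so unrelated warnings still get through.
        warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
        return noisy_parse(data)
```

Filtering by category leaves any other warning visible. That is narrower than a blanket `warnings.simplefilter("ignore")`.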

rala72 commented 4 months ago

@mcepl can you send me this patch as well please?

And can this be documented somewhere please - I have the log right now as well, but it works fine.

mcepl commented 4 months ago

> @mcepl can you send me this patch as well please?

From 8553619c08c02cbe9fa958812bf561d793c74ff5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mat=C4=9Bj=20Cepl?= <mcepl@cepl.eu>
Date: Sat, 10 Jun 2023 14:17:15 +0200
Subject: [PATCH] Ignore BS4 XML compatibility warning.

---
 fanficfare/epubutils.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fanficfare/epubutils.py b/fanficfare/epubutils.py
index 858f5640..a31bf498 100644
--- a/fanficfare/epubutils.py
+++ b/fanficfare/epubutils.py
@@ -10,6 +10,7 @@ logger = logging.getLogger(__name__)

 import os
 import re
+import warnings
 from collections import defaultdict
 from zipfile import ZipFile, ZIP_STORED, ZIP_DEFLATED
 from xml.dom.minidom import parseString
@@ -460,8 +461,10 @@ def make_soup(data):

     ## soup and re-soup because BS4/html5lib is more forgiving of
     ## incorrectly nested tags that way.
-    soup = bs4.BeautifulSoup(data,'html5lib')
-    soup = bs4.BeautifulSoup(unicode(soup),'html5lib')
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        soup = bs4.BeautifulSoup(data,'html5lib')
+        soup = bs4.BeautifulSoup(unicode(soup),'html5lib')

     for ns in soup.find_all('fff_hide_noscript'):
         ns.name = 'noscript'
-- 
2.44.0

(git am should be happy with this)

JimmXinu commented 4 months ago

That's actually a lot more targeted than I thought.

Plus, I couldn't get bs4 to emit any other warning or error regardless of how I mangled the HTML source. So suppressing warnings there doesn't look like we lose anything useful.

Thanks mcepl. I've included this patch and uploaded a new CLI version with it.
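
As a rough illustration of why the suppression stays local: `warnings.catch_warnings()` restores the previous filters when its block exits, so only calls made inside the block are silenced. The sketch below uses hypothetical helper names (`warn_count`, `scoped_call`) and plain `warnings.warn` calls in place of the bs4 parsing.

```python
import warnings


def warn_count(func):
    """Call func while recording warnings; return how many were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return len(caught)


def scoped_call():
    """Emit one warning inside a suppressing block and one outside it."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        warnings.warn("hidden: raised inside the block")
    # The previous filters are restored here, so this one is visible again.
    warnings.warn("visible: raised after the block")
```

Here `warn_count(scoped_call)` counts only the second warning, since the first is raised while the blanket ignore filter is active.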

rala72 commented 4 months ago

> uploaded a new CLI version with it

where can I find it?
I neither saw one in releases nor pip 🤔

JimmXinu commented 4 months ago

Test Versions

mcepl commented 4 months ago

> Thanks mcepl. I've included this patch and uploaded a new CLI version with it.

Just a nit (and I don’t ask you to change anything): if you used git am you would get proper attribution of the commit to me and the correct date of it.

JimmXinu commented 4 months ago

I confess I don't collab with other devs often enough to know all the possibilities. But I'll try to remember if it comes up again.