duplaja closed this issue 1 year ago
Yeah, in theory EPUBs (v2 anyway) are XHTML, which is supposed to be XML. In practice, it's HTML, which is significantly less rigid.
However, the files are declared to be XHTML still to make epub checkers and some readers happy. Actually using an XML parser fails laughably hard on virtually all HTML pages.
So I guess you're left with ignoring it, because I don't think it's a good idea for FFF to start filtering warnings.
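To illustrate the strict-versus-forgiving point above, here is a minimal sketch using only the standard library. It is not FFF's code: `html.parser` stands in for html5lib, and the `sloppy` markup and `TagCollector` class are made up for the example.

```python
from html.parser import HTMLParser
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical chapter markup: <p> and <br> are never closed,
# which is routine in HTML but not well-formed XML.
sloppy = "<html><body><p>unclosed paragraph<br></body></html>"


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The HTML parser walks through the sloppy markup without complaint.
collector = TagCollector()
collector.feed(sloppy)
collector.close()

# The XML parser rejects the same input as not well-formed.
try:
    parseString(sloppy)
    xml_ok = True
except ExpatError:
    xml_ok = False

print(collector.tags, xml_ok)
```

The same input that the HTML parser accepts makes the XML parser raise, which is why switching to an XML parser is not a practical way to silence the warning.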
Well, you can filter out this one particular warning. I will send a patch.
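For reference, filtering a single warning category could look roughly like the sketch below. `noisy_parse` is a hypothetical stand-in for a parse call, and the fallback class only exists so the snippet runs without bs4 installed; with bs4 present, the real `XMLParsedAsHTMLWarning` is imported.

```python
import warnings

try:
    from bs4 import XMLParsedAsHTMLWarning
except ImportError:
    # Stand-in (assumption) so the sketch runs without bs4 installed.
    class XMLParsedAsHTMLWarning(UserWarning):
        pass


def noisy_parse():
    """Hypothetical parse call that emits two different warnings."""
    warnings.warn("looks like XML", XMLParsedAsHTMLWarning)
    warnings.warn("something else", UserWarning)
    return "parsed"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only the XML-parsed-as-HTML category; the filter added
    # last is checked first, so other warnings still get through.
    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
    result = noisy_parse()

remaining = [w.category.__name__ for w in caught]
print(result, remaining)
```

Only the unrelated `UserWarning` is left in `remaining`, so anything else the parser might report stays visible.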
@mcepl can you send me this patch as well please?
And can this be documented somewhere please - I have the log right now as well, but it works fine.
From 8553619c08c02cbe9fa958812bf561d793c74ff5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mat=C4=9Bj=20Cepl?= <mcepl@cepl.eu>
Date: Sat, 10 Jun 2023 14:17:15 +0200
Subject: [PATCH] Ignore BS4 XML compatibility warning.
---
fanficfare/epubutils.py | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/fanficfare/epubutils.py b/fanficfare/epubutils.py
index 858f5640..a31bf498 100644
--- a/fanficfare/epubutils.py
+++ b/fanficfare/epubutils.py
@@ -10,6 +10,7 @@ logger = logging.getLogger(__name__)
 import os
 import re
+import warnings
 from collections import defaultdict
 from zipfile import ZipFile, ZIP_STORED, ZIP_DEFLATED
 from xml.dom.minidom import parseString
@@ -460,8 +461,10 @@ def make_soup(data):
     ## soup and re-soup because BS4/html5lib is more forgiving of
     ## incorrectly nested tags that way.
-    soup = bs4.BeautifulSoup(data,'html5lib')
-    soup = bs4.BeautifulSoup(unicode(soup),'html5lib')
+    with warnings.catch_warnings():
+        warnings.simplefilter("ignore")
+        soup = bs4.BeautifulSoup(data,'html5lib')
+        soup = bs4.BeautifulSoup(unicode(soup),'html5lib')
     for ns in soup.find_all('fff_hide_noscript'):
         ns.name = 'noscript'
--
2.44.0
(git am should be happy with this)
That's actually a lot more targeted than I thought.
Plus, I couldn't get bs4 to emit any other warning or error regardless of how I mangled the HTML source. So it doesn't look like we lose anything useful by suppressing warnings there.
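A quick way to convince yourself that the suppression in the patch stays local to the `with` block is a sketch like this. `emit` is a hypothetical stand-in for the two `BeautifulSoup` calls, and the outer recording block is only there to make the result observable.

```python
import warnings


def emit():
    """Hypothetical stand-in for a call that raises a warning."""
    warnings.warn("demo warning", UserWarning)


with warnings.catch_warnings(record=True) as outer:
    warnings.simplefilter("always")

    # Same pattern as the patch: blanket "ignore", but only in here.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        emit()  # suppressed

    emit()  # recorded: the previous filters are restored on exit

count = len(outer)
print(count)
```

Only the second call is recorded, so warnings raised elsewhere in the program are unaffected by the blanket filter.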
Thanks mcepl. I've included this patch and uploaded a new CLI version with it.
uploaded a new CLI version with it
where can I find it?
I neither saw one in releases nor pip 🤔
Just a nit (and I don’t ask you to change anything): if you used git am you would get proper attribution of the commit to me and the correct date of it.
I confess I don't collab with other devs often enough to know all the possibilities. But I'll try to remember if it comes up again.
Hello! I recently started getting the following bs4 warning, when using the CLI tool to update existing stories:
/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
  warnings.warn(
It does seem to be working, but figured I'd pass this along.