&nbsp; causes XML parser to bail (bookworm/library/models.py:829)

anselmorenato / threepress

Automatically exported from code.google.com/p/threepress

&nbsp; causes XML parser to bail (bookworm/library/models.py:829) #163

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. create an epub with non-breaking space character entities in the html
2. load the epub and try to view it

I think this happens because Bookworm uses the lxml.etree.XML routine to parse the
source data, and some character entities, like &nbsp;, don't work with it. Instead
of failing gracefully, the parser bails and says the XHTML is invalid.
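
A minimal sketch of the symptom, for anyone trying to reproduce it. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree.XML (an assumption, chosen so the snippet runs without lxml installed); the sample markup and the try_parse helper are hypothetical and not taken from Bookworm.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: XHTML body text that uses a named HTML entity.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>hello&nbsp;world</p></body></html>"
)


def try_parse(source):
    """Return (ok, detail): the root tag on success, the error text on failure."""
    try:
        return True, ET.fromstring(source).tag
    except ET.ParseError as exc:
        return False, str(exc)


# The named entity is not one of XML's five predefined entities,
# so a strict XML parser rejects the document.
ok_named, detail_named = try_parse(XHTML)

# The numeric character reference for the same character parses fine.
ok_numeric, detail_numeric = try_parse(XHTML.replace("&nbsp;", "&#160;"))

print(ok_named, detail_named)
print(ok_numeric, detail_numeric)
```

This would also explain a clean result from an HTML-aware validator: it reads the XHTML DTD, which declares nbsp, while a plain XML parse without the DTD knows only amp, lt, gt, quot and apos.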

This one had me stumped for a while. I double-checked my XHTML source with the
W3C validator and it came back clean (which never happens, mind you!)

A bit of googling found this: 
http://osdir.com/ml/python.lxml.devel/2008-07/msg00134.html

I changed line 829 of library/models.py to relax the XMLParser rules, and the epub
that failed before now works:

            xhtml = etree.XML(f, etree.XMLParser(resolve_entities=False))

Not sure of the repercussions of this, so this is just an idea rather than a request.

Please provide any additional information below.

Original issue reported on code.google.com by steven.m...@gmail.com on 27 Aug 2009 at 6:59

GoogleCodeExporter commented 9 years ago

Does the epub validate with epubcheck? It shouldn't.

Bookworm is pretty lenient though, and this general case should already be handled
(see line 839).

It's probably just throwing a different exception than the one we're already
catching. If you can find out the exact exception that the entity inclusion is
throwing, add it to the list in 839.
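
A rough sketch of the strict-then-lenient pattern described above, which also reports which exception class the strict parse raised. It is not Bookworm's code: it uses only the standard library (xml.etree.ElementTree instead of lxml, and an entity-rewriting pass instead of the BeautifulSoup fallback), and the names named_to_numeric and parse_leniently are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML's own predefined entities must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(source):
    """Rewrite known HTML named entities (e.g. &nbsp;) as numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(repl, source)


def parse_leniently(source):
    """Try a strict parse first; on failure, retry after rewriting entities.

    Returns (root, caught), where caught is the name of the exception class
    raised by the strict parse, or None if the strict parse succeeded.
    """
    try:
        return ET.fromstring(source), None
    except ET.ParseError as exc:
        # Record the exact exception type so it can be added to the
        # list of exceptions that trigger the fallback.
        return ET.fromstring(named_to_numeric(source)), type(exc).__name__


root, caught = parse_leniently("<p>a&nbsp;b &amp; c</p>")
clean_root, clean_caught = parse_leniently("<p>plain</p>")
print(repr(root.text), caught)
```

Printing type(exc).__name__ in the real code path would be one way to find the exception name to add at line 839.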

Original comment by liza31337@gmail.com on 27 Aug 2009 at 7:13

GoogleCodeExporter commented 9 years ago

> Does the epub validate with epubcheck? It shouldn't.
it probably doesn't - I removed the epubcheck check :o

> Bookworm is pretty lenient though, and this general case should already be handled
> (see line 839).
> It's probably just throwing a different exception than the one we're already
> catching. If you can find out the exact exception that the entity inclusion is
> throwing, add it to the list in 839.
Yes, it does get caught on 839, but BeautifulSoup bails when it hits &nbsp; as
well, although I didn't follow this through very far.

Will do some more testing and report back!

Original comment by steven.m...@gmail.com on 27 Aug 2009 at 7:35

GoogleCodeExporter commented 9 years ago

If we wanted a lot of entities:
http://www.oasis-open.org/docbook/specs/wd-docbook-xmlcharent-0.3.html

Original comment by abdela...@gmail.com on 27 Aug 2009 at 1:35

GoogleCodeExporter commented 9 years ago

BeautifulSoup should _absolutely_ be able to handle a real nbsp. Are you sure it's
not typoed?

>>> import lxml.html.soupparser as parser
>>> x = parser.fromstring('<html><body>hello &nbsp;</body></html>')
>>> import lxml.etree as etree
>>> etree.tostring(x)
'<html><body>hello  </body></html>'

Original comment by liza31337@gmail.com on 27 Aug 2009 at 3:56

GoogleCodeExporter commented 9 years ago

I think I must have got something else wrong on this, because running my tests
against the latest trunk shows no problems with the &nbsp;: lxml.etree.XML fails,
but BeautifulSoup deals with it just fine.

I think this can be closed - nothing to see, move along here :)

Original comment by steven.m...@gmail.com on 27 Aug 2009 at 11:38

Changed state: Invalid
