JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.

EPUB output is invalid: item IDs and IDREFs in OEBPS/content.opf contain invalid characters. #7

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Generate EPUB output, e.g., "python downloader.py 
http://www.fanfiction.net/s/5782108/1/ epub".
2. Test it with epubcheck or http://threepress.org/document/epub-validate/. One 
of the errors will be: "ERROR: 
Harry_Potter_and_the_Methods_of_Rationality.epub/OEBPS/content.opf(17): bad 
value for attribute "id" ".

What is the expected output? What do you see instead?
The EPUB output is invalid. While it may work on some devices, it may fail on 
others.

What version of the product are you using? On what operating system?
I'm using current tip (26:54fc9b30ced5) on Python 2.6.4 (Ubuntu 9.10), plus the 
patch from issue 6 (which doesn't affect content.opf generation).

Please provide any additional information below.
There are several different issues causing validation to fail. This is one of 
them. See the appendix of the current OPF draft:

http://www.idpf.org/doc_library/epub/OPF_2.0.1_draft.htm#AppendixA

The "item" element inside the "manifest" element has an "id" attribute, which 
is of XML attribute type "ID". Additionally, the "itemref" element (inside the 
"spine" element) has an "idref" attribute, of XML attribute type "IDREF". See 
the XML 1.0 spec, section 3.3.1, and the description of the 'Name' production:

http://www.w3.org/TR/REC-xml/#sec-attribute-types
http://www.w3.org/TR/REC-xml/#NT-Name
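
As a rough illustration of the rule, the check below scans an OPF document for "id" and "idref" values that do not match the Name production. It is a minimal sketch, not part of the patch: the helper names (is_valid_xml_name, find_bad_ids) are hypothetical, and the regular expression covers only the ASCII subset of Name, so it rejects some non-ASCII names that the spec allows.

```python
import re
import xml.etree.ElementTree as ET

# ASSUMPTION: ASCII-only subset of the XML 1.0 Name production.
# First character: letter, '_' or ':'; later characters may also be
# digits, '-' or '.'. The full production admits many non-ASCII characters.
NAME_RE = re.compile(r'^[A-Za-z_:][A-Za-z0-9_:.\-]*\Z')

# Namespace used by OPF 2.0 package documents.
OPF_NS = '{http://www.idpf.org/2007/opf}'


def is_valid_xml_name(value):
    """Return True if value matches the ASCII subset of Name."""
    return NAME_RE.match(value) is not None


def find_bad_ids(opf_text):
    """Return invalid item ids, then invalid itemref idrefs, in document order."""
    root = ET.fromstring(opf_text)
    bad = []
    for item in root.iter(OPF_NS + 'item'):
        value = item.get('id', '')
        if not is_valid_xml_name(value):
            bad.append(value)
    for itemref in root.iter(OPF_NS + 'itemref'):
        value = itemref.get('idref', '')
        if not is_valid_xml_name(value):
            bad.append(value)
    return bad
```

Running find_bad_ids on the text of OEBPS/content.opf should list the offending values; it is no substitute for epubcheck, which validates far more than this.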

In short, IDs and IDREFs cannot contain the '=' character, and cannot start 
with a number. The attached patch ensures that all IDs and IDREFs begin with 
'_', and ensures that only valid characters are used for the chapterId values. 
See also Wikipedia's note on encoding Base64 for use in XML. (The restriction 
on "Name" tokens described there is unnecessary here, as the first character of 
each token is always '_', which is always valid.)

http://en.wikipedia.org/wiki/Base64#XML
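
To make the idea concrete, here is a minimal sketch of one way to build such an ID. It is not the attached patch: the function name make_xml_id is hypothetical, and it assumes the URL-safe Base64 alphabet with the '=' padding stripped, which may differ from the encoding the patch uses.

```python
import base64


def make_xml_id(raw):
    """Return a token usable as an XML ID, derived from the text in raw.

    ASSUMPTION: URL-safe Base64 is acceptable here. Its alphabet replaces
    '+' and '/' with '-' and '_', both of which are valid Name characters.
    """
    encoded = base64.urlsafe_b64encode(raw.encode('utf-8')).decode('ascii')
    # '=' padding is not a valid Name character, so drop it. The unpadded
    # form is still unique per input, because the padding is implied by
    # the length of the encoded string.
    # The leading '_' guarantees a valid first character even when the
    # encoded text starts with a digit or '-'.
    return '_' + encoded.rstrip('=')
```

The same function would have to be used for both the manifest "id" and the matching spine "idref", so that the two values stay identical.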

Additionally, I've changed constants.py to make it easier to read while 
changing things here.

Original issue reported on code.google.com by adam.buc...@gmail.com on 16 Sep 2010 at 3:52


GoogleCodeExporter commented 9 years ago
Adam, I apologize that I didn't realize you'd put patches here for a lot of 
issues until after I'd already coded my own fixes.

I went a different way with this, and just stopped using the problematic base64 
identifiers.
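
The comment does not say what replaced the base64 identifiers. One possibility, shown purely as an illustration and not taken from the project's code, is to number the chapters sequentially behind a fixed alphabetic prefix. The function name chapter_ids and the "chapter" prefix are assumptions.

```python
def chapter_ids(chapter_count, prefix='chapter'):
    """Return sequential IDs such as 'chapter0001' for each chapter.

    ASSUMPTION: the prefix starts with a letter or '_', so every ID is a
    valid XML Name regardless of the chapter number that follows it.
    """
    return ['%s%04d' % (prefix, n) for n in range(1, chapter_count + 1)]
```

IDs built this way contain only letters and digits, so neither the '=' problem nor the leading-digit problem can arise.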

Original comment by retiefj...@gmail.com on 16 Oct 2010 at 1:59