flycutter-zfz / rfc2kindle

Automatically exported from code.google.com/p/rfc2kindle

(Patch/fix included) Special characters in RFC text not converted to HTML, causing kindlegen to fail #1

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run rfc2kindle/rfc2mobi on RFC 4462 or any other RFC containing "&", "<",
or ">" (e.g. "./rfc2mobi rfc4462").
2. rfc2mobi successfully pulls rfc4462.txt and generates ./rfc4462/auths.png 
and ./rfc4462/rfc4462.html.
3. rfc2mobi calls kindlegen. kindlegen reports a parse warning and a parse 
error (detailed below).
4. No .mobi created due to parse errors.

What is the expected output? What do you see instead?
Expected output: a .mobi file.
:Begin actual output:
[bar@testvm1 foo]$ ./rfc2mobi rfc4462
link = http://www.ietf.org/rfc/rfc4462.txt, doc = rfc4462

**************************************************
* Amazon.com kindlegen(Linux) V2.3 build 36043   *
* A command line e-book compiler                 *
* Copyright Amazon.com 2011                      *
**************************************************

Info(prcgen):I1047: Added metadata dc:Title        "RFC4462 - Generic Security 
Service Application Program Interface (GSS-API) Authentication and Key Exchange 
for the Secure Shell (SSH) Protocol"
Info(prcgen):I1002: Parsing files  0000002
Error(parsing):E3001: Requested XML node does not exist in memory.
Warning(inputpreprocessor):W29001: unescaped & which should be written as &amp;
      in file: /tmp/foo/rfc4462/rfc4462.html     line: 0001644
Info(cssparser):I10005: CSS file not found "/tmp/foo/css/rfc.css"
Error(prcgen):E21018: Enhanced Mobi building failure, while parsing content in 
the file. Content: <2. C calls GSS_Init_> in file: 
/tmp/foo/rfc4462/rfc4462.html line: 175
Successfully converted rfc4462 into rfc4462 directory.
:End actual output:
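
To find the offending lines in the generated HTML before handing it to kindlegen, a small scan along the following lines may help. This is only a sketch: the function name find_bare_ampersands and the regex are assumptions for illustration, not part of rfc2kindle or kindlegen, and it only looks for the "&" case that warning W29001 complains about.

```python
import re

# An '&' that does not start a named (&amp;) or numeric (&#38; / &#x26;)
# character reference.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)')


def find_bare_ampersands(lines):
    """Return (line_number, text) pairs for lines containing a bare '&'."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        if BARE_AMP.search(line):
            hits.append((lineno, line.rstrip('\n')))
    return hits


# In-memory demo; the sample lines are made up.
demo = ['a &amp; b\n', 'x & y\n', '&#38; &#x26;\n', 'AT&T\n']
for lineno, text in find_bare_ampersands(demo):
    print('%d: %s' % (lineno, text))
```

To check a real file, pass an open file object instead of the demo list, e.g. the rfc4462/rfc4462.html reported above.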

What version of the product are you using? On what operating system?
OS: Fedora Linux 16, x86_64
rfc2kindle: latest available via svn as of 12 Feb. 2012
kindlegen: 2.3 build 36043
python: 2.7.2

Please provide any additional information below.
Inserting the following code in html.py above all the regex matching logic 
inside def writeContent(self, line) resolves the issue:

# Sanitize &, <, and >
line=re.sub('&','&amp;',line)
line=re.sub('<','&lt;',line)
line=re.sub('>','&gt;',line)
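
Note that the order matters: "&" has to be escaped first, otherwise the "&" introduced by "&lt;" and "&gt;" would itself be escaped a second time. A minimal standalone sketch of the same idea follows, using str.replace since the patterns are plain literals. The function name escape_html_text and the sample line are made up for illustration and are not part of rfc2kindle.

```python
def escape_html_text(line):
    """Escape the three characters that are special in HTML text."""
    # '&' goes first so the entities added below are not escaped again.
    line = line.replace('&', '&amp;')
    line = line.replace('<', '&lt;')
    line = line.replace('>', '&gt;')
    return line


def escape_wrong_order(line):
    """Same replacements with '&' last, shown only to illustrate the bug."""
    line = line.replace('<', '&lt;')
    line = line.replace('>', '&gt;')
    line = line.replace('&', '&amp;')
    return line


sample = 'a < b & c > d'  # hypothetical input line
print(escape_html_text(sample))    # a &lt; b &amp; c &gt; d
print(escape_wrong_order(sample))  # a &amp;lt; b &amp; c &amp;gt; d
```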

Original issue reported on code.google.com by proverbs...@gmail.com on 13 Feb 2012 at 5:48

GoogleCodeExporter commented 8 years ago
The code I provided does not account for text inside <blockquote> blocks, as in 
the case of RFC 4463. Here's a cleaner fix (worked for me):

Create a sanitizing function containing the code I suggested earlier, and call 
it when needed. E.g., append this to html.py:

def sanitizeSpecChars(line):
    # Sanitize &, <, and >
    line=re.sub('&','&amp;',line)
    line=re.sub('<','&lt;',line)
    line=re.sub('>','&gt;',line)
    return line

Then call it inside the "for i in outputlines" loop in outputTextBlock(self), 
right before the call to self.output.write(), e.g.:
    for i in outputlines:
        i=sanitizeSpecChars(i) # here
        self.output.write(...

And call it right before the regex matching logic in writeContent(self, line), 
e.g.:
        if isRFCPageBreaker(line):
            return getattr(self, "writeContent")
        line=sanitizeSpecChars(line) # here
        re.match(r'^d+\.?\s.*',...
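
The placement can be sketched as below. This is a simplified stand-in, not the real html.py: the classes ListOutput and BlockWriter and the <br/> markup are hypothetical, and only sanitizeSpecChars, outputlines, outputTextBlock and self.output.write come from the description above. The point is that the raw text should be escaped exactly once, before any markup is wrapped around it. Escaping a line that was already escaped turns "&lt;" into "&amp;lt;".

```python
import re


def sanitizeSpecChars(line):
    # Sanitize &, <, and > ('&' first, so new entities are left alone)
    line = re.sub('&', '&amp;', line)
    line = re.sub('<', '&lt;', line)
    line = re.sub('>', '&gt;', line)
    return line


class ListOutput(object):
    """Hypothetical file-like object that collects written chunks."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)


class BlockWriter(object):
    """Hypothetical stand-in for the writer class in html.py."""

    def __init__(self, output):
        self.output = output
        self.outputlines = []

    def outputTextBlock(self):
        self.output.write('<blockquote>\n')
        for i in self.outputlines:
            i = sanitizeSpecChars(i)          # escape the raw text first
            self.output.write(i + '<br/>\n')  # then add the markup
        self.output.write('</blockquote>\n')
        self.outputlines = []


out = ListOutput()
writer = BlockWriter(out)
writer.outputlines = ['C -> S: <token> & flags']  # made-up RFC-style line
writer.outputTextBlock()
result = ''.join(out.chunks)
print(result)
```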

It would probably make sense to sanitize the ToC and Abstract too, but I'm too 
tired to do that right now.

Original comment by proverbs...@gmail.com on 14 Feb 2012 at 7:26