First color in form #00ff00 removed after a link

GoogleCodeExporter commented 8 years ago

The first color information after a link is removed.

E.g., this in orig:
<table>
<tr>
<td bgcolor="#00ff00">row 1, col1</td>
<td bgcolor="#00ff00">row 1, col2 <a href="P1.htm"> here a link</a></td>
<td bgcolor="#00ff00">row 1, col3</td>
</tr>
</table> 

Becomes this in work:
<table>
<tr>
<td bgcolor="#00ff00">row 1, col1</td>
<td bgcolor="#00ff00">row 1, col2 <a href="temp0001.html"> here a link</a></td>
<td bgcolor="">row 1, col3</td>
</tr>
</table>

Original issue reported on code.google.com by reto.kn...@gmail.com on 13 Nov 2011 at 4:40

GoogleCodeExporter commented 8 years ago

I think I found the problem: this one is too greedy!
# Replace links of the form "somefile.html#894" with "somefile0206.html"
# The following will match anchors like '<a href="temp0206.html#894"' and will
# store the 'temp0206.html' in backreference 1.
# The replace string will then replace it with '<a href="temp0206.html"', i.e.
# it will take away the '#894' part.
# This is because the numbers after the '#' are often wrong or non-existent. It
# is better to link to an existing chapter than to a non-existent part of an
# existing chapter.
page = re.sub('(?i)<a href="([^#]*)#[^"]*"', '<a href="\\1"', page)

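because ([^#]*) matches everything up to the next #, even if that # is outside the link! In the example above it runs past the closing quote of the href and stops only at the # in the next bgcolor value, so the color is consumed as if it were the anchor.
Stopping the first group at a double quote as well seems to work better:

page = re.sub('(?i)<a href="([^#"]*)#[^"]*"', '<a href="\\1"', page)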
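The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the actual chm2pdf code: the `html` fragment is taken from the example in the report, and the names `OLD`, `NEW`, `old_result` and `new_result` are made up for illustration. It assumes a character class `[^#"]` that excludes only `#` and the double quote; inside a character class, `(`, `|` and `)` are literal characters, so they are not needed there.

```python
import re

# Two cells from the reported table, as they look in the work copy
# before the anchor-stripping step runs.
html = (
    '<td bgcolor="#00ff00">row 1, col2 <a href="temp0001.html"> here a link</a></td>\n'
    '<td bgcolor="#00ff00">row 1, col3</td>'
)

# Original pattern: the first group may run past the closing quote of the href.
OLD = r'(?i)<a href="([^#]*)#[^"]*"'
# Assumed corrected pattern: the first group stops at '#' or '"'.
NEW = r'(?i)<a href="([^#"]*)#[^"]*"'

old_result = re.sub(OLD, r'<a href="\1"', html)
new_result = re.sub(NEW, r'<a href="\1"', html)

print(old_result)  # the bgcolor value of the following cell is lost
print(new_result)  # the fragment is left untouched
```

With the old pattern the match starts at the link and ends inside the next cell, which is why only the first color after a link is affected.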

Original comment by reto.kn...@gmail.com on 13 Nov 2011 at 5:39

GoogleCodeExporter commented 8 years ago

This is the first time I have tried to generate a patch... I hope it is correct!

Changed one regular expression:
- added (?i) to make regex case insensitive
- search for #links stops at # and "
- * changed to + to ignore internal links "#..."
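The attached diff is not reproduced in this thread, so the pattern below is only an assumption reconstructed from the three points above; the name `strip_anchor_fragments` is hypothetical as well. It sketches how the three changes could fit together and what the `+` changes for purely internal links.

```python
import re

# Assumed pattern reconstructed from the change list; the attached diff may differ.
# - re.IGNORECASE corresponds to (?i)
# - [^#"] stops the file name at '#' and at '"'
# - '+' requires a non-empty file name, so href="#..." is not matched
ANCHOR_LINK = re.compile(r'<a href="([^#"]+)#[^"]*"', re.IGNORECASE)


def strip_anchor_fragments(page):
    """Drop the '#fragment' part from links that point into another file."""
    return ANCHOR_LINK.sub(r'<a href="\1"', page)


print(strip_anchor_fragments('<a href="temp0206.html#894">chapter</a>'))
print(strip_anchor_fragments('<a href="#top">back to top</a>'))
```

With `*` instead of `+`, the second link would be rewritten to `<a href="">`; with `+` it is left as it is.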

Original comment by reto.kn...@gmail.com on 20 Nov 2011 at 6:44

Attachments:

chm2pdf_color_removed.diff

RaptDept / chm2pdf

First color in form #00ff00 removed after a link #37