Open GoogleCodeExporter opened 8 years ago
I think I found the problem: this one is too greedy!
# Replace links of the form "temp0206.html#894" with "temp0206.html".
# The following matches anchors like '<a href="temp0206.html#894"' and stores
# 'temp0206.html' in backreference 1.
# The replacement string then rewrites it as '<a href="temp0206.html"', i.e. it
# drops the '#894' part.
# This is because the numbers after the '#' are often wrong or non-existent. It
# is better to link to an existing chapter than to a non-existent part of an
# existing chapter.
page = re.sub('(?i)<a href="([^#]*)#[^"]*"', '<a href="\\1"', page)
because ([^#]*) matches everything up to the next #, even if that # is outside the link!
This seems to work better:
page = re.sub('(?i)<a href="([^(#|")]*)#[^"]*"', '<a href="\\1"', page)
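To illustrate the difference, here is a small sketch run on a made-up page fragment. The names GREEDY, BOUNDED and strip_fragment are only for this example and are not from the project. Note that inside a character class the characters (, | and ) are literals rather than grouping or alternation, so the sketch uses the plainer class [^#"] as an assumption about what was intended; it stops at # and " just the same, without also excluding parentheses and pipes from file names.

```python
import re

# Original pattern: the captured part may run past the closing quote.
GREEDY = r'(?i)<a href="([^#]*)#[^"]*"'
# Bounded pattern (assumed intent): the captured part stops at '#' or '"'.
BOUNDED = r'(?i)<a href="([^#"]*)#[^"]*"'


def strip_fragment(pattern, page):
    """Drop the '#...' fragment from every link that the pattern matches."""
    return re.sub(pattern, r'<a href="\1"', page)


# Hypothetical page: a link without a fragment, a stray '#' in the text,
# then a link with a fragment.
page = '<a href="a.html">A</a> <p>#1</p> <a href="b.html#894">B</a>'
expected = '<a href="a.html">A</a> <p>#1</p> <a href="b.html">B</a>'

greedy_result = strip_fragment(GREEDY, page)
bounded_result = strip_fragment(BOUNDED, page)

print(greedy_result)   # the first link swallows text up to the stray '#'
print(bounded_result)  # only the second link is rewritten
```

With the original pattern the match starts at the first link and runs through the stray # into the second link, so the markup between them gets mangled. The bounded pattern cannot cross the closing quote of the href, so the first link is left alone.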
Original comment by reto.kn...@gmail.com on 13 Nov 2011 at 5:39
This is the first time I have tried to generate a patch... I hope it is correct!
Changed one regular expression:
- added (?i) to make regex case insensitive
- the search for "#" links now stops at either # or "
- changed * to + so that purely internal links ("#...") are ignored
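The patch itself is not shown in this thread, so the following is only a sketch of what the three changes could look like together. The pattern PLUS and the helper strip_fragment are assumptions made for illustration; STAR is the same pattern with * for comparison.

```python
import re

# Assumed form of the patched pattern: case-insensitive, stops at '#' or '"',
# and requires at least one character before the '#'.
PLUS = r'(?i)<a href="([^#"]+)#[^"]*"'
# Same pattern with '*', which also matches an empty file name.
STAR = r'(?i)<a href="([^#"]*)#[^"]*"'


def strip_fragment(pattern, page):
    """Drop the '#...' fragment from every link that the pattern matches."""
    return re.sub(pattern, r'<a href="\1"', page)


internal = '<a href="#top">Top</a>'
upper = '<A HREF="ch2.html#5">Chapter 2</A>'

# With '+', a link that points inside the same page is left untouched.
internal_plus = strip_fragment(PLUS, internal)
# With '*', the same link loses its target and ends up with an empty href.
internal_star = strip_fragment(STAR, internal)
# (?i) lets the pattern match upper-case tags; the replacement text is lower-case.
upper_plus = strip_fragment(PLUS, upper)

print(internal_plus)
print(internal_star)
print(upper_plus)
```

The + matters because an href that consists only of a fragment has nothing left once the fragment is removed, so skipping such links keeps them working.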
Original comment by reto.kn...@gmail.com
on 20 Nov 2011 at 6:44
Original issue reported on code.google.com by reto.kn...@gmail.com on 13 Nov 2011 at 4:40