Open GoogleCodeExporter opened 9 years ago
Thanks for the patch. I appreciate all of your contributions so far.
I feel as though I'd still like to create snippets as before (with a
50-character cutoff followed by ellipses). So, I think I'm going to just do a
conversion to UTF-8 and then truncate the unicode string instead. It also
appears as though the original code does not properly ignore HTML elements
during the snippet creation.
I'll have a fix for this shortly.
Original comment by jlu...@gmail.com
on 15 Jan 2009 at 5:22
IMHO, splitting a word in the middle is ugly and may change the meaning (e.g.
"food" is truncated to "foo" and "Barack" is truncated to "Bar").
But simple splitting is more general, since some languages do not use
whitespace at all.
Original comment by mathemonkey
on 15 Jan 2009 at 5:39
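A minimal Python sketch of the trade-off described in the comment above: cut at the last space when one exists inside the limit, and fall back to a hard cut for text without spaces. The name `truncate_at_word` and the default `limit` of 50 are assumptions for illustration, not code from the project.

```python
def truncate_at_word(text: str, limit: int = 50) -> str:
    """Truncate to at most `limit` characters, preferring a word boundary."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut lands inside a word, back up to the last space, if any.
    if not text[limit].isspace():
        head = cut.rpartition(" ")[0]
        if head.strip():
            cut = head
        # Otherwise there is no usable space (e.g. text in a language
        # written without spaces), so keep the hard cut.
    return cut.rstrip() + "..."


print(truncate_at_word("I like food", limit=10))  # backs up before "food"
print(truncate_at_word("abcdefghij", limit=5))    # no spaces: hard cut
```

Only the ASCII space is treated as a boundary here; a fuller version might also consider other whitespace or punctuation.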
I've checked in r64, which does a conversion to UTF-8 prior to creating the
snippet so as not to truncate in the middle of a multi-byte character.
I also believe that r63 handles the problem with the subject being an int/long
instead of a string.
Please let me know if these new revisions fix your problems.
Original comment by jlu...@gmail.com
on 15 Jan 2009 at 7:57
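For illustration, here is a Python 3 sketch of why the order of operations matters. It assumes the "conversion" means decoding UTF-8 bytes into a unicode string before slicing; the function name `make_snippet` and its parameters are hypothetical, and this is not the code from r64.

```python
def make_snippet(raw: bytes, limit: int = 50) -> str:
    """Decode UTF-8 first, then cut on character (not byte) boundaries."""
    text = raw.decode("utf-8")
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


raw = ("é" * 60).encode("utf-8")  # 60 characters, 120 bytes

print(make_snippet(raw))  # 50 whole characters followed by "..."

# Slicing the bytes instead can stop halfway through a character:
try:
    raw[:51].decode("utf-8")
except UnicodeDecodeError as exc:
    print("byte-level cut failed:", exc.reason)
```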
> It also appears as though the original code does not properly ignore HTML
> elements during the snippet creation.
This bug is still there: <a href="foo/bar/baz"> does not match r'</?[^>/]+/?>'.
The regexp r'<[^>]+>' looks like a better variant. Using r'</?[^>]+/?>' is
unnecessary, as the two regexps match the same set of strings and the first
one is simpler.
Original comment by mathemonkey
on 15 Jan 2009 at 8:11
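The difference between the patterns quoted above can be checked with a short Python sketch. The sample string and the variable names are made up for illustration; only the three regexps come from the comment.

```python
import re

OLD_TAG = re.compile(r'</?[^>/]+/?>')   # excludes "/" inside the tag body
NEW_TAG = re.compile(r'<[^>]+>')        # the simpler suggested pattern
VERBOSE_TAG = re.compile(r'</?[^>]+/?>')

sample = 'See <a href="foo/bar/baz">this</a> and <br/> that'

# The old pattern cannot match a tag whose attribute value contains "/",
# so the opening <a ...> tag survives the substitution.
print(OLD_TAG.sub('', sample))
print(NEW_TAG.sub('', sample))

# The simpler and the more verbose pattern agree on these candidates.
for candidate in ['<a>', '</a>', '<br/>', '</>', '<>', '<a href="x/y">']:
    same = bool(NEW_TAG.fullmatch(candidate)) == bool(VERBOSE_TAG.fullmatch(candidate))
    print(candidate, same)
```

Neither pattern is a real HTML parser: a literal ">" inside an attribute value or a comment would still confuse them.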
You're absolutely correct. I've fixed the regexp per your comment. It is r65.
Original comment by jlu...@gmail.com
on 15 Jan 2009 at 8:16
By the way, I've found one more argument against splitting the string at a
character boundary: the boundary may fall in the middle of an HTML entity
(e.g. &quot; -> &q...).
Yes, I ran into this issue while exporting my small LiveJournal; it's not
imaginary.
There is also a minor LJ-specific issue with snippets: `<lj user="foobar">'
should be translated to `foobar', but right now it translates to `'.
Yes, I've seen this issue too.
Original comment by mathemonkey
on 15 Jan 2009 at 8:34
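One possible way to address both points in the comment above, sketched in Python 3: replace `<lj user="...">` with the user name, strip the remaining tags, and unescape entities before counting characters, so a cut can never land inside `&quot;`. The names `LJ_USER`, `TAG` and `make_snippet` are hypothetical, and the exact form of the LJ tag is assumed from the single example in the thread.

```python
import html
import re

LJ_USER = re.compile(r'<lj\s+user="([^"]*)"\s*/?>')  # assumed tag shape
TAG = re.compile(r'<[^>]+>')


def make_snippet(markup: str, limit: int = 50) -> str:
    """Build a plain-text snippet from HTML-ish markup."""
    text = LJ_USER.sub(r'\1', markup)  # <lj user="foobar"> -> foobar
    text = TAG.sub('', text)           # drop all other tags
    text = html.unescape(text)         # &quot; -> " before truncating
    if len(text) > limit:
        text = text[:limit] + "..."
    return text


print(make_snippet('<b>Hi</b> <lj user="foobar"> said &quot;ok&quot;'))
print(make_snippet('&quot;' * 60))  # cut counts characters, not entity text
```

The result is plain text; if the snippet is written back into HTML, it would need to be escaped again (for example with `html.escape`).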
Original issue reported on code.google.com by
mathemonkey
on 14 Jan 2009 at 9:02