lj2b breaks unicode string while generating snippet

Automatically exported from code.google.com/p/google-blog-converters-appengine

Apache License 2.0

lj2b breaks unicode string while generating snippet #14

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

lj2b may cut the string in the middle of a multi-byte UTF-8 character, which
breaks the string->unicode conversion when the feed is stringified.

One possible solution is to split the string at word boundaries and take the
first N words; that is what the attached patch implements.
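
A minimal sketch of the failure and of the word-boundary idea, written for Python 3 (the original lj2b code targeted Python 2). The helper name `snippet_by_words` and the sample text are made up for illustration; this is not the code from unicode.patch.

```python
def snippet_by_words(text, max_words=5):
    """Keep the first max_words whitespace-separated words of a str."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + "..."


# Each Cyrillic letter below occupies two bytes in UTF-8.
raw = "Привет, мир".encode("utf-8")

# Slicing the byte string at an odd offset lands inside a character,
# so decoding the slice fails.
try:
    raw[:5].decode("utf-8")
    byte_cut_decodes = True
except UnicodeDecodeError:
    byte_cut_decodes = False

# Splitting on words after decoding never cuts inside a character.
word_snippet = snippet_by_words(raw.decode("utf-8"), max_words=1)
```

Splitting on whitespace only helps for languages that separate words with spaces, which is the trade-off discussed further down the thread.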

Also, LJ returns the following subject for http://darkk.livejournal.com/31547.html:
<member>
  <name>subject</name>
  <value>
    <int>54308428790203478762340052723346983453487023489987231275412390872348475</int>
  </value>
</member>

so I had to add the second hunk (yes, it's ugly and totally hackish).
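
For illustration only, one way to cope with a subject that arrives as a number is to coerce it to text before any string handling. This is a Python 3 sketch with a hypothetical helper name, `subject_as_text`; it is not the second hunk of the patch, whose contents are not shown here.

```python
def subject_as_text(subject):
    """Return the subject as a str, whatever type the XML-RPC layer produced."""
    if isinstance(subject, bytes):
        # Assumption: byte strings from the server are UTF-8 encoded.
        return subject.decode("utf-8")
    # Covers int values such as the one above, and plain str.
    return str(subject)
```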

Original issue reported on code.google.com by mathemonkey on 14 Jan 2009 at 9:02

Attachments:

unicode.patch

GoogleCodeExporter commented 8 years ago

Thanks for the patch.  I appreciate all of your contributions so far.

I feel as though I'd still like to create snippets as before (with a 50
character cutoff followed by ellipses).  So, I think I'm going to just do a
conversion to UTF-8 and then truncate the unicode string instead.  It also
appears as though the original code does not properly ignore HTML elements
during the snippet creation.

I'll have a fix for this shortly.

Original comment by jlu...@gmail.com on 15 Jan 2009 at 5:22

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

IMHO, splitting a word in the middle is ugly and may change its meaning (e.g.
"food" is truncated to "foo" and "Barack" is truncated to "Bar").
But simple character-based splitting is more generic, as there are languages
that do not use whitespace at all.

Original comment by mathemonkey on 15 Jan 2009 at 5:39

GoogleCodeExporter commented 8 years ago

I've checked in r64, which does a conversion to UTF-8 prior to creating the
snippet so as not to truncate in the middle of a multi-byte character.
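
A rough Python 3 sketch of that approach, assuming the "conversion" means decoding the UTF-8 bytes into a unicode string before cutting. The function name `make_snippet` is hypothetical and this is not the code from r64; only the 50-character limit and the trailing ellipsis come from the thread.

```python
def make_snippet(raw_bytes, limit=50):
    """Decode first, then truncate by characters rather than bytes."""
    text = raw_bytes.decode("utf-8")
    if len(text) <= limit:
        return text
    # The slice counts code points, so no multi-byte character is split.
    return text[:limit] + "..."
```

Because the cut happens on the decoded string, the result can always be encoded back to UTF-8 without errors.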

I also believe that r63 handles the problem with the subject being an int/long
instead of a string.

Please let me know if these new revisions fix your problems.

Original comment by jlu...@gmail.com on 15 Jan 2009 at 7:57

Changed state: Started

GoogleCodeExporter commented 8 years ago

> It also appears as though the original code does not properly ignore HTML
> elements during the snippet creation.

This bug is still there: <a href="foo/bar/baz"> does not match r'</?[^>/]+/?>'.
The regexp r'<[^>]+>' looks like a better variant. Using r'</?[^>]+/?>' is
unnecessary, as it matches exactly the same set of strings as r'<[^>]+>', and
r'<[^>]+>' is simpler.
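
The difference between the two patterns can be checked with a small Python sketch. The variable names and the sample string are made up for illustration; the patterns are the ones quoted above.

```python
import re

OLD_TAG = re.compile(r'</?[^>/]+/?>')   # fails on tags with "/" in an attribute
NEW_TAG = re.compile(r'<[^>]+>')        # matches any "<...>" run

sample = 'See <a href="foo/bar/baz">this</a> post'

# The old pattern removes only the closing tag; the opening tag survives
# because the slashes inside href stop the character class.
old_result = OLD_TAG.sub('', sample)

# The new pattern removes both tags.
new_result = NEW_TAG.sub('', sample)
```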

Original comment by mathemonkey on 15 Jan 2009 at 8:11

GoogleCodeExporter commented 8 years ago

You're absolutely correct.  I've fixed the regexp per your comment.  The fix is in r65.

Original comment by jlu...@gmail.com on 15 Jan 2009 at 8:16

GoogleCodeExporter commented 8 years ago

By the way, I've found one more argument against splitting the string at a
character boundary: the boundary may fall in the middle of an HTML entity
(e.g. &quot; -> &q...).
Yes, I ran into this issue while exporting my small livejournal; it's not
imaginary.
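
One possible way around the entity problem, sketched in Python 3 with the standard `html` module: unescape the entities, truncate the plain text, then escape the result again. The helper name `snippet_entity_safe` is hypothetical, the sketch assumes tags have already been stripped, and it is not what lj2b does.

```python
import html


def snippet_entity_safe(escaped_text, limit=50):
    """Truncate on decoded characters so no entity is cut in half."""
    plain = html.unescape(escaped_text)      # &quot; becomes "
    if len(plain) > limit:
        plain = plain[:limit] + "..."
    return html.escape(plain)                # " becomes &quot; again


escaped = 'say &quot;hello&quot; now'
naive = escaped[:6]                          # cuts inside the entity
safe = snippet_entity_safe(escaped, limit=6)
```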

And there is a minor LJ-specific issue with snippets: `<lj user="foobar">'
should be translated to `foobar', but right now it is translated to `'.
Yes, I've seen this issue too.
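
A sketch of how the `<lj user=...>` case might be handled: replace the tag with the user name before the generic tag stripping runs. The names `LJ_USER`, `TAG` and `strip_markup` are hypothetical, and the pattern only covers the double-quoted form shown above.

```python
import re

# Assumption: the attribute is always written as user="name".
LJ_USER = re.compile(r'<lj\s+user="([^"]+)"\s*/?>')
TAG = re.compile(r'<[^>]+>')


def strip_markup(text):
    """Turn <lj user="x"> into x, then drop all remaining tags."""
    text = LJ_USER.sub(r'\1', text)
    return TAG.sub('', text)
```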

Original comment by mathemonkey on 15 Jan 2009 at 8:34