MTG / freesound

The Freesound website
https://freesound.org
GNU Affero General Public License v3.0

replace utf-8 control characters #202

Closed ghost closed 11 years ago

ghost commented 11 years ago

Several text fields contain control characters, which have no printable representation. They are the C1 control characters in the decimal range 128-159 (U+0080 to U+009F; see e.g. http://stuffofinterest.com/misc/utf8.php?s=128), and using HTML entities for them is not recommended. They can be replaced directly in the database. The code below counts the affected forum posts, thread titles, sound descriptions, filenames, comments, pack names and pack descriptions.

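# Python 2, run from a Django shell; assumes Post, Thread, Sound, Comment and Pack are already imported.
from django.db.models import Q

# OR together one "contains" filter per control character (code points 128-159).
qs = [Q(body__contains=unichr(x)) for x in range(128,160)]
f = reduce(lambda x,y: x|y, qs)
Post.objects.filter(f).count()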

qst = [Q(title__contains=unichr(x)) for x in range(128,160)]
ft = reduce(lambda x,y: x|y, qst)
Thread.objects.filter(ft).count()

qss = [Q(description__contains=unichr(x)) for x in range(128,160)]
fs = reduce(lambda x,y: x|y, qss)
Sound.objects.filter(fs).count()

qss2 = [Q(original_filename__contains=unichr(x)) for x in range(128,160)]
fs2 = reduce(lambda x,y: x|y, qss2)
Sound.objects.filter(fs2).count()

qsc = [Q(comment__contains=unichr(x)) for x in range(128,160)]
fc = reduce(lambda x,y: x|y, qsc)
Comment.objects.filter(fc).count()

qsp = [Q(name__contains=unichr(x)) for x in range(128,160)]
fp = reduce(lambda x,y: x|y, qsp)
Pack.objects.filter(fp).count()

qsp2 = [Q(description__contains=unichr(x)) for x in range(128,160)]
fp2 = reduce(lambda x,y: x|y, qsp2)
Pack.objects.filter(fp2).count()
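
The same range check can be tried on plain strings before touching the database. This is only a minimal sketch in Python 3 (the queries above are Python 2); `C1_CONTROLS` and `find_c1_controls` are hypothetical names used for illustration, not part of the Freesound code.

```python
import re

# C1 control characters: U+0080 to U+009F (decimal 128-159).
C1_CONTROLS = re.compile('[\u0080-\u009f]')


def find_c1_controls(text):
    """Return the sorted, de-duplicated C1 control code points found in text."""
    return sorted({ord(ch) for ch in C1_CONTROLS.findall(text)})


sample = 'some \u0093silly\u0094 chars here'
print([hex(cp) for cp in find_c1_controls(sample)])  # ['0x93', '0x94']
```

A string with no characters in that range returns an empty list, so the function can double as a quick check after a cleanup pass.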

ghost commented 11 years ago

A Python snippet showing how to remap these characters to similar printable ones. It has been tested against the FS1 post http://www.freesound.org/forum/viewtopic.php?t=6570

MAP_BAD_CHARS_TO_SIMILAR_ONES = {
    ord(u'\u0085'): u" ", # really should map to '...'
    ord(u'\u0092'): u"'",
    ord(u'\u0093'): u'"',
    ord(u'\u0094'): u'"',
}
silly_cp1252 = u'some \u0093silly\u0094 chars here'
print silly_cp1252
print silly_cp1252.translate(MAP_BAD_CHARS_TO_SIMILAR_ONES)

ghost commented 11 years ago

An explanation of how those chars were generated:

http://en.wikipedia.org/wiki/Windows-1252

ghost commented 11 years ago

From http://en.wikipedia.org/wiki/Character_encodings_in_HTML :

"Using numeric references that refer to permanently undefined characters and control characters is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so &#153;, for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding."
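
The browser behaviour described in that quote can be imitated when cleaning up text: treat each code point in 80–9F as a Windows-1252 byte and decode it. The following is a rough sketch in Python 3, assuming Python's built-in `cp1252` codec; `c1_to_cp1252_map` and `CP1252_MAP` are hypothetical names. Unlike a hand-written table of similar characters, it recovers the typographic characters themselves (curly quotes, the ellipsis), which may or may not be what is wanted.

```python
def c1_to_cp1252_map():
    """Build a str.translate table from C1 code points to Windows-1252 characters."""
    table = {}
    for code_point in range(0x80, 0xA0):
        try:
            table[code_point] = bytes([code_point]).decode('cp1252')
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in Python's cp1252 codec;
            # leave those code points untouched.
            continue
    return table


CP1252_MAP = c1_to_cp1252_map()
fixed = 'some \u0093silly\u0094 chars here'.translate(CP1252_MAP)
print(fixed)  # some “silly” chars here
```

Code points without a Windows-1252 mapping stay in the text, so they would still need a separate decision (drop them or replace them with a space).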