erlang / otp

Unicode: '鿛' is categorized as unassigned codepoint #8748

Open g-andrade opened 1 month ago

g-andrade commented 1 month ago

Describe the bug

The undocumented function unicode_util:lookup/1, which I'm not supposed to use, categorizes '鿛' as "Other / not assigned" (Cn) instead of the expected "Other letter" (Lo).

Since the function is undocumented, closing this issue right away may be justified, but I thought I should report it as it may not be the intended internal behaviour.

To Reproduce

% unicode_util:lookup($鿛).
#{category => {other,not_assigned},
  canon => [],ccc => 0,compat => []}

Expected behavior

% unicode_util:lookup($鿛).
#{category => {letter,other},
  canon => [],ccc => 0,compat => []}

Affected versions

OTP 26.2.5.2

dgud commented 1 month ago

While looking for clues I discovered that UnicodeData.txt can contain ranges, so it was an easy fix.
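To illustrate the kind of range handling involved, here is a minimal sketch and not the actual OTP patch. It assumes UnicodeData.txt encodes a range as two consecutive lines whose name fields end in ", First>" and ", Last>". The module name ucd_ranges and the functions parse_lines/1 and category/2 are hypothetical, and the sample lines in the usage note are only modelled on the file format.

```erlang
-module(ucd_ranges).
-export([parse_lines/1, category/2]).

%% Hypothetical helper: turn UnicodeData.txt lines (strings) into a list
%% of {First, Last, Category} entries. A single code point becomes an
%% entry with First =:= Last; a First/Last pair becomes one range entry.
parse_lines(Lines) ->
    parse_lines(Lines, []).

parse_lines([], Acc) ->
    lists:reverse(Acc);
parse_lines([Line | Rest], Acc) ->
    {Cp, Name, Cat} = parse_line(Line),
    case lists:suffix(", First>", Name) of
        true ->
            %% The next line is expected to be the matching "Last" entry.
            %% Matching on the already bound Cat asserts that both ends
            %% of the range carry the same general category.
            [LastLine | Rest1] = Rest,
            {LastCp, _LastName, Cat} = parse_line(LastLine),
            parse_lines(Rest1, [{Cp, LastCp, Cat} | Acc]);
        false ->
            parse_lines(Rest, [{Cp, Cp, Cat} | Acc])
    end.

%% Fields are separated by ";". Field 0 is the code point in hex,
%% field 1 the name (or range identifier), field 2 the general category.
parse_line(Line) ->
    [CpHex, Name, Cat | _] = string:split(Line, ";", all),
    {list_to_integer(CpHex, 16), Name, Cat}.

%% Look up the general category of a code point. Anything not covered
%% by an entry is reported as "Cn" (not assigned).
category(Cp, Entries) ->
    case [C || {First, Last, C} <- Entries, First =< Cp, Cp =< Last] of
        [Found | _] -> Found;
        [] -> "Cn"
    end.
```

With the illustrative input ["0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;", "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;", "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;"], ucd_ranges:category($鿛, Entries) returns "Lo". A parser that ignores the First/Last markers would only know the two endpoints, and every code point between them would fall through to "Cn", which matches the behaviour reported above.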

For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets: