edicl / cl-unicode

Portable Unicode library for Common Lisp
https://edicl.github.io/cl-unicode/

Update unicode data to unicode 10.0.0 #13

Closed: neil-lindquist closed this 6 years ago

neil-lindquist commented 7 years ago

The current Unicode data is from 6.2.0, so this update would add the characters introduced over the past 5 years, including over 350 new emoji.

neil-lindquist commented 6 years ago

This pull request originally updated the data to Unicode 9, but since Unicode 10 is now available, I updated it to use those files instead.

stassats commented 6 years ago

A lot of tests are now failing.

neil-lindquist commented 6 years ago

I don't think updating the Unicode data is causing the failures. The failing tests also fail when I run the tests from the current master branch (commit 45d3ff1). I tried SBCL 1.4.2 and CLISP 2.48, both on Windows 10.

I ran the tests with my repository of cl-unicode in the local-projects directory of Quicklisp, using

    sbcl --noprint --eval "(ql:quickload :cl-unicode/test)" --eval "(asdf:operate 'asdf:test-op :cl-unicode)" --eval "(quit)"

and

    clisp -x "(ql:quickload :cl-unicode/test) (asdf:operate 'asdf:test-op :cl-unicode) (quit)"

The SBCL output was identical between the two commits. The CLISP output differed only in the memory address printed in the return value of asdf:operate (i.e. #<ASDF/PLAN:SEQUENTIAL-PLAN #x1C6B93B1>).

Output logs: sbcl_unicode5.txt sbcl_unicode10.txt clisp_unicode5.txt clisp_unicode10.txt

neil-lindquist commented 6 years ago

I realized that I didn't run clean.cmd between test runs, so the derived property tests weren't refreshed. After rerunning the tests with clean.cmd between each run, there were differences between the runs. For the most part, they are just changes in the test numbering and the addition of more passing tests, which makes sense given that characters with derived properties were added. However, there is one new failure:

    (HAS-BINARY-PROPERTY (CHARACTER-NAMED "CHAM PUNCTUATION DOUBLE DANDA" :WANT-CODE-POINT-P T) "STerm") returned NIL

I'll start looking into this failure.

neil-lindquist commented 6 years ago

In Unicode 10, the long name of STerm was renamed to Sentence_Terminal (see http://unicode.org/reports/tr44/ under PropertyAliases.txt). The short name remained STerm, so in effect the property gained a new alias. However, cl-unicode doesn't load aliases from PropertyAliases.txt, so only the long name is recognized. I suspect adding support for PropertyAliases.txt would be preferred to breaking backwards compatibility.

I think this will entail adding another lookup table built from PropertyAliases.txt and running property names through it before looking them up in the current tables. Should that be part of this pull request or a separate one?
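
To make the proposal concrete, here is a minimal sketch of such an alias table. It is not the code from this pull request: the function names (split-fields, canonicalize, read-property-aliases, resolve-property-name) are hypothetical and not part of cl-unicode, canonicalize is a simplified stand-in for whatever name normalization the library already applies, and the sketch assumes each data line of PropertyAliases.txt has the form "short name ; long name [; further aliases]" and that the existing tables are keyed by the long name.

```lisp
;; Hypothetical helpers for illustration only; not cl-unicode's API.

(defun split-fields (line)
  "Drop a trailing # comment, split LINE on semicolons, and trim each field."
  (let ((text (subseq line 0 (position #\# line)))
        (fields '())
        (start 0))
    (loop
      (let ((end (position #\; text :start start)))
        (push (string-trim '(#\Space #\Tab) (subseq text start end)) fields)
        (if end
            (setf start (1+ end))
            (return (nreverse fields)))))))

(defun canonicalize (name)
  "Simplified loose matching: ignore case, spaces, hyphens and underscores."
  (string-downcase (remove-if (lambda (c) (find c " -_")) name)))

(defun read-property-aliases (stream)
  "Return a hash table mapping every canonicalized alias to its long name."
  (let ((table (make-hash-table :test #'equal)))
    (loop for line = (read-line stream nil nil)
          while line
          do (let ((fields (split-fields line)))
               ;; Comment-only and blank lines yield fewer than two fields.
               (when (>= (length fields) 2)
                 (let ((long-name (second fields)))
                   (dolist (alias fields)
                     (setf (gethash (canonicalize alias) table) long-name))))))
    table))

(defun resolve-property-name (name table)
  "Return the long name for NAME, or NAME itself if no alias is recorded."
  (gethash (canonicalize name) table name))
```

With a table read from a stream containing the line "STerm ; Sentence_Terminal", (resolve-property-name "STerm" table) returns "Sentence_Terminal", which can then be looked up in the existing property tables. Names with no entry pass through unchanged, so existing callers that already use long names would be unaffected.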

neil-lindquist commented 6 years ago

I've added property aliasing, which fixed the test failure caused by the renamed property. I also ended up fixing a few more failing tests (such as (STRING= "Basic Latin" (CODE-BLOCK 1)) returned NIL in the simple tests) because I initially thought they were new as well. Those were fixed by a simple tweak to the regex used to split lines when reading the data.

stassats commented 6 years ago

I'm still getting:
got an unexpected error: There is no property called "Changes_When_Casemapped".

neil-lindquist commented 6 years ago

I get that failure on the current master branch as well. It's caused by the entries starting on line 5183 of DerivedCoreProperties.txt: they use the derived property Changes_When_Casemapped, which cl-unicode doesn't define (the similar failures come from similar undefined properties).

I've fixed the failures for Cased and Case_Ignorable in a branch built off this one (https://github.com/neil-lindquist/cl-unicode/tree/fix-derived-tests), but the others require NFD normalization, which cl-unicode doesn't currently implement (https://github.com/Ferada/cl-unicode/tree/decomposition-mapping does start implementing normalization).