why CL-UNICODE::*CODE-POINTS-TO-UNICODE1-NAMES* does not contain entries for ALL code points?

JVDptt commented 2 years ago

This code should print the entire (code point -> canonical name)
mapping, right?

 (with-hash-table-iterator (it CL-UNICODE::*CODE-POINTS-TO-UNICODE1-NAMES*)
   (loop
     ;; The iterator returns three values: more-p, key, value.
     (multiple-value-bind (more code-point name) (it)
       (unless more (return))
       (format t "~A ~A~%" code-point name))))

But that print-out misses some characters, such as \u2248 and \u2249. Where are they? An excerpt from the print-out obtained by running the above code:

...
8682 WHITE UP ARROW FROM BAR
8788 COLON EQUAL
8789 EQUAL COLON
8804 LESS THAN OR EQUAL TO
...

Why isn't it printing \u2248 (8776): '≈' or \u2249 (8777): '≉'? Yet unicode-name resolves them OK:

CL-USER> (CL-UNICODE:unicode-name #\u2248)
-> "ALMOST EQUAL TO"
CL-USER> (CL-UNICODE:unicode-name #\u2249)
-> "NOT ALMOST EQUAL TO"

And they are in Unicode v1:

CL-USER> (CL-UNICODE:age #\u2248)
-> (1 1)

So that symbol is in Unicode v1, so it should have a unicode1 name, and hence be in the hash table? What am I missing? Why doesn't the print-out produced by the above code include #\ALMOST_EQUAL_TO?

Just wondering what the rules for inclusion in that table were, and whether there is a more complete way of printing ALL recognized code points and names?

Is cl-unicode somehow checking my locale and deciding which version of Unicode names to include in the table, and omitting some because of version issues?

It is very easy to print out a Unicode table with e.g. bash, but not so easy to browse it by symbol name / meaning :-)

Thanks for cl-unicode! Best Regards, Jason

gefjon commented 2 years ago

*code-points-to-unicode1-names* is an internal variable, and shouldn't be treated as part of CL-UNICODE's interface.

That map contains only Unicode v1.0 code points, and as age is telling you, the characters you're asking about were introduced in Unicode v1.1.
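
The same comparison can be made through exported functions instead of the internal table. A minimal sketch, assuming cl-unicode is already loaded (e.g. via Quicklisp) and that cl-unicode:unicode1-name is exported and accepts integer code points the way unicode-name does; describe-code-point is a hypothetical helper name, not part of the library:

```lisp
;; Assumes cl-unicode is loaded, e.g. (ql:quickload :cl-unicode).
;; DESCRIBE-CODE-POINT is a hypothetical helper, not part of cl-unicode.
(defun describe-code-point (code-point &optional (stream *standard-output*))
  "Print CODE-POINT with its age, current name, and Unicode 1.0 name (or NIL)."
  (format stream "~&U+~4,'0X age=~A name=~S unicode1-name=~S~%"
          code-point
          (cl-unicode:age code-point)
          (cl-unicode:unicode-name code-point)
          (cl-unicode:unicode1-name code-point)))

;; Two code points missing from the print-out, plus one that appears in it.
(dolist (code-point '(#x2248 #x2249 #x2254))
  (describe-code-point code-point))
```

Printing the three values side by side shows, for each code point, whether it has a Unicode 1.0 name at all, without reaching into internals.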

If you want to print all the Unicode characters known to CL-UNICODE, you can do:

(defun print-all-unicode-chars (&optional (stream *standard-output*))
  (loop :for i :below cl-unicode:+code-point-limit+
        :for name := (cl-unicode:unicode-name i)
        :when name
          :do (format stream "~&~d ~a ~a~%" i (cl-unicode:age i) name)))
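
For browsing by name or meaning, the same loop can filter on a substring. A rough sketch along the same lines, using only the calls shown above; find-unicode-names is a hypothetical name, and the linear scan over every code point is slow but simple:

```lisp
;; Assumes cl-unicode is loaded, e.g. (ql:quickload :cl-unicode).
;; FIND-UNICODE-NAMES is a hypothetical helper, not part of cl-unicode.
(defun find-unicode-names (substring &optional (stream *standard-output*))
  "Print every code point whose name contains SUBSTRING, ignoring case."
  (loop :for i :below cl-unicode:+code-point-limit+
        :for name := (cl-unicode:unicode-name i)
        ;; CHAR-EQUAL makes the substring match case-insensitive.
        :when (and name (search substring name :test #'char-equal))
          :do (format stream "~&~d ~a~%" i name)))

;; Example: list every character whose name mentions "almost equal".
(find-unicode-names "almost equal")
```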

EDIT: markdown formatting

JVDptt commented 2 years ago

Many thanks, Phoebe - yes, that clarifies many things. All the best, Jason


edicl / cl-unicode

why CL-UNICODE::*CODE-POINTS-TO-UNICODE1-NAMES* does not contain entries for ALL code points? #33
