Open seisman opened 1 year ago
TODO list after PR #2584:
I was trying to get the character ā (Latin small letter a with macron) to plot using one of ISO-8859-4/ISO-8859-10/ISO-8859-13 in https://github.com/GenericMappingTools/pygmt/pull/2641#discussion_r1305245499, but it doesn't work when setting pygmt.config(PS_CHAR_ENCODING="ISO-8859-4"), because we need to use --PS_CHAR_ENCODING inline according to https://docs.generic-mapping-tools.org/6.4/gmt.conf.html#term-PS_CHAR_ENCODING:
Note: Normally the character set is written as part of the PostScript header. If you need to switch to another character set for a later overlay then you must use --PS_CHAR_ENCODING=encoding on the command line and not via gmt gmtset.
The workaround was to use the composite character @!a\225 following https://docs.generic-mapping-tools.org/6.4/tutorial/session-2.html#plotting-text-strings. I'm not sure if it's worth adding --PS_CHAR_ENCODING as an option to plotting methods to make it easier. I think we discussed a while ago not to support double-dash (--) inline options?
After PRs #2584, #2638, #3192, and #3199, PyGMT already provides basic support for non-ASCII characters.
In short, we're maintaining a big dictionary mapping non-ASCII characters to their octal codes, so users can pass a character like ɑ (alpha) and PyGMT will map it to @~\\141@~. There is no direct way to type ɑ using a keyboard (maybe there are shortcuts, but who can remember them all?), so users usually need to copy and paste from another source. However, many characters look similar. For example:
In [19]: "Ω" == "Ω"
Out[19]: False
In [20]: "Δ" == "∆"
Out[20]: False
In [21]: import unicodedata
In [22]: unicodedata.name("Ω")
Out[22]: 'OHM SIGN'
In [23]: unicodedata.name("Ω")
Out[23]: 'GREEK CAPITAL LETTER OMEGA'
In [24]: unicodedata.name("Δ")
Out[24]: 'GREEK CAPITAL LETTER DELTA'
In [25]: unicodedata.name("∆")
Out[25]: 'INCREMENT'
Since these characters are so similar, users may use the "incorrect" one and then get surprising results. Actually, we're using some incorrect characters in our mapping dictionary.
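One way to check which character a pasted string really contains is to list its code points and Unicode names. The following is a minimal sketch, not part of PyGMT; the helper name describe is hypothetical, and the lookalike characters are written as escape sequences so the code points are unambiguous.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text.

    Hypothetical helper for illustration only; it is not a PyGMT function.
    """
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# OHM SIGN, GREEK CAPITAL LETTER OMEGA, INCREMENT, GREEK CAPITAL LETTER DELTA
lookalikes = "\u2126\u03a9\u2206\u0394"
for ch, codepoint, name in describe(lookalikes):
    print(ch, codepoint, name)
```

Running this on a string copied from a web page shows at a glance whether it holds the Greek letter or its lookalike.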
To solve the problem, we need character tables that users can copy. The official GMT documentation provides the tables (https://docs.generic-mapping-tools.org/dev/reference/octal-codes.html) in PNG/PDF format but they're not easy to copy. Better tables are available at
However, these tables are "incomplete" (some characters are missing) compared to the GMT ones. For example, in the Symbol table, \322 is ©, but in https://www.compart.com/en/unicode/charsets/Adobe-Symbol-Encoding, it's mapped to the Unicode character U+F6DA, which belongs to the "Private Use Area" block (I guess there must be some historical reasons behind this). So, instead of using the private, invisible U+F6DA, we should map © (U+00A9) to @~\\322@~. It also means we need to maintain our own character tables.
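The correction described above could be sketched roughly as follows. This is an illustration under assumptions, not how the mapping files are actually generated: the two-entry table, the overrides dictionary, and the function build_mapping are all hypothetical, and the alpha entry is assumed to be the Greek small letter alpha (U+03B1).

```python
import unicodedata

# Hypothetical excerpt of an externally sourced Symbol table:
# octal code -> Unicode character.
external_symbol_table = {
    0o141: "\u03b1",  # GREEK SMALL LETTER ALPHA (assumed code point)
    0o322: "\uf6da",  # Private Use Area code point, invisible in most fonts
}

# Manual corrections that replace private-use code points with the
# characters users would actually type.
overrides = {0o322: "\u00a9"}  # COPYRIGHT SIGN


def build_mapping(table, corrections):
    """Build a character -> GMT Symbol escape mapping, applying corrections."""
    merged = {**table, **corrections}
    mapping = {}
    for code, char in merged.items():
        # Category "Co" marks Private Use Area code points.
        if unicodedata.category(char) == "Co":
            raise ValueError(f"octal {code:03o} still maps to a private-use character")
        # The resulting string contains a single backslash, e.g. @~\322@~.
        mapping[char] = f"@~\\{code:03o}@~"
    return mapping


symbol_mapping = build_mapping(external_symbol_table, overrides)
print(symbol_mapping)
```

Raising an error for any remaining private-use character makes missing corrections visible when the table is built, rather than when a figure comes out wrong.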
This repository https://github.com/seisman/GMT-octal-codes maintains the mapping files that can map Unicode characters to GMT octal codes. Check the README files in that repository for how the mapping files are created.
With the well-maintained mapping files, we can refactor the mapping dictionary in the PyGMT project and add character tables for the supported encodings, as done in #3206.
Problems
Due to limitations of the PostScript language, GMT can only work with ASCII characters and a small set of non-ASCII characters. See https://docs.generic-mapping-tools.org/latest/cookbook/octal-codes.html for the full list of characters that PostScript/GMT/PyGMT can accept.
These non-ASCII characters must be specified using their octal codes or character escape sequences. A few non-ASCII characters (e.g., ü, Î) are allowed directly, and GMT substitutes them with the correct PostScript octal codes.
Users who don't know the limitations may pass non-ASCII characters directly in the arguments. For example:
The above script produces this "surprising" figure:
So, if users want to add a non-ASCII character to a plot, they must know the limitations and go to this page https://docs.generic-mapping-tools.org/latest/cookbook/octal-codes.html, look for the character in the four tables, and figure out the corresponding octal code (\260 for the symbol °), which is tedious and not easy.
After finding the octal code, users may think changing ° to \260 should work, but it still produces the same "surprising" figure, because the Python interpreter recognizes \260 first, and converts it to ° before passing it to the GMT API. So, users have to use double backslashes or raw strings.
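The escape handling described here can be checked in plain Python, without PyGMT. The snippet below is a small illustration; the variable names and the label string are made up for the example.

```python
# "\260" is an octal escape: Python converts it to the single character
# U+00B0 (the degree sign) before any library ever sees the string.
plain = "\260"

# A doubled backslash or a raw string keeps the four characters \, 2, 6, 0,
# which is what GMT needs to receive.
escaped = "\\260"
raw = r"\260"

print(plain == "\u00b0", len(plain))
print(escaped == raw, len(raw))

# The same applies inside a longer label.
label = r"Distance (\260)"
print(label)
```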
Solutions
Since Python works well with non-ASCII characters (actually it works with any Unicode character), it's possible to pass ° in Python, and PyGMT should substitute the non-ASCII characters with the corresponding octal codes.
Here are some tests in Python:
So, if we can do the substitutions/conversions internally, we can support non-ASCII characters better. The simplest solution is to define a big dictionary that maps non-ASCII characters (e.g., °) to octal codes (e.g., \260). Better and more clever solutions are also possible.
Notes about the possible limitations of the solutions
Non-ASCII characters can be used in many cases:
1. frame="WSen+tTime (s) vs Distance (°)"
2. fig.text(x=0, y=0, text="Distance (°)")
3. 0 0 Distance (°)
The above solution should work well for case 1, may or may not work (depending on the implementation) for case 2, and likely won't work for case 3.
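The dictionary-based substitution proposed under Solutions could look roughly like the sketch below. It is only an illustration of the idea, not PyGMT's implementation: the names NON_ASCII_TO_OCTAL, to_octal, and unsupported are hypothetical, and the table holds just the one pair mentioned in this issue (° and \260).

```python
# Hypothetical, deliberately tiny mapping; a real table would cover every
# character listed in the GMT octal-code tables.
NON_ASCII_TO_OCTAL = {
    "\u00b0": "\\260",  # DEGREE SIGN -> the four characters \260
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(NON_ASCII_TO_OCTAL)


def to_octal(text):
    """Replace known non-ASCII characters with their octal escape sequences."""
    return text.translate(_TABLE)


def unsupported(text):
    """List non-ASCII characters in text that the mapping cannot convert."""
    return sorted(
        {ch for ch in text if not ch.isascii() and ch not in NON_ASCII_TO_OCTAL}
    )


print(to_octal("WSen+tTime (s) vs Distance (\u00b0)"))
print(unsupported("Distance (\u00b0) \u0101"))
```

Applying such a function to string arguments like frame (case 1) is straightforward. Text passed to fig.text and lines read from input files would need the same call at other points in the code, which this sketch does not cover.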
Are you willing to help implement and maintain this feature?
Yes, but more discussions are needed.