PCGen / pcgen

Main code and data development for pcgen program release
http://pcgen.org
GNU Lesser General Public License v2.1

Square box characters in UI #4257

Closed mjmeans closed 5 years ago

mjmeans commented 6 years ago

@grimreaper I see there are a lot of code PRs lately. There is a nagging UI problem that can be easily fixed in code. Particularly in Pathfinder, there are a lot of descriptions that show square boxes because the character code doesn't exist in the UI font. Sometimes this also occurs in PDF export. An example of this is the ’ versus ' characters. The former (angled typographical apostrophe) shows a square box in the UI and the latter (ASCII apostrophe) does not. One specific example of this is in Pathfinder / Ultimate Magic / um_spells.lst in the description for the spell "Ki Leach". I have checked several sources and there are hundreds of these character issues. Some of the apostrophe issues are in an ability KEY, so fixing those would require a massive number of migration entries. Besides the apostrophe, I see that the EMDASH Unicode character also has a problem in the UI. Some characters also have a problem in PDF export. I believe there could be an easy CODE solution. Are you able to do a quick survey of Unicode support and see if there is an easy fix for this?
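For anyone who wants to survey how many of these characters a data file contains, here is a minimal sketch in Java. It is not PCGen code: the class name `NonAsciiScanner`, the method `findNonAscii`, and the sample description string are all hypothetical, and the sketch only reports code points above U+007F without deciding whether a given font can render them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PCGen: lists every non-ASCII code point in a string.
public class NonAsciiScanner {

    /** Returns one "U+XXXX at index N" entry for each code point above U+007F. */
    public static List<String> findNonAscii(String text) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                hits.add(String.format(Locale.ROOT, "U+%04X at index %d", cp, i));
            }
            // Advance by one or two chars so surrogate pairs are counted once.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample line containing a typographic apostrophe and an em dash.
        String description = "The target\u2019s ki is drained \u2014 see text.";
        for (String hit : findNonAscii(description)) {
            System.out.println(hit);
        }
    }
}
```

Running something like this over each line of a .lst file would give a list of candidates to review by hand, which is the copy-paste problem described above.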

LegacyKing commented 5 years ago

This should be fixed with me fixing the bad characters.

LegacyKing commented 5 years ago

@Zaister are you able to address this concern?

grimreaper commented 5 years ago

I think @LegacyKing solved this. If not, please re-open.

mjmeans commented 5 years ago

If the solution in this case was to fix the bad characters in DATA, then that is perhaps an immediate solution, but not, ultimately, a great one. It keeps in place a DATA contract that works against the common copy-paste strategy of creating new DATA sources, and it leaves DATA needing to examine each and every item shown in the UI and OS. It's not immediately obvious when copying/pasting a source that these Unicode characters exist, especially when some of them look nearly identical to the non-Unicode versions.

I think a better solution is to officially support UTF-16 Unicode characters, which I think is necessary for localization of DATA to different written languages. I believe I saw someone post that localization support is on the horizon. If so, I'm hopeful that we will eventually be able to use any UTF-16 characters in the various textual output tags (NAME, DESCRIPTION, ASPECT, etc.) intended to be visible in the UI or OS. Even the non-multilingual portions of the Unicode 'private use area' could be utilized if PCGen could be deployed with an operating-system-agnostic open source symbol font.

thpr commented 5 years ago

Mark,

We would love your support doing that code or paying some of us to work on the code. Until then we have a VERY limited code team, and thus we have what we have.

mjmeans commented 5 years ago

As I have said before, I don't know Java coding. About all I could contribute would be to investigate OpenType font sets to see which ones I think might be usable and organize them in some kind of list. However, I know nothing about any game system other than Pathfinder, so my analysis would be biased toward Pathfinder support and therefore not necessarily complete or authoritative. The utility of any work I might do in that area is therefore uncertain. But it could serve as a first-pass examination of available Unicode font options, one that would become stale over time. If localization is actually planned for 7.0, I would be willing to spend some time investigating the open source fonts available and the general Unicode font substitution schemes used when developing cross-platform applications, and report on that.

grimreaper commented 5 years ago

We will never support anything other than UTF-8. This is sufficient for localization to all languages. Nothing specific is planned w.r.t. localization, but we're happy to fix issues as we find them.
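As a small illustration of the point about UTF-8 being sufficient: UTF-8 can encode every Unicode code point, including private use and supplementary-plane characters. The sketch below is illustrative only and not PCGen code; the class name `Utf8RoundTrip` and the sample string are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encodes a string to UTF-8 bytes and decodes it again.
public class Utf8RoundTrip {

    /** Encodes text as UTF-8 and decodes the bytes back into a String. */
    public static String roundTrip(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Typographic apostrophe, em dash, three CJK characters,
        // a private use code point (U+E000), and a supplementary-plane
        // CJK character (U+20000).
        String sample = "\u2019 \u2014 \u65E5\u672C\u8A9E \uE000 "
                + new String(Character.toChars(0x20000));
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

Whether a given character then shows up as a glyph or a square box depends on the fonts available, not on the encoding.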

I'm not sure why you're talking about fonts or how that's relevant here.

mjmeans commented 5 years ago

re: relevance of fonts. They matter because 1) not all operating systems will have fonts installed capable of printing all possible Unicode characters; and 2) symbol fonts that use the "private use areas" of the encoding are specific to each operating system.

The private use area of Unicode's Basic Multilingual Plane ranges from U+E000 to U+F8FF. Each font is permitted to use ANY glyphs for this range, and they aren't expected to be the same characters between fonts. Many of the private use codes are used for less frequently used or archaic multilingual characters not present in the standard encoding, particularly for Japanese and East Asian character sets. So this range may be necessary when supporting localization to other languages, particularly East Asian languages, and the support will be specific to a particular font face and not otherwise portable between operating systems. Additionally, this range overlaps symbol characters in symbol fonts. So when a program is expected to display these special extended localized characters but is installed on an operating system that doesn't have a specific font installed, the operating system may substitute the wrong font face, or even a symbol font face, to resolve this range.

The consequences of those are 1) if an operating system doesn't have a particular font for a particular language, then the operating system will use a substitution font that will produce unexpected results or more empty square boxes or strange symbols; and 2) when the intent was a specific symbol, the operating system may substitute a different symbol font producing a different symbol.

So, if PCGen is going to support any form of Unicode encoding, it has to either decide to not support the "private use area" of the Unicode encoding scheme (and therefore only partially support East Asian languages), or it has to provide its own open source font(s) for resolution of the extended Asian glyphs and symbols expected to be used.

mjmeans commented 5 years ago

More information on this private use area of Unicode and its problems is available at: http://www.unicode.org/faq/private_use.html

grimreaper commented 5 years ago

re: "it has to either decide to not support the "private use area" of the Unicode encoding scheme (and therefore only partially support East Asian languages)"

We don't make use of the private use area symbols and these are entirely unrelated to the type of localization relevant to pcgen. We don't support and never have supported any encoding other than UTF-8 (and possibly ASCII).

Embedding fonts in pcgen to make up for OS deficiencies might be required. However, nowadays, I suspect the OS will keep up faster than we will.

mjmeans commented 5 years ago

As a coder, I'm assuming that you know all this already, but I'm adding it for completeness and for the benefit of any others that might want detail on the issue.

re: "We don't make use of the private use area symbols and these are entirely unrelated to the type of localization relevant to pcgen.".

That is true only if localization in PCGen will never need to use any of the infrequently used localized characters that only occur in the private use area. This includes the Basic Multilingual Plane portion of the private use area. https://en.wikipedia.org/wiki/Private_Use_Areas

Without using any of the Unicode private use area you will not get complete support for East Asian languages when using UTF-8. UTF-16/32 solves that, but at the expense of ASCII compatibility. A good compromise, in my opinion, would be to follow what W3C has done and use UTF-8 as the mandatory coding with the ability to use UTF-16 in rare cases. The support is application specific, however. It's not a function of the operating system beyond the simple ability to organize font faces, call up a font that mostly matches what is requested, and store and process the codes. The application (in this case, a web browser) decides which font to request from the operating system based on its own default fonts and what is specified in the HTML. The operating system isn't the entire solution and can't be the entire solution for PCGen either, because automatic font substitution doesn't always work. Of course, support of the "private use area" can, and probably should, be deferred until East Asian languages are on the horizon for PCGen.

However, if Unicode eventually does become supported, it opens the can of worms that OS-provided font substitution causes. For example, let's say that the data is expecting to use the Helvetica font and some of its "private use area" glyphs to resolve some needed extended localized characters (I think there are even a few less frequently used Latin and Cyrillic characters in some fonts in that area too). Anyway, if the OS doesn't have Helvetica installed it would normally substitute any other installed font designated to be in the same sans-serif font family as Helvetica. When that happens, the substituted font probably won't have the expected glyphs at the same codes in its private use area. This can be mitigated by PCGen detecting any "private use area" codes in the displayed data and either telling the operating system to disable automatic font substitution, or replacing the code points with the standard undefined glyph when presenting that text. This would at least display the small square boxes, instead of displaying a wrong glyph.
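The second mitigation (detecting private use code points and replacing them before display) could be sketched roughly as follows. This is a hypothetical helper, not PCGen code: the class and method names are made up, and using U+FFFD (REPLACEMENT CHARACTER) as the stand-in for the "undefined glyph" is an assumption, since how a missing glyph is actually drawn is up to the font.

```java
// Hypothetical helper, not part of PCGen: masks private use code points before display.
public class PrivateUseFilter {

    // Assumption: U+FFFD is used as the visible stand-in for an unsupported character.
    private static final int REPLACEMENT = 0xFFFD;

    /** True for code points in any Unicode private use area (general category Co). */
    public static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    /** Returns a copy of text with every private use code point replaced. */
    public static String maskPrivateUse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isPrivateUse(cp) ? REPLACEMENT : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // U+E000 is the first code point of the BMP private use area.
        String masked = maskPrivateUse("Symbol: \uE000");
        System.out.println(masked.equals("Symbol: \uFFFD"));
    }
}
```

Checking the general category rather than hard-coding U+E000 to U+F8FF also covers the two supplementary private use planes.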

Lastly, concerning UTF-8 and UTF-16. Using UTF-16 has performance benefits in Windows, Java and JavaScript because they internally store text as UTF-16 and no conversion is necessary when loading data already in UTF-16 format. However, the wiki article https://en.wikipedia.org/wiki/UTF-16#U+10000_to_U+10FFFF states that Linux and MacOS rarely use UTF-16. I did not investigate the performance implications of Java (which internally uses UTF-16) in Linux/MacOS (which apparently internally expects UTF-8).
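One measurable part of the UTF-8 versus UTF-16 trade-off is encoded size. The sketch below is illustrative only (the class name `EncodedSize` and the sample strings are assumptions, not PCGen code or data). It measures file-style byte counts, not how the JVM stores strings in memory, and it makes no claim about load-time performance.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares the encoded byte length of a string in UTF-8 and UTF-16.
public class EncodedSize {

    /** Number of bytes needed to store text as UTF-8. */
    public static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes needed to store text as UTF-16 (little-endian, no byte order mark). */
    public static int utf16Size(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "DESC:Drains ki from the target"; // made-up, ASCII-only line
        String apostrophe = "\u2019";                    // typographic apostrophe
        System.out.println(utf8Size(ascii) + " vs " + utf16Size(ascii));
        System.out.println(utf8Size(apostrophe) + " vs " + utf16Size(apostrophe));
    }
}
```

ASCII-only text takes one byte per character in UTF-8 and two in UTF-16, while a character such as the typographic apostrophe takes three bytes in UTF-8 and two in UTF-16.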