bcgov / inclusive-names-service

Repository of code and other information useful to software developers and system managers wishing to make systems capable of storing Unicode names of people, places, and businesses
Apache License 2.0

Lifecycle:Experimental

Techniques for Supporting Indigenous Language Text in Computer Systems

This site includes code and tips for systems developers and maintainers who need to ensure that their computer systems can properly input, store, process, and display or export the Unicode characters and graphemes used in Indigenous language text. It also includes tips on supporting Indigenous language text when using commercial off-the-shelf (COTS) products.

Learn more about Including Indigenous languages in government records, systems and services - Province of British Columbia.

Programming Languages

Some older programming languages assume an equivalence between characters and bytes (i.e., one character requires exactly one byte of storage). With these languages, handling multi-byte or variable-length encodings such as UTF-8 requires special libraries or techniques. The following link provides the details.

Programming Languages
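The character/byte distinction described above can be illustrated with a short Python sketch. The sample word is chosen here for illustration and is not taken from this repository's test data; the point is only that the number of code points, the number of UTF-8 bytes, and the normalization form can all differ for the same visible text.

```python
import unicodedata

# Illustrative sample (an assumption, not from the repository's test data):
# the language name SENĆOŦEN, written with escapes so the code points are unambiguous.
name = "SEN\u0106O\u0166EN"

utf8_bytes = name.encode("utf-8")
print(len(name))        # 8 code points
print(len(utf8_bytes))  # 10 bytes: Ć and Ŧ each need two bytes in UTF-8

# The same visible text in decomposed form (NFD) uses more code points,
# because Ć becomes C followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", name)
print(len(decomposed))  # 9 code points
print(unicodedata.normalize("NFC", decomposed) == name)  # True

# Cutting at a byte offset can split a character in half.
try:
    utf8_bytes[:4].decode("utf-8")
    truncation_is_safe = True
except UnicodeDecodeError:
    truncation_is_safe = False
print(truncation_is_safe)  # False
```

Code that sizes fields, truncates values, or compares strings by byte count is therefore a likely place to look first when a system mishandles such names.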

Database Management Systems

Systems that process Unicode data and use database management systems (DBMSs) need those DBMSs to be configured to store data using a Unicode encoding. The following link provides guidance for configuring a DBMS to use the UTF-8 encoding.

Databases
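As a small, hedged illustration of what "configured to store data using a Unicode encoding" means in practice, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is only a stand-in chosen because it needs no server; the table name and sample value are hypothetical, and the configuration steps for other DBMSs differ and are covered by the linked guidance.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")

# The text encoding must be chosen before any tables are created.
conn.execute("PRAGMA encoding = 'UTF-8'")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")

# Hypothetical sample value: SENĆOŦEN, written with escapes.
sample = "SEN\u0106O\u0166EN"
conn.execute("INSERT INTO names (name) VALUES (?)", (sample,))

# Confirm the configured encoding and that the value round-trips unchanged.
encoding = conn.execute("PRAGMA encoding").fetchone()[0]
stored = conn.execute("SELECT name FROM names").fetchone()[0]

# length() counts characters for text and bytes for blobs.
char_len, byte_len = conn.execute(
    "SELECT length(name), length(CAST(name AS BLOB)) FROM names"
).fetchone()

print(encoding, stored == sample, char_len, byte_len)
conn.close()
```

A round-trip check like this (write a known value, read it back, compare) is a quick way to confirm that a database, its driver, and its connection settings all agree on the encoding.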

Commercial Off the Shelf (COTS) Products

COTS products in use in the BC Government vary in their support for Unicode, and in particular for Indigenous language text. The following link provides guidance on using these products.

Using Commercial-off-the-shelf Products

Mainframe Systems

Depending on how the elements are configured, IBM mainframe systems may or may not be able to support Indigenous language characters.

Configuring mainframe systems to support Indigenous language text

Some Test Data

The following link points to a directory containing data files that have Unicode data.

Test Data

File Formats

This section provides guidance on handling Unicode data using various file formats (e.g., CSV, Excel).

File Formats
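For CSV in particular, one commonly used technique is to write the file as UTF-8 with a byte order mark (Python's `utf-8-sig` codec), since spreadsheet applications often rely on that mark to recognise the encoding. The sketch below is an illustration under that assumption, not this repository's own procedure; the function name and sample rows are hypothetical, and it writes to memory rather than to disk so it can be run anywhere.

```python
import csv
import io

def rows_to_csv_bytes(rows):
    """Serialize rows to CSV bytes encoded as UTF-8 with a leading BOM."""
    buffer = io.BytesIO()
    # newline="" lets the csv module control line endings itself.
    text = io.TextIOWrapper(buffer, encoding="utf-8-sig", newline="")
    csv.writer(text).writerows(rows)
    text.flush()
    return buffer.getvalue()

# Hypothetical sample rows; the name is SENĆOŦEN written with escapes.
rows = [["id", "name"], ["1", "SEN\u0106O\u0166EN"]]
data = rows_to_csv_bytes(rows)

# The output starts with the UTF-8 byte order mark.
has_bom = data.startswith(b"\xef\xbb\xbf")

# Reading with the same codec strips the BOM and restores the original rows.
read_back = list(csv.reader(io.StringIO(data.decode("utf-8-sig"), newline="")))
print(has_bom, read_back == rows)
```

To write to disk instead, the same codec can be passed to `open(path, "w", encoding="utf-8-sig", newline="")`.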

Data Transfer Protocols

This section provides guidance on handling Unicode data when using various data transfer protocols (e.g., FTP).

Data Transfer Protocols

Data Flow Analysis Primer and Example

This section introduces the subject of data flow analysis, which can be used to evaluate whether a particular system might encounter issues when working with Unicode data.

Data Flow Analysis
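One way to picture the idea is to list each stage a value passes through along with the encoding that stage uses, then find the first stage that cannot represent the value. The sketch below is only an illustration of that idea and is not the method described in the linked primer; the stage names and encodings are hypothetical.

```python
def first_lossy_stage(text, stages):
    """Return the name of the first stage whose encoding cannot represent
    `text`, or None if every stage can."""
    for stage_name, encoding in stages:
        try:
            text.encode(encoding)
        except UnicodeEncodeError:
            return stage_name
    return None

# Hypothetical data flow: (stage name, encoding used at that stage).
stages = [
    ("web form", "utf-8"),
    ("database", "utf-8"),
    ("legacy export", "cp1252"),
    ("report", "utf-8"),
]

# Hypothetical sample value: SENĆOŦEN, written with escapes.
sample = "SEN\u0106O\u0166EN"

print(first_lossy_stage(sample, stages))      # legacy export
print(first_lossy_stage("Victoria", stages))  # None
```

Note that a value is only as safe as the weakest stage in its path: the later UTF-8 stage does not help once an earlier stage has failed or substituted characters.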

How to Learn More

Some excellent articles:

W3C Internationalization Working Group Blog

"The W3C Internationalization (I18n) Activity works with W3C working groups and liaises with other organizations to make it possible to use Web technologies with different languages, scripts, and cultures. From this page you can find articles and other resources about Web internationalization, and information about the groups that make up the Activity. "

Migrating to Unicode

This is a comprehensive article from W3C, covering many of the issues that one will find when adapting an existing system to work with Unicode data.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

The author says: "In this article I'll fill you in on exactly what every working programmer should know. All that stuff about 'plain text = ascii = characters are 8 bits' is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article."

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

"This article is about encodings and character sets. An article by Joel Spolsky entitled The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a nice introduction to the topic and I greatly enjoy reading it every once in a while. I hesitate to refer people to it who have trouble understanding encoding problems though since, while entertaining, it is pretty light on actual technical details. I hope this article can shed some more light on what exactly an encoding is and just why all your text screws up when you least need it. This article is aimed at developers (with a focus on PHP), but any computer user should be able to benefit from it."

More articles