digitallinguistics / transliterate

A small JavaScript library for transliterating strings between different orthographies
MIT License

Transliterate

A small JavaScript library for transliterating and/or sanitizing strings. Tested against a variety of edge cases and unusual inputs.


Overview

This library is useful for linguists and data analysts working with language data. It can be used to convert a string from one writing system to another (a process known as transliteration), or to remove unwanted characters or sequences of characters from a string (a process known as sanitization). This library handles common problems that arise during transliteration and sanitization, including bleeding and feeding issues.

Citation & Attribution

This library is maintained by Daniel W. Hieber. You can cite this library with its DOI using the following model:

Hieber, Daniel W. 2019. digitallinguistics/transliterate. DOI: 10.5281/zenodo.2550468.

Each version of this library is archived on this project's Zenodo page.

Installation

Install with npm or yarn:

npm install @digitallinguistics/transliterate # npm
yarn add @digitallinguistics/transliterate    # yarn

Importing the Library

In the browser, include the library in your HTML (adjust the src to point to the location of the transliterate.js file in your project):

<script src="transliterate.js" type="module"></script>

In Node, simply import the library:

import { transliterate } from '@digitallinguistics/transliterate';

Basic Usage

The transliterate library exports an object with four methods: transliterate, Transliterator, sanitize, and Sanitizer.

The sanitize and Sanitizer exports are essentially just aliases for transliterate and Transliterator respectively.

To transliterate a string, use the transliterate method:

// Import the "transliterate" method from the library
import { transliterate } from '@digitallinguistics/transliterate';

// The list of substitutions to make
const substitutions = {
  p: `b`,
  t: `d`,
  k: `g`,
};

// The string to transliterate
const input = `patak`;

// Transliterate the string
const output = transliterate(input, substitutions);

console.log(output); // --> "badag"

To save a set of transliteration rules for reuse on more than one string, use the Transliterator class:

// Import the Transliterator class
import { Transliterator } from '@digitallinguistics/transliterate';

// The list of substitutions to use for transliteration
const substitutions = {
  p: `b`,
  t: `d`,
  k: `g`,
};

// Create a transliterate function that always
// applies the same substitutions
const transliterate = new Transliterator(substitutions);

// The string to transliterate
const input = `patak`;

// Transliterate the string
const output = transliterate(input);

console.log(output); // --> "badag"

View the entire API for this library here.

Working with Substitution Rules

The transliterate library already handles several tricky cases on your behalf. For example, say you have the following substitution rules, and want to use them on the string abc:

| Input | Output |
| ----- | ------ |
| a     | b      |
| b     | c      |

In this case, you probably intend the output to be bcc. But if you apply the a → b rule before the b → c rule, you get the output ccc. This is called a feeding problem. The transliterate library automatically avoids feeding problems, so that you get the expected result bcc rather than ccc.
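
The feeding problem can be reproduced in a few lines of plain JavaScript. The sketch below does not use the library and is not a description of how it works internally; `naiveReplace`, `singlePassReplace`, and `escapeRegExp` are hypothetical helpers written only for this illustration. It contrasts applying rules one after another with matching all rule inputs against the original string in a single pass.

```javascript
// Hypothetical helpers for illustration only; not part of the library's API.

// Escape characters that have a special meaning inside a regular expression.
const escapeRegExp = text => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Naive approach: apply each rule to the whole string, one rule after another.
// The output of an earlier rule can be picked up by a later rule.
function naiveReplace(input, rules) {
  let output = input;
  for (const [from, to] of Object.entries(rules)) {
    output = output.split(from).join(to);
  }
  return output;
}

// Single-pass approach: every rule input is matched against the ORIGINAL
// string, so text produced by one rule is never re-read by another rule.
function singlePassReplace(input, rules) {
  const keys = Object.keys(rules);
  if (keys.length === 0) return input;
  const pattern = new RegExp(keys.map(escapeRegExp).join('|'), 'g');
  return input.replace(pattern, match => rules[match]);
}

const rules = { a: 'b', b: 'c' };

console.log(naiveReplace('abc', rules));      // --> "ccc" (feeding)
console.log(singlePassReplace('abc', rules)); // --> "bcc"
```

In the naive version, the b produced by the a → b rule is consumed by the b → c rule, which is exactly the feeding behavior described above.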

Now say that you want to apply the following rules to the string abacad.

| Input | Output |
| ----- | ------ |
| a     | b      |
| ac    | d      |

You probably intend the output to be bbdbd. But if you apply the a → b rule before the ac → d rule, every a is replaced first, leaving no ac sequence for the second rule to match, and you get the output bbbcbd. This is called a bleeding problem. The transliterate library automatically avoids bleeding problems as well, so that you get the expected result bbdbd rather than bbbcbd.
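
The bleeding problem can be sketched the same way. The code below is plain JavaScript that does not call the library, and it is only one possible way to avoid bleeding, not necessarily the library's own method; `applyInOrder`, `longestFirstReplace`, and `escapeRegExp` are hypothetical names. The assumption here is that trying longer rule inputs before shorter ones gives the intended result.

```javascript
// Hypothetical helpers for illustration only; not part of the library's API.

// Escape characters that have a special meaning inside a regular expression.
const escapeRegExp = text => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Naive approach: apply the rules one at a time, in the order given.
function applyInOrder(input, orderedRules) {
  return orderedRules.reduce(
    (output, [from, to]) => output.split(from).join(to),
    input,
  );
}

// Single pass over the original string, trying longer rule inputs first,
// so that "ac" is matched before the shorter "a" can consume its first letter.
function longestFirstReplace(input, rules) {
  const keys = Object.keys(rules).sort((x, y) => y.length - x.length);
  if (keys.length === 0) return input;
  const pattern = new RegExp(keys.map(escapeRegExp).join('|'), 'g');
  return input.replace(pattern, match => rules[match]);
}

const rules = { a: 'b', ac: 'd' };

console.log(applyInOrder('abacad', Object.entries(rules))); // --> "bbbcbd" (bleeding)
console.log(longestFirstReplace('abacad', rules));          // --> "bbdbd"
```

Sorting by length matters because a regular expression alternation tries its alternatives from left to right, so placing ac before a lets the longer sequence win wherever both could match.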

Here are some things to remember about how the transliterate library applies substitutions:

Sometimes the way you want to transliterate a character or sequence of characters will depend on context. For example, you might want a to sometimes become b, and other times become c. In this case you have several options: