RichardLitt / low-resource-languages

Resources for conservation, development, and documentation of low resource (human) languages.
Creative Commons Attribution Share Alike 4.0 International

Low Resource Languages

According to some estimates, half of the roughly 7,000 languages currently spoken are expected to become extinct this century. However, academics, independent scholars, organizations, communities, and individuals are doing a lot of work to stop or slow this trend. This list collects open source code that is useful for documenting, conserving, developing, preserving, or working with endangered languages.

Slack Group

We have a Slack group for live discussion. Join Us Here!

Publication

A white paper describing this repository was published at the LREC 2016 CCURL Workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the papers folder. Download the raw paper here: Open Source Code Serving Endangered Languages.

Contribute

To edit this list on GitHub, simply click here. If you would like to discuss anything at all related to this, please open an issue. If you know of any resource available that is not on this list, please add it, either using the link above or by submitting pull requests.

There are more details on contributing in the CONTRIBUTING guide.

If you're interested in discussing the list in some offline capacity, get in touch with @RichardLitt. I'd be more than happy to have a phone call or email exchange.

Definitions

Endangered languages are human languages that are in danger of extinction. This list also encompasses minority languages, which are spoken by a stable but small population (for example, Maltese or Hawaiian), and low- or under-resourced languages, which may be spoken by a large population but are under-represented digitally (for instance, Quechua). These languages share certain characteristics; the most pertinent are sparse data and a lack of resources, ranging from spell-checkers to grammars to machine translation corpora. Other under-resourced languages that do not fall under this list include constructed languages (for instance, Klingon or Na'vi), computer languages (for instance, JavaScript or Lua), and extinct languages whose records are so sparse as to be computationally irrelevant for most purposes (for instance, Tocharian).

Open Source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated towards a language or project that is not open source are spent at the expense of possible extensibility elsewhere.

This list used to be named endangered-languages. It was renamed because endangerment is a loaded term that may not reflect the views of the communities speaking minority languages. The name low-resource-languages focuses this list on a lack of digital resources compared to other, high-resource languages.

Tools which are built for these languages are not included (unless relevant for dialects or variants): Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, Flemish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Norwegian (Bokmål), Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Valencian, Vietnamese. This list comes from the list of most popular content languages for websites, on this Wikipedia page. Other metrics could be used - if you have another one, please suggest it!
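The exclusion rule above can be sketched as a simple membership check. This is a minimal illustration only, not tooling that ships with this repository: the names `HIGH_RESOURCE` and `is_excluded`, and the `covers_variant` flag that stands in for the "unless relevant for dialects or variants" exception, are assumptions made for the example.

```python
# Hypothetical sketch of the scope rule; names here are not part of this repository.
# The language names are copied from the exclusion list in the paragraph above.
HIGH_RESOURCE = frozenset(
    name.strip().casefold()
    for name in (
        "Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, "
        "English, Estonian, Finnish, Flemish, French, German, Greek, Hebrew, "
        "Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, "
        "Norwegian, Norwegian (Bokmål), Persian, Polish, Portuguese, Romanian, "
        "Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, "
        "Ukrainian, Valencian, Vietnamese"
    ).split(",")
)


def is_excluded(language: str, covers_variant: bool = False) -> bool:
    """Return True if a tool built for `language` falls outside this list.

    `covers_variant` models the exception for tools that are relevant to a
    dialect or variant of an otherwise excluded language.
    """
    if covers_variant:
        return False
    return language.strip().casefold() in HIGH_RESOURCE
```

A contributor could call `is_excluded("English")` before proposing a tool. Whether a dialect or variant qualifies remains a judgment call that a check like this cannot make.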

This list is particularly good at one thing: showing, in general terms, the kinds of tools that exist in the field. It is less suited to in-depth research into a specific language or tool suite. For instance, listing all of the Firefox language packs or Apertium language modules for each low resource language would be unhelpful, as would including all of the tools available for Basque noted in the ACL Wiki, which would mainly mean cataloguing tools from the IXA group, some of which are open source and some of which are not. Instead, view this list as a starting point for more research.

Looking for resources for code languages? Take a look at the awesome lists collection.

Generic Repositories

Single language lexicography projects and utilities

Utilities

Software

Keyboard Layout Configuration Helpers

Annotation

Format Specifications

i18n-related Repositories

Audio automation

Text-to-Speech (TTS)

Automatic Speech Recognition (ASR)

Text automation

Experimentation

Flashcards

Natural language generation

Computing systems

Android Applications

Chrome Extensions

FieldDB

FieldDB is actively developed by the FieldDB group (formerly known as OpenSourceFieldlinguistics). These repos explicitly work with it but could be repurposed for other projects.

FieldDB Webservices/Components/Plugins

Academic Research Paper-Specific Repositories

Example Repositories

These are repositories that are generally only interesting for training purposes or seeing how something is done.

Fonts

Corpora

These corpora are useful for working with tools on endangered languages. Monolingual corpora that are more for archival efforts should most likely not be included here.

Organizations

On GitHub

Other OSS Organizations

Tutorials

Language Specific Projects

For each language, we include the ISO 639-3 code and the main autonym for that language.
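Entries in this section follow a `code :: autonym` pattern, so they are easy to read programmatically. The sketch below is one possible way to parse such a line. `LanguageEntry` and `parse_entry` are hypothetical names introduced for this example, and the check only validates the shape of the code (three lowercase ASCII letters), not whether it is actually registered.

```python
from typing import NamedTuple, Optional


class LanguageEntry(NamedTuple):
    iso_code: str  # three-letter language code, e.g. "eus"
    autonym: str   # the language's own name for itself, e.g. "euskara"


def parse_entry(line: str) -> Optional[LanguageEntry]:
    """Parse a 'code :: autonym' line; return None if the line does not match."""
    code, sep, autonym = line.partition("::")
    code, autonym = code.strip(), autonym.strip()
    # Shape check only: three lowercase ASCII letters, plus a non-empty autonym.
    looks_like_code = (
        len(code) == 3 and code.isascii() and code.isalpha() and code.islower()
    )
    if not sep or not looks_like_code or not autonym:
        return None
    return LanguageEntry(code, autonym)


# Headings such as "Basque" have no "::" separator and are skipped.
sample = ["Basque", "eus :: euskara", "Amharic", "amh :: አማርኛ"]
entries = [e for e in map(parse_entry, sample) if e is not None]
```

Keeping the autonym as an opaque string means non-Latin scripts and comma-separated alternatives pass through unchanged.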

Afrikaans

afr :: Afrikaans

Albanian

sqi :: shqip

Alutiiq

ems :: sugpiaq

Amharic

amh :: አማርኛ

Basque

eus :: euskara

Bengali

ben :: বাংলা

Chichewa

nya :: chicheŵa

Galician

glg :: galego

Apertium

Georgian

kat :: ქართული

Fonts

Internationalization and Localization (i18n/l10n)

Guarani

grn :: Guarani

Hausa

hau :: هَرْشَن هَوْسَ

Hindi

hin :: हिन्दी

Høgnorsk

nno :: Høgnorsk

Icelandic

isl :: íslenska

Inuktitut

iku :: Inuktitut

Irish

gle :: Gaeilge

Kinyarwanda

kin :: Ikinyarwanda

Kurdish

kur :: Kurdî

Lingala

lin :: Lingála

Lushootseed

lut :: Lushootseed

Malay

msa :: Bahasa Melayu

Malagasy

mlg :: Malagasy

Manx

glv :: Gaelg

Migmaq

mic :: Mi'kmaq

Minderico

drc :: Piação do Ninhou

Nishnaabe

oji :: Ojibwe, Odawa, Chippewa, Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ

Oromo

orm :: Oromo

Quechua

que :: Runa Simi

Sami

sma :: Sámi/Saami

Scottish Gaelic

gla :: Gàidhlig

Secwepemctsín

shs :: Secwepemctsín

Somali

som :: Soomaaliga

Tigrinya

tir :: ትግርኛ

Uralic

urj :: Uralic languages

Zulu

zul :: isiZulu

License

License: CC BY-SA 4.0 © Richard Littauer 2014-2017