Check out our interactive website: The XPF Corpus
The preliminary manual of the corpus can be found here.

- `./Code` contains the scripts needed to obtain phoneme translation statistics.
- `./Data` contains language-specific information: each language's profile and phonemic grammar.
- `./docs` contains only the files needed for the website.
- `./Guidelines` and `./Manual` contain documentation on the corpus and its curation.

Language Code | Language (click for info) | Reason (more thorough explanation in Rmd files) | Comments |
---|---|---|---|
acr | Rabinal Achi' | suspect marking of vowel length | lacks lenition |
ake | Akawaio | conflation between voiceless and voiced consonants | |
amp | Alamblak | conflation between /ɘ/ and /o/ | |
aoj | Mufian | conflation among vowels; ambiguity regarding vowel length and labialized consonant clusters | lacks lenition |
ar | Arabic | ambiguous transcription of alif; conflation between vowels and glides | |
arn | Mapudungun | ambiguous orthography; conflation between dental and alveolar consonants | |
awx | Awara | conflation between /nd/, /mb/, /nɡ/ and /d/, /b/, /ɡ/, respectively | |
bcl | Central Bikol | inconsistent marking of glottal stops | lacks lenition |
bmu | Somba Siawari | phonetic alphabet | |
btx | Batak Karo | conflation among /e/, /ɘ/, and /ɯ/ | |
bzd | Bribri | phonetic alphabet; contradicting documentation | |
bzh | Mapos Buang | conflation between /ɛ/ and other vowels | |
ca | Catalan | conflation among vowels and glides; ambiguous phonological interpretations | |
cav | Cavineña | ambiguity whether a digraph represents one phoneme or two, depending on syllable structure | lacks lenition |
chf | Tabasco Chontal | conflation between ejectives and stop-glottal stop sequences | |
chm | Mari | conflation with some palatalized and non-palatalized consonants; some vowels not always represented orthographically | lacks lenition |
cho | Choctaw | phonetic alphabet | |
cni | Asháninka | conflation among nasals | |
cof | Colorado | orthographic ambiguity with glottal stops | |
con | Cofan | conflation between consonants | |
crm | Moose Cree | /h/ represented only when contrast is required | lacks lenition |
dyo | Jola-Fogny | uncertainty around the marking of +ATR vowels | lacks lenition |
es | Spanish | non-transparent transcription of diphthongs | |
fuv | Nigerian Fulfulde | inconsistent marking of glottal stops; unclear transcription of palatalized glottal stop | |
hi | Hindi | conflation between /æ/ and /ɛ/; vowel nasalization ambiguity; unreliable marking of some consonants | |
id | Indonesian | conflation between /e/ and /ə/ | |
ixl | Ixil | word-initial glottal stop not always marked; somewhat ambiguous orthography | |
kea | Cape Verdean Creole | possible conflation between /a/ and /ɐ/, /e/ and /ɛ/, and /ɾ/ and /ʀ/ | lacks lenition |
kek | Qeqchi | ambiguity between ejective stops and stop-glottal stop sequences | |
kk | Kazakh | conflation between vowels and glides; widely contradicting phonological accounts of the language | |
kmo | Kwoma | non-transparent transcription of glottal stops | |
kyz | Kayabí | conflation between /i/ and /j/ | lacks lenition |
mcf | Matsés | conflation between alveolar and retroflex consonants; conflation between vowels | |
mek | Mekeo | non-transparent transcription of glottal stops | |
mfe | Morisyen | highly suspect orthography; conflation among consonants | |
ml | Malayalam | conflation between dental and alveolar /n/ | |
mlp | Bargam | conflation between /n/ and /ŋ/ | lacks lenition |
mnb | Muna | suspect orthography | |
mpx | Misima-Panaeati | conflation between /e/ and /ɛ/ and between /o/ and /ɔ/ | lacks lenition |
mt | Maltese | conflation between /ts/ and /dz/ and between /ʃ/ and /ʒ/ | |
myv | Erzya | conflation between /n/ and /ŋ/ | lacks lenition |
ne | Nepali | certain diacritics used interchangeably and inconsistently marked | |
not | Nomatsiguenga | conflation among nasals | |
or | Oriya | certain diacritics used interchangeably and inconsistently marked | |
os | Ossetic | conflation among /u/, /w/, and /ʷ/; inconsistent marking of consonant gemination | |
pag | Pangasinan | possible conflation between /ŋ/ and /nɡ/ | |
pib | Yine | conflation between /n/ and /h̃/ | lacks lenition |
plu | Palikúr | conflation between /ɡ/ and /ɣ/ | |
qub | Huallaga Huanuco Quechua | suspect orthography; conflation between vowels and glides | |
rwo | Rawa | conflation between /l/ and /r/ | |
sah | Yakut | conflation between /j/ and /j̃/ | |
sk | Slovak | non-transparent transcription of palatal consonants; ambiguity whether digraphs represent one phoneme or two | |
sm | Samoan | marking of long vowels and glottal stops is suspect | |
suz | Sunwar | conflation between /ɾ/, /ɭ/, and possibly /l̪/; inconsistent marking of glottal stops | |
sw | Swahili | conflation between syllabic nasals and non-syllabic counterparts | |
too | Xicotepec de Juárez Totonac | suspect transcription due to unclear documentation | |
tpp | Pisaflores Tepehua | suspect marking of vowel length | |
tzj | Tz'utujil | uncertainty around the marking of the glottal stop and the orthography | |
tzm | Central Atlas Tamazight | conflation between /l̪/ and /l̪ˤ/, and between /ʒ/ and /ʒˀ/ | |
wmw | Mwani | conflation between syllabic nasals and prenasalized stops | lacks lenition |
zsm | Standard Malay | conflation between /e/ and /ə/; conflicting orthographies | |
zza | Zaza | conflicting orthographies; conflation among vowels |

Language Code | Language | Reason |
---|---|---|
ace | Acehnese | non-transparent transcription of vowel nasalization |
ach | Acholi | non-transparent transcription of tones |
acu | Achuar-Shiwiar | non-transparent transcription of vowel nasalization |
adh | Adhola | non-transparent transcription of tones |
af | Afrikaans | non-transparent transcription of vowels, vowel length, and diphthongs |
agd | Agarabi | non-transparent transcription of tones |
agm | Angaataha | non-transparent transcription of tones |
agr | Aguaruna | non-transparent transcription of vowel nasalization |
ak | Akan | non-transparent transcription of tones |
alq | Algonquin | non-transparent transcription of vowel length |
am | Amharic | non-transparent transcription of consonant gemination |
anv | Denya | non-transparent transcription of tones |
as | Assamese | non-transparent transcription of vowels |
aso | Dano | non-transparent transcription of tones |
avt | Avar | non-transparent transcription of consonant gemination |
ban | Bali | non-standardized orthography |
bem | Bemba | non-transparent transcription of tones |
bba | Bariba | non-transparent transcription of tones |
bcw | Bana | non-transparent transcription of tones |
bhl | Bimin | non-transparent transcription of tones |
bm | Bambara | non-transparent transcription of tones |
bmr | Muinane | non-transparent transcription of tones |
bs | Bosnian | non-transparent transcription of vowel length and tones |
bsn | Barasana-Eduria | non-transparent transcription of tones |
bua | Buryat | non-transparent transcription of palatalization |
byr | Baruya | non-transparent transcription of tones |
cao | Chácobo | non-transparent transcription of tones |
cax | Chiquitano | non-transparent transcription of vowel nasalization |
cbc | Carapan | non-transparent transcription of tones |
ce | Chechen | non-transparent transcription of vowel length |
ceb | Cebuano | non-transparent transcription of vowel length |
chr | Cherokee | non-transparent transcription of vowel length |
cwk | Western Kaqchikel | non-transparent transcription of vowels |
cnh | Haka Chin | non-transparent transcription of tones |
coe | Koreguaja | non-transparent transcription of tones |
ctd | Tedim Chin | non-transparent transcription of tones |
cub | Cubeo | non-transparent transcription of tones |
cuk | San Blas Kuna | non-transparent transcription |
cy | Welsh | non-transparent transcription of vowel length |
da | Danish | non-transparent transcription of vowels |
daa | Dangaléat | non-transparent transcription of tones |
des | Desano | non-transparent transcription of tones |
dgo | Dogri | non-transparent transcription of tones |
din | Dinka | non-transparent transcription of tones |
dts | Toro So Dogon | non-transparent transcription of tones |
dz | Dzongkha | non-transparent transcription |
ee | Ewe | non-transparent transcription of tones |
efi | Efik | non-transparent transcription of tones |
emp | Northern Emberá | non-transparent transcription |
enb | Markweeta | non-transparent transcription of tones |
enq | Enga | non-transparent transcription of tones |
et | Estonian | non-transparent transcription of contrastive syllable length |
faa | Fasu | non-transparent transcription of tones |
fi | Finnish | non-transparent transcription |
fj | Fijian | non-transparent transcription of vowel length |
fo | Faroese | non-transparent transcription of vowels |
for | Fore | non-transparent transcription of tones |
fur | Friulian | non-transparent transcription of vowels |
fy | Frisian | non-transparent transcription of vowels |
ga | Irish | non-transparent transcription |
gah | Alekano | non-transparent transcription of tones |
gd | Scottish Gaelic | non-transparent transcription of consonants and vowels |
gl | Galician | non-transparent transcription |
gmo | Gamo-Gofa-Dawro | three languages understood to be linguistically separate |
grb | Grebo | non-transparent transcription of tones |
grt | Garo | non-transparent transcription of vowels |
gub | Guajajara | non-transparent transcription of vowel nasalization |
gum | Guambiano | non-standardized orthography |
gur | Farefare | non-transparent transcription of tones |
gv | Manx Gaelic | non-transparent transcription of consonants and vowels |
ha | Hausa | non-transparent transcription of vowel length |
hbs | Serbo-Croatian | non-transparent transcription of tones |
hch | Huichol | non-transparent transcription of tones |
heh | Hehe | non-transparent transcription of tones |
hr | Croatian | non-transparent transcription of vowel length |
hub | Huambisa | non-transparent transcription of vowel nasalization |
hui | Huli | non-transparent transcription of tones |
huv | Huave | inconsistent phonological documentation |
hz | Herero | non-transparent transcription of tones |
ig | Igbo | non-transparent transcription of tones |
ik | Inupiaq | insufficient tokens |
is | Icelandic | non-transparent transcription of vowel length |
jiv | Shuar | non-transparent transcription of vowel nasalization |
kab | Kabyle | non-transparent transcription of consonants |
kac | Jingpho | non-transparent transcription of tones |
kaq | Capanahua | non-transparent transcription of tones |
kbc | Kadiweu | non-transparent transcription of consonant gemination |
kbr | Kafa | non-transparent transcription of tones |
kha | Khasi | non-transparent transcription of vowel length |
khk | Khalkha Mongolian | non-transparent transcription of vowels |
ki | Gikuyu | non-transparent transcription of tones |
kj | Kwanyama | non-transparent transcription of tones |
kjs | East Kewa | non-transparent transcription of tones |
kew | West Kewa | non-transparent transcription of tones |
kmr | Northern Kurdish | non-transparent transcription of consonants |
kmu | Kanite | non-transparent transcription of tones |
ksd | Kuanua | non-transparent transcription of vowel length |
kus | Kusaal | non-transparent transcription of tones and vowel length |
kw | Cornish | non-transparent transcription of vowel length |
lac | Lacandon | non-transparent transcription of vowel length |
lb | Luxembourgish | non-transparent transcription of vowels |
lef | Lelemi | non-transparent transcription of tones |
lg | Luganda | non-transparent transcription of tones |
ln | Lingala | non-transparent transcription of tones |
loz | Lozi | non-transparent transcription of tones |
lt | Lithuanian | non-transparent transcription of tones |
luo | Dholuo | non-transparent transcription of tones |
lus | Mizo | non-transparent transcription of tones |
lv | Latvian | non-transparent transcription of tones |
lvs | Standard Latvian | non-transparent transcription of tones |
lwo | Luwo | non-transparent transcription of tones and breathy vowels |
man | Mandingo | non-transparent transcription of tones |
mas | Maasai | insufficient tokens |
mcb | Machiguenga | non-transparent transcription of tones |
mcd | Sharanahua | non-transparent transcription of tones |
meu | Motu | non-transparent transcription of vowel length |
mfi | Wandala | non-transparent transcription of tones |
mfz | Mabaan | non-transparent transcription of tones |
mhr | Eastern Mari | non-transparent transcription of palatalization |
mi | Maori | non-transparent transcription of vowel length |
miq | Miskito | non-transparent transcription of vowel nasalization and length |
mni | Meitei | non-transparent transcription of tones |
mos | Mossi | non-transparent transcription of tones |
mps | Dadibi | non-transparent transcription of tones and vowel nasalization |
mpt | Mian | non-transparent transcription of tones |
ms | Malay | non-transparent transcription of vowels |
my | Burmese | non-transparent transcription of tones |
myu | Mundurukú | non-transparent transcription of tones and creaky vowels |
myy | Macuna | non-transparent transcription of tones |
nd | Northern Ndebele | insufficient tokens |
nds | Low Saxon | non-transparent transcription |
nfr | Nafaanra | non-transparent transcription of tones |
nhg | Tetelcingo Nahuatl | non-transparent transcription of vowel length |
no | Norwegian | non-transparent transcription of tones and vowel length |
ntp | Northern Tepehuan | non-transparent transcription of tones |
nv | Navajo | non-transparent transcription of vowel nasalization |
ny | Chichewa | non-transparent transcription of tones |
nyn | Nyankore | non-transparent transcription of tones |
om | Oromo | non-transparent transcription of tones |
opm | Oksapmin | non-transparent transcription of vowels |
ood | Tohono O'odham | non-transparent transcription |
ots | Estado de México Otomi | non-transparent transcription of tones |
pab | Parecís | non-transparent transcription of vowel length and nasalization |
pao | Northern Paiute | non-transparent transcription of vowel length |
pap | Papiamentu | non-transparent transcription of vowels |
pir | Wanano | non-transparent transcription of tones |
pl | Polish | non-transparent transcription |
pms | Piedmontese | non-transparent transcription |
poh | Poqomchi' | insufficient documentation |
rw | Kinyarwanda | non-transparent transcription of tones and vowel length |
sd | Sindhi | non-transparent transcription of vowels |
se | Northern Sami | non-transparent transcription |
sg | Sango | non-transparent transcription of tones |
sim | Mende | non-transparent transcription of tones |
sll | Salt-Yui | non-transparent transcription of tones |
sn | Shona | non-transparent transcription of tones |
so | Somali | non-transparent transcription of tones |
soq | Kanasi | non-transparent transcription of glottal stops |
spp | Supyire Senoufo | non-transparent transcription of tones |
ss | Swati | non-transparent transcription of tones |
st | Sesotho | non-transparent transcription of tones |
sv | Swedish | non-transparent transcription |
swp | Suau | non-transparent transcription |
sxb | Suba | non-transparent transcription of tones |
tav | Tatuyo | non-transparent transcription of tones |
tcc | Datooga | non-transparent transcription of tones |
tcy | Tulu | non-transparent transcription of vowels |
tcz | Thadou Chin | non-transparent transcription of tones |
ti | Tigrinya | non-transparent transcription of gemination |
tk | Turkmen | non-transparent transcription of vowel length |
tl | Tagalog | non-transparent spelling of vowel length |
tn | Tswana | non-transparent transcription of tones |
toi | Tonga | non-transparent transcription of tones |
trp | Kok Borok | non-transparent transcription of tones |
ts | Tsonga | non-transparent transcription of tones |
ttc | Tekiteko | non-transparent transcription of vowel length |
tuf | Central Tunebo | non-transparent transcription of contrastive features (first syllable) |
tw | Twi | non-transparent transcription of tones |
ubu | Umbu-Ungu | non-transparent transcription of tones |
udu | Uduk | non-transparent transcription of tones |
ur | Urdu | non-transparent transcription of vowels |
ura | Urarina | non-transparent transcription of tones |
usp | Uspanteko | non-transparent transcription of tones |
ve | Venda | non-transparent transcription of tones |
vro | Võro | non-transparent transcription of vowels and palatalization |
wa | Walloon | non-transparent transcription |
wal | Wolaytta | non-transparent transcription of tones |
war | Waray-Waray | insufficient documentation |
wiu | Wiru | non-transparent transcription of tones |
xal | Kalmyk-Oirat | non-transparent transcription of vowels |
xav | Xavánte | non-transparent transcription of vowel length |
xbi | Kombio | non-transparent transcription of vowels |
xh | Xhosa | non-transparent transcription of tones |
xla | Kamula | non-transparent transcription of vowels and tones |
xsr | Sherpa | insufficient documentation |
yaa | Yaminahua | non-transparent transcription of tones |
yad | Yagua | non-transparent transcription of tones |
yby | Yaweyuha | non-transparent transcription of tones |
yo | Yoruba | non-transparent transcription of tones |
zai | Zapotec | non-transparent transcription of tones |
zca | Coatecas Altas Zapotec | non-transparent transcription of tones |
zpi | Santa María Quiegolani Zapotec | non-transparent transcription of tones |
zpq | Zoogocho Zapotec | non-transparent transcription of tones |
zu | Zulu | non-transparent transcription of tones |
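
The tables above are plain pipe-delimited text, so they can be tallied with a few lines of Python. The following is a minimal sketch and is not part of the scripts in `./Code`: the helper names `parse_rows` and `tally_reasons` are hypothetical, and the sample rows are copied from the table above rather than read from any file in this repository.

```python
from collections import Counter

# A few rows copied from the table above. Where the full table text is
# stored is not assumed here; pass in whatever text you have.
SAMPLE_ROWS = """\
Language Code | Language | Reason |
---|---|---|
ace | Acehnese | non-transparent transcription of vowel nasalization |
ach | Acholi | non-transparent transcription of tones |
bem | Bemba | non-transparent transcription of tones |
ik | Inupiaq | insufficient tokens |
"""


def parse_rows(text):
    """Split pipe-delimited table rows into (code, language, reason) tuples."""
    records = []
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Skip lines with too few cells (including blank lines).
        if len(cells) < 3 or not cells[0]:
            continue
        # Skip the header row and the ---|--- separator row.
        if cells[0] == "Language Code" or set(cells[0]) <= {"-"}:
            continue
        records.append((cells[0], cells[1], cells[2]))
    return records


def tally_reasons(records):
    """Count how many languages share each stated reason."""
    return Counter(reason for _, _, reason in records)


if __name__ == "__main__":
    for reason, count in tally_reasons(parse_rows(SAMPLE_ROWS)).most_common():
        print(f"{count:3d}  {reason}")
```

Only the first three cells of each row are kept, so the same helper also reads the four-column table and simply ignores its Comments column.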