Languages extraction

We should find a way to standardize the extraction of (official) languages for each country. The World Factbook lists the languages spoken in each country, both official and unofficial. However, the data is not consistently structured, which makes it difficult to extract a list of languages.

For example, for most countries the structure is like this:

Georgian (official) 87.6%, Azeri 6.2%, Armenian 3.9%, Russian 1.2%, other 1%; note - Abkhaz is the official language in Abkhazia (2014 est.)

Therefore, in this case the official languages can be extracted by taking the language name immediately to the left of each "(official)" string.
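A minimal sketch of that idea in Python, assuming the language name never contains a comma, semicolon, or parenthesis. The function name `official_from_marker` is hypothetical and not part of the project:

```python
import re

def official_from_marker(text):
    """Return the names that appear directly before an "(official)" marker."""
    # Capture the shortest run of characters (no commas, semicolons, or
    # parentheses) that is followed by optional spaces and "(official)".
    matches = re.findall(r"([^,;()]+?)\s*\(official\)", text)
    return [name.strip() for name in matches]

georgia = ("Georgian (official) 87.6%, Azeri 6.2%, Armenian 3.9%, "
           "Russian 1.2%, other 1%; note - Abkhaz is the official "
           "language in Abkhazia (2014 est.)")
print(official_from_marker(georgia))
```

This only recognizes the exact marker "(official)", so an entry that words the marker differently would be missed.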

However, there are other cases like this:

Asante 16%, Ewe 14%, Fante 11.6%, Boron (Brong) 4.9%, Dagomba 4.4%, Dangme 4.2%, Dagarte (Dagaba) 3.9%, Kokomba 3.5%, Akyem 3.2%, Ga 3.1%, other 31.2% (2010 est.)

note: English is the official language

or like this: English (used in schools and for official purposes), Spanish, Italian, Portuguese

A deeper analysis is needed to find the best way to extract all languages.
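As a starting point for that analysis, here is a rough sketch that tries the three patterns shown above in order. It is an assumption-laden illustration rather than a complete solution: the function name `extract_official` is hypothetical, and the patterns are derived only from the three examples in this issue, so other countries may need additional rules.

```python
import re

def extract_official(text):
    """Best-effort extraction of official languages from a Factbook string."""
    # Case 1: explicit marker, e.g. "Georgian (official) 87.6%".
    inline = re.findall(r"([^,;()]+?)\s*\(official\)", text)
    if inline:
        return [name.strip() for name in inline]

    # Case 2: trailing note, e.g. "note: English is the official language".
    note = re.search(
        r"note\s*[:-]\s*(.+?)\s+(?:is|are) the official languages?", text
    )
    if note:
        parts = re.split(r",|\band\b", note.group(1))
        return [p.strip() for p in parts if p.strip()]

    # Case 3: parenthetical that merely mentions "official",
    # e.g. "English (used in schools and for official purposes)".
    loose = re.findall(r"([^,;()]+?)\s*\([^()]*\bofficial\b[^()]*\)", text)
    return [name.strip() for name in loose]

ghana = ("Asante 16%, Ewe 14%, Fante 11.6%, other 31.2% (2010 est.)\n"
         "note: English is the official language")
print(extract_official(ghana))
print(extract_official(
    "English (used in schools and for official purposes), Spanish, Italian, Portuguese"
))
```

Trying the strictest pattern first keeps the looser fallbacks from overriding an explicit "(official)" marker. An empty list means none of the three patterns matched and the entry needs manual review.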

aldotele / globon

Languages extraction #80