autotyp / autotyp-data

AUTOTYP data export
Creative Commons Attribution 4.0 International

languages with more than one area #49

Open HedvigS opened 1 year ago

HedvigS commented 1 year ago

In the table Register.csv, some languages share a Glottocode but are associated with more than one area.

oira1263 for example is associated with both Inner Asia and Oceania. This seems to be because one of them should have the glottocode kalm1243, not oira1263 (LID = 1343).

There are 12 cases like this. I think each one should be checked, and the Glottocode and ISO 639-3 code probably changed.

 1 oira1263   
 2 toho1245   
 3 tibe1272   
 4 indo1316   
 5 kyer1238   
 6 balk1252   
 7 east2295   
 8 kati1270   
 9 mart1256   
10 noga1249   
11 taha1241   
12 peri1253

Here's a way of finding them in R:

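library(tidyverse)

# Keep one row per Glottocode/Area pair, then keep only Glottocodes that
# still occur more than once, i.e. that are linked to more than one area.
AUTOTYP <- read_csv("data/csv/Register.csv", col_types = cols()) %>%
  distinct(Glottocode, Area, .keep_all = TRUE) %>%
  mutate(dup = duplicated(Glottocode) | duplicated(Glottocode, fromLast = TRUE)) %>%
  filter(dup)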
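A variant of the snippet above that might be easier to review gives one row per Glottocode, listing its areas and LIDs. This is only a sketch: the `register` data frame is a toy stand-in with placeholder values, and only the column names `LID`, `Glottocode` and `Area` come from this thread. To run it on the real data, `register` could instead be read from `data/csv/Register.csv`.

```r
library(dplyr)

# Toy stand-in for Register.csv; the values are placeholders, not real AUTOTYP rows.
register <- data.frame(
  LID        = c(1, 2, 3, 4, 5),
  Glottocode = c("aaaa1111", "aaaa1111", "bbbb2222", "bbbb2222", "cccc3333"),
  Area       = c("Area X", "Area Y", "Area X", "Area X", "Area Z")
)

# One row per Glottocode that occurs in more than one distinct area,
# with the areas and LIDs collapsed into readable strings.
multi_area <- register %>%
  filter(!is.na(Glottocode)) %>%          # ignore rows without a Glottocode
  group_by(Glottocode) %>%
  summarise(
    n_areas = n_distinct(Area),
    areas   = paste(sort(unique(Area)), collapse = "; "),
    lids    = paste(LID, collapse = ", "),
    .groups = "drop"
  ) %>%
  filter(n_areas > 1)

print(multi_area)
```

In the toy data only `aaaa1111` is returned, because `bbbb2222` is duplicated within a single area.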

Some of them make sense, like Tuareg (Air) (LID = 1420) and Tuareg (Ghat) (LID = 1421). The longitude and latitude of these varieties probably justify the different areas.
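For the cases that do need correcting, the recoding could be sketched with a small corrections table keyed by LID. This assumes that LID 1343 is the entry that should carry kalm1243, as suggested above. The `corrections` table, the `Glottocode_new` column, LID 9999 and the area values are all hypothetical and used only for illustration.

```r
library(dplyr)

# Toy stand-in for Register.csv. LID 1343 and the two Glottocodes come from
# this thread; LID 9999 and the Area values are placeholders.
register <- data.frame(
  LID        = c(1343, 9999),
  Glottocode = c("oira1263", "oira1263"),
  Area       = c("Area X", "Area Y")
)

# Hypothetical corrections table: one row per LID whose Glottocode should change.
corrections <- data.frame(
  LID            = 1343,
  Glottocode_new = "kalm1243"
)

# Use the corrected Glottocode where one exists, otherwise keep the original.
register_fixed <- register %>%
  left_join(corrections, by = "LID") %>%
  mutate(Glottocode = coalesce(Glottocode_new, Glottocode)) %>%
  select(-Glottocode_new)

print(register_fixed)
```

Keeping the corrections in a separate table would leave the original register untouched and make each change easy to review. The ISO 639-3 codes could be handled the same way with an extra column.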

HedvigS commented 1 year ago

(There are also 132 duplicate entries that share the same AUTOTYP area. They seem to represent different data-collection events. Is that right? For example, LID 148 and 579.)
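A rough sketch of how the same-area duplicates could be listed, assuming a duplicate means more than one row with the same `Glottocode` and `Area`. The `register` data frame is again a toy stand-in with placeholder values, not real AUTOTYP rows.

```r
library(dplyr)

# Toy stand-in for Register.csv; the values are placeholders.
register <- data.frame(
  LID        = c(1, 2, 3, 4, 5),
  Glottocode = c("aaaa1111", "aaaa1111", "bbbb2222", "bbbb2222", "cccc3333"),
  Area       = c("Area X", "Area Y", "Area X", "Area X", "Area Z")
)

# Keep every row whose Glottocode/Area combination occurs more than once.
same_area_dups <- register %>%
  filter(!is.na(Glottocode)) %>%          # ignore rows without a Glottocode
  group_by(Glottocode, Area) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(Glottocode, LID)

print(same_area_dups)
```

In the toy data this returns LIDs 3 and 4, which share both a Glottocode and an area.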