autotyp / autotyp-data

AUTOTYP data export
Creative Commons Attribution 4.0 International

Syntactically ill-formed iso-codes plus one glottocode in data/Register.csv #9

Closed: xflr6 closed this issue 2 years ago

xflr6 commented 6 years ago
In [1]: import pandas as pd
   ...: 
   ...: URL = 'https://github.com/autotyp/autotyp-data/raw/master/data/Register.csv'
   ...: 
   ...: ISO = r'[a-z]{3}$'
   ...: GCODE = r'[a-z]{4}[1-9][0-9]{3}$'
   ...: 
   ...: df = pd.read_csv(URL, encoding='utf-8', index_col='LID')
   ...: 
   ...: df.loc[~df['ISO639.3'].str.match(ISO).fillna(True), ['Language', 'Stock', 'ISO639.3']]
Out[1]: 
       Language          Stock ISO639.3
LID                                    
301   Tocharian  Indo-European     tokh
185        Mixe     Mixe-Zoque     mixe
431      Berber         Berber     berb
762       Cuica       Macro-Ge     cuic
764   Esmeralda      Esmeralda     esme
766   (Frisian)  Indo-European     fris
800     Sorbian  Indo-European     sorb
1696      Chaga    Benue-Congo     chag

In [2]: df.loc[~df['Glottocode'].str.match(GCODE).fillna(True), ['Language', 'Stock', 'Glottocode']]
Out[2]: 
    Language         Stock Glottocode
LID                                  
672  Jingpho  Sino-Tibetan    jin1260
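
For anyone who wants to repeat the check without downloading the file, here is a minimal offline sketch of the same idea. It reuses the two regexes from the session above. The helper name `ill_formed`, the LIDs 9001 and 9002, and the German and placeholder rows are assumptions added for illustration and are not rows of Register.csv. Values that the output above does not show are left missing rather than guessed. Passing `na=True` to `str.match` is a swapped-in alternative to `.fillna(True)`; both treat missing codes as acceptable.

```python
import pandas as pd

# Same patterns as in the session above.
ISO = r'[a-z]{3}$'
GCODE = r'[a-z]{4}[1-9][0-9]{3}$'


def ill_formed(df, column, pattern):
    """Return rows whose non-missing value in `column` does not match `pattern`.

    Hypothetical helper for illustration; missing values count as
    well-formed because of na=True.
    """
    ok = df[column].str.match(pattern, na=True).astype(bool)
    return df.loc[~ok, ['Language', column]]


# Illustrative sample: two rows from the report plus two made-up control rows.
sample = pd.DataFrame(
    {
        'Language': ['Tocharian', 'Jingpho', 'German', '(placeholder)'],
        'ISO639.3': ['tokh', None, 'deu', None],
        'Glottocode': [None, 'jin1260', 'stan1295', None],
    },
    index=pd.Index([301, 672, 9001, 9002], name='LID'),  # 9001, 9002 are hypothetical
)

bad_iso = ill_formed(sample, 'ISO639.3', ISO)
bad_gcode = ill_formed(sample, 'Glottocode', GCODE)

print(bad_iso)    # only LID 301 (tokh has four letters)
print(bad_gcode)  # only LID 672 (jin1260 has three letters before the digits)
```

Run against the real file, the same helper could be applied to the frame loaded with `pd.read_csv(URL, encoding='utf-8', index_col='LID')` as in the session above.
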
tzakharko commented 2 years ago

This is fixed in 1.0.0