clld / glottolog3

glottolog2 re-implemented as CLLD app
MIT License
110 stars 27 forks source link

Some gbid outliers and full urls instead of ids #77

Closed xflr6 closed 6 years ago

xflr6 commented 6 years ago
import pandas as pd
query = '''SELECT length(coalesce(gbid, '')), count(*),
  min(gbid), 'http://glottolog.org/resource/reference/id/' || min(id) AS url
FROM source JOIN ref USING (pk)
GROUP BY length(coalesce(gbid, '')) ORDER BY count DESC'''
pd.read_sql_query(query, 'postgresql://postgres@/glottolog3', index_col='length')
Out[1]: 
         count                                                min  \
length                                                              
0       297525                                               None   
45        8944      http://books.google.com/books?id=--eOmgEACAAJ   
12        8601                                       -_lCPgAACAAJ   
54           5  http://www.google.com/books/feeds/volumes/3RIM...   

                                                      url  
length                                                     
0           http://glottolog.org/resource/reference/id/10  
45      http://glottolog.org/resource/reference/id/100004  
12         http://glottolog.org/resource/reference/id/100  
54      http://glottolog.org/resource/reference/id/104284  

The full urls of length 45 are accounted for, but cases like 104284 yield broken links.

Clean this on import, so we always save the 12-character id?