anthill / open-moulinette

Scripts to clean Open-Data.
MIT License

Updated all links #44

Closed Asphahyre closed 7 years ago

Asphahyre commented 7 years ago

Corrected dead links. Added data from previous years when available.

Asphahyre commented 7 years ago

Need to update mk_data.py script. Work in progress. Please do not merge for now.

Asphahyre commented 7 years ago

I took the liberty of renaming the original CODGEO, LIBGEO and LIBCOM columns to IRIS, LIB_IRIS and LIB_COM respectively, since the data/equip-* files already contain those columns as is, and the new names are more explicit.
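A minimal sketch of that renaming with pandas. The DataFrame, its values and the POP column are hypothetical and only illustrate the mapping; this is not the actual mk_data.py code.

```python
import pandas as pd

# Old column names mapped to the names already used by the data/equip-* files.
RENAMES = {"CODGEO": "IRIS", "LIBGEO": "LIB_IRIS", "LIBCOM": "LIB_COM"}

# Hypothetical one-row frame standing in for a census file.
df = pd.DataFrame(
    {"CODGEO": ["A1"], "LIBGEO": ["Some iris"], "LIBCOM": ["Some town"], "POP": [100]}
)

# rename() only touches the columns listed in the mapping and returns a new frame.
df = df.rename(columns=RENAMES)
```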

I used this line to remove duplicate IRIS entries (in case an IRIS changed or was renamed):

data.drop_duplicates(subset='IRIS', keep='first', inplace=True)

I don't understand how it worked before without this line, since I needed it, but it could probably be done in a much cleaner way.

Thanks to @armgilles for his explanations on the mk_data.py.

armgilles commented 7 years ago

Using this PR to give some details.

Do not use data.drop_duplicates(subset='IRIS', keep='first', inplace=True): we lose data this way.
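A small sketch of how data gets lost, using made-up rows (the IRIS codes, labels and the POP column are hypothetical). When two rows share an IRIS code but carry different values, keep='first' silently discards the second one:

```python
import pandas as pd

# Hypothetical rows: the same IRIS code appears twice with different content.
data = pd.DataFrame(
    {
        "IRIS": ["A1", "A1", "B2"],
        "LIB_IRIS": ["Old name", "New name", "Other"],
        "POP": [100, 120, 50],
    }
)

# Non-inplace variant of the line discussed above, so both frames can be compared.
deduped = data.drop_duplicates(subset="IRIS", keep="first")

# Rows that were present before deduplication and are gone afterwards.
dropped = data.loc[~data.index.isin(deduped.index)]
```

Here dropped holds the "New name" row, whose POP value no longer appears anywhere in deduped.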

The problem comes from the Census files and the other files.

The key used to merge the files together was originally:

key = ['IRIS', 'LIB_IRIS', 'COM', 'LIB_COM', 'REG', 'DEP']

But we have duplicate IRIS entries:

image

Using a new key for merging, such as ['IRIS', 'REG', 'DEP'],

we merge these lines with no loss of information (right or left).
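A hedged sketch of the difference between the two keys. It assumes the duplicates come from label columns (such as LIB_IRIS) that differ between files for the same IRIS code; that cause, the frames and all values below are illustrative assumptions, and the long key is shortened for readability.

```python
import pandas as pd

# Hypothetical census and equipment rows for the same IRIS, with different labels.
census = pd.DataFrame(
    {"IRIS": ["A1"], "LIB_IRIS": ["Old name"], "REG": ["84"], "DEP": ["01"], "POP": [100]}
)
equip = pd.DataFrame(
    {"IRIS": ["A1"], "LIB_IRIS": ["New name"], "REG": ["84"], "DEP": ["01"], "NB_EQUIP": [3]}
)

long_key = ["IRIS", "LIB_IRIS", "REG", "DEP"]  # shortened form of the original key
short_key = ["IRIS", "REG", "DEP"]

# With the label in the key, the rows do not match: the outer merge keeps both,
# so the same IRIS appears twice, each copy half-filled with NaN.
split = census.merge(equip, on=long_key, how="outer")

# With the short key, the rows match and all measurements land on one line.
# The label from the right-hand file is dropped to avoid a duplicated column.
merged = census.merge(equip.drop(columns=["LIB_IRIS"]), on=short_key, how="outer")
```

In this toy case split has two rows for A1 while merged has a single row carrying both POP and NB_EQUIP, so nothing has to be dropped afterwards.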

Doing automatic check for :