anthill / open-moulinette

Scripts to clean Open-Data.
MIT License

check before merging data #34

Closed AlexisEidelman closed 8 years ago

AlexisEidelman commented 8 years ago

There are some modifications in the output. There are also things to do for the 2012 table (update the 2011 values, or merge on CODGEO only, I presume).

armgilles commented 8 years ago

Have you made progress?

If I can help, do not hesitate!

AlexisEidelman commented 8 years ago

Yes, I think you can merge that PR, which is harmless.

My other PR on the dev branch is also finished, I think. It may be harder to merge. I suggest you try to merge it (after this one), and then we talk to see what we should improve. What do you think?

armgilles commented 8 years ago

I tried your PR. It seems there are some differences in the output file:

Edit:

If I look at the data with CODGEO == 781230106:

The 2012 census data are OK, by the way:

Maybe a bad merge somewhere. If I look at the 'ARR' and 'CV' features in this PR (for CODGEO == 781230106), they are NaN (they should be 783 and 7816).
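
A minimal sketch of that kind of spot check, assuming the output is loaded as a pandas DataFrame and CODGEO is stored as a string. The helper name `nan_columns_for` and both toy frames are hypothetical; only the values 783 and 7816 come from this thread.

```python
import pandas as pd

def nan_columns_for(df, codgeo, cols):
    """List the columns in `cols` that are NaN for the given CODGEO."""
    rows = df.loc[df["CODGEO"] == codgeo, cols]
    return [c for c in cols if rows[c].isna().any()]

# Toy stand-in for the PR output (hypothetical, not the real file).
pr_output = pd.DataFrame({
    "CODGEO": ["781230106"],
    "ARR": [float("nan")],
    "CV": [float("nan")],
})
# Toy stand-in for the expected row; 783 and 7816 are the values quoted above.
expected = pd.DataFrame({"CODGEO": ["781230106"], "ARR": [783], "CV": [7816]})

broken = nan_columns_for(pr_output, "781230106", ["ARR", "CV"])
healthy = nan_columns_for(expected, "781230106", ["ARR", "CV"])
print(broken, healthy)
```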

AlexisEidelman commented 8 years ago

I think I found it. I forgot some corrections to LIBGEO in the 2012 data. I'll push something ASAP.

Note that, in my opinion, there should be slightly fewer rows in the output anyway, due to minor corrections to LIBGEO.

The PR checks that CODGEO values are unique in the output.
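
A uniqueness check of this kind could look roughly like the sketch below. It is not the code from the PR: the function name `assert_unique_codgeo` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def assert_unique_codgeo(df, col="CODGEO"):
    """Raise ValueError if `col` contains duplicated values."""
    dup_mask = df[col].duplicated(keep=False)  # flag every row of a duplicated code
    if dup_mask.any():
        dups = sorted(df.loc[dup_mask, col].unique())
        raise ValueError(f"{len(dups)} duplicated {col} value(s): {dups[:5]}")

# Toy data, not taken from the real files.
output = pd.DataFrame({
    "CODGEO": ["781230106", "831260503", "831260503"],
    "LIBGEO": ["A", "B", "B (renamed)"],
})

try:
    assert_unique_codgeo(output)
    check_passed = True
    message = ""
except ValueError as err:
    check_passed = False
    message = str(err)

print(check_passed, message)
```

Running the check right before writing the output file makes a bad merge fail loudly instead of silently adding rows.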

AlexisEidelman commented 8 years ago

I'm fixing a minor bug, then it's ready.

AlexisEidelman commented 8 years ago

Done.

armgilles commented 8 years ago

Good job!

I checked the difference out of curiosity. We lost 15 lines (15 IRIS):

In the current output we have 137 duplicated IRIS (122 in your PR), so these 15 lost IRIS make sense.

But if I look closely at our 15 lost IRIS, we lost information because of a bad reference table in the IRIS data... For example, take IRIS 831260503 (one of our 15 lost IRIS):

So we lost the 2012 census data here because of the merging key:
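
The arithmetic here is 137 - 122 = 15, which matches the 15 lost rows. A rough way to run the same comparison, assuming both outputs are pandas DataFrames and that a "duplicate" means a row whose CODGEO repeats an earlier row (the helper name and toy frames are hypothetical):

```python
import pandas as pd

def row_and_duplicate_counts(df, col="CODGEO"):
    """Return (number of rows, number of rows whose `col` repeats an earlier row)."""
    return len(df), int(df[col].duplicated().sum())

# Toy frames standing in for the current output and the PR output.
current = pd.DataFrame({"CODGEO": ["A1", "A1", "B2", "B2", "C3"]})
pr_output = pd.DataFrame({"CODGEO": ["A1", "A1", "B2", "C3"]})

rows_before, dups_before = row_and_duplicate_counts(current)
rows_after, dups_after = row_and_duplicate_counts(pr_output)

lost_rows = rows_before - rows_after
fewer_dups = dups_before - dups_after
print(lost_rows, fewer_dups)
```

If `lost_rows` and `fewer_dups` are equal, the missing rows are fully explained by removed duplicates, though that alone does not show whether the right copy of each row was kept.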

key = ['CODGEO', 'LIBGEO', 'COM', 'LIBCOM', 'REG', 'REG2016', 'LAB_IRIS',
       'DEP', 'UU2010', 'TRIRIS', 'GRD_QUART', 'TYP_IRIS', 'MODIF_IRIS']
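
To illustrate why such a wide key can drop data, here is a small sketch under stated assumptions: the key is shortened to two columns, and the frames, labels, and the `census_2012_value` column are made up. When a label such as LIBGEO differs between the two tables, a left merge on the wide key finds no match, while a merge on CODGEO alone does. The `indicator=True` option of `DataFrame.merge` makes the unmatched rows visible.

```python
import pandas as pd

# Shortened stand-in for the 13-column key above (assumption for illustration).
wide_key = ["CODGEO", "LIBGEO"]

# Toy reference table and toy 2012 census table; labels and values are made up.
ref = pd.DataFrame({
    "CODGEO": ["831260503", "781230106"],
    "LIBGEO": ["Old label", "Same label"],
})
census = pd.DataFrame({
    "CODGEO": ["831260503", "781230106"],
    "LIBGEO": ["New label", "Same label"],
    "census_2012_value": [1200, 3400],
})

# Merge on the wide key: the row whose LIBGEO changed finds no partner.
wide = ref.merge(census, on=wide_key, how="left", indicator=True)
unmatched_wide = wide.loc[wide["_merge"] == "left_only", "CODGEO"].tolist()

# Merge on CODGEO only: every row finds its census data.
narrow = ref.merge(census.drop(columns="LIBGEO"), on="CODGEO",
                   how="left", indicator=True)
unmatched_narrow = narrow.loc[narrow["_merge"] == "left_only", "CODGEO"].tolist()

print(unmatched_wide, unmatched_narrow)
```

Merging on CODGEO alone only works safely if CODGEO is unique in both tables, so it pairs naturally with a uniqueness check before the merge.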

I'm not a big fan of losing data. So there are 2 possibilities:

Which do you think is easiest for you to do?