AlexisEidelman closed 8 years ago
Have you made progress?
If I can help, do not hesitate!
Yes, I think you can merge that PR, which is harmless.
My other PR on branch dev is also finished, I think. It may be harder to merge. I suggest you try to merge it (after this one) and then we talk to see what we should improve. What do you think?
I ran your PR. It seems there are some differences in the output file:
Edit:
If I look at the data with CODGEO == 781230106:
The 2012 census data are OK, by the way:
Maybe a bad merge somewhere. If I look at the 'ARR' and 'CV' features of this PR (for CODGEO == 781230106), they are NaN (should be 783 and 7816).
I think I found it. I forgot some corrections on LIBGEO in the 2012 data. I'll push something ASAP.
Note that, in any case, I think there should be slightly fewer rows in the output because of minor corrections on LIBGEO.
The PR now checks that CODGEO values are unique in the output.
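For readers following along, a uniqueness check of this kind could look roughly like the sketch below. This is not the code from the PR: the helper name `find_duplicated_codgeo` and the toy rows are made up for illustration, and only standard pandas calls are used.

```python
import pandas as pd


def find_duplicated_codgeo(df: pd.DataFrame, col: str = "CODGEO") -> pd.DataFrame:
    """Return every row whose `col` value appears more than once.

    Hypothetical helper, not taken from the PR. An empty result means
    the column is unique.
    """
    return df[df[col].duplicated(keep=False)]


# Toy rows, NOT the real output file.
output = pd.DataFrame(
    {
        "CODGEO": ["781230106", "831260503", "831260503"],
        "LIBGEO": ["name A", "name B", "name B (variant)"],
    }
)

duplicates = find_duplicated_codgeo(output)
print(duplicates)

# After keeping one row per CODGEO, the check comes back empty.
deduplicated = output.drop_duplicates(subset="CODGEO")
print(find_duplicated_codgeo(deduplicated).empty)
```

Returning the offending rows instead of just raising makes it easier to see which IRIS are duplicated and why.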
I'm fixing a minor bug, then it's ready.
Done.
Good job!
I checked the difference out of curiosity. We lost 15 rows (15 IRIS):
In the current output we have 137 duplicated IRIS (122 in your PR), so these 15 lost IRIS make sense.
But if I look closely at our 15 lost IRIS, we lose information because of inconsistent reference data in the IRIS files... For example, take IRIS 831260503
(one of our 15 lost IRIS):
So we lose the 2012 census data here because of the merge key:
key = ['CODGEO', 'LIBGEO', 'COM', 'LIBCOM', 'REG', 'REG2016', 'LAB_IRIS',
'DEP', 'UU2010', 'TRIRIS', 'GRD_QUART', 'TYP_IRIS', 'MODIF_IRIS']
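To illustrate why a wide merge key can drop census data, here is a small sketch with made-up values. The key is shortened to three of the columns above, `pop_2012` is a hypothetical column name, and the values are invented. The point is only that one key column formatted differently on each side is enough for the row not to match, which `indicator=True` makes visible.

```python
import pandas as pd

# Shortened version of the key above, for illustration only.
key = ["CODGEO", "UU2010", "GRD_QUART"]

# Toy rows with invented values, NOT the real files.
iris = pd.DataFrame(
    {
        "CODGEO": ["831260503"],
        "UU2010": ["83701"],
        "GRD_QUART": ["8312605"],
    }
)
census_2012 = pd.DataFrame(
    {
        "CODGEO": ["831260503"],
        "UU2010": ["83701.0"],  # same code, formatted differently
        "GRD_QUART": ["8312605"],
        "pop_2012": [2500],  # hypothetical census column
    }
)

# Merging on the full key: the UU2010 mismatch prevents the match.
merged = iris.merge(census_2012, on=key, how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["CODGEO", "pop_2012", "_merge"]])

# Merging on CODGEO only keeps the census value.
by_codgeo = iris.merge(
    census_2012[["CODGEO", "pop_2012"]], on="CODGEO", how="left"
)
print(by_codgeo)
```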
I'm not a big fan of losing data. So there are 2 possibilities:
compare_geo, which tries to improve the merging process. If we look closer, UU2010, GRD_QUART and LAB_IRIS are the problem. I think we could fix this by casting UU2010 and GRD_QUART to int. For LAB_IRIS (IRIS quality), maybe we can keep the 2012 value.
Which do you think is easiest for you to do?
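A rough sketch of the casting idea, under stated assumptions: the helper `normalize_key_columns` is hypothetical, the values are invented, `pop_2012` is a made-up column name, and the key columns are assumed to have no missing values (a plain `int64` cast fails on NaN). LAB_IRIS is left out of the key and taken from the 2012 table.

```python
import pandas as pd


def normalize_key_columns(df, int_cols=("UU2010", "GRD_QUART")):
    """Cast the given columns to int64 so both tables use the same dtype.

    Hypothetical helper. Assumes the columns contain no missing values.
    """
    out = df.copy()
    for col in int_cols:
        out[col] = pd.to_numeric(out[col]).astype("int64")
    return out


# Toy rows with invented values and deliberately mixed formats.
iris = pd.DataFrame(
    {
        "CODGEO": ["831260503"],
        "UU2010": ["83701"],
        "GRD_QUART": [8312605],
        "LAB_IRIS": ["1"],
    }
)
census_2012 = pd.DataFrame(
    {
        "CODGEO": ["831260503"],
        "UU2010": [83701.0],
        "GRD_QUART": ["8312605"],
        "LAB_IRIS": ["5"],
        "pop_2012": [2500],  # hypothetical census column
    }
)

# LAB_IRIS is not part of the key; the 2012 value is the one kept.
key = ["CODGEO", "UU2010", "GRD_QUART"]
left = normalize_key_columns(iris).drop(columns="LAB_IRIS")
right = normalize_key_columns(census_2012)
merged = left.merge(right, on=key, how="left")
print(merged)
```

One caveat with this variant: an IRIS with no 2012 match would end up with an empty LAB_IRIS, so a fallback to the other table's value may still be needed.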
There are some modifications in the output. There are also things to do for the 2012 table (update the 2011 values or merge on CODGEO only, I presume).