etalab / geohisto

[UNMAINTAINED] Historic information for French regions, counties, overseas collectivities and towns based on INSEE and Wikipedia data, exported as (re)usable CSV files.

Initial intercommunalities support #53

Closed noirbizarre closed 7 years ago

noirbizarre commented 7 years ago

This PR adds initial intercommunalities support.

This is a first iteration lacking:

Sources

Sources are downloaded, extracted and reformatted from "Listes et compositions des EPCI à fiscalité propre publiées par la Direction Générale des collectivités locales".

There is both French and English documentation for those, but some sections still need to be written (there are some TODOs).

There is an intermediate EPCI.ods with an embedded Python macro to help with formatting and generating the CSVs in a proper format.

$ wc -l sources/epci/????.csv
   19129 sources/epci/1999.csv
   21348 sources/epci/2000.csv
   23498 sources/epci/2001.csv
   26871 sources/epci/2002.csv
   29755 sources/epci/2003.csv
   31429 sources/epci/2004.csv
   32309 sources/epci/2005.csv
   32924 sources/epci/2006.csv
   33414 sources/epci/2007.csv
   33639 sources/epci/2008.csv
   34167 sources/epci/2009.csv
   34775 sources/epci/2010.csv
   35042 sources/epci/2011.csv
   35306 sources/epci/2012.csv
   36050 sources/epci/2013.csv
   36615 sources/epci/2014.csv
   36589 sources/epci/2015.csv
   35859 sources/epci/2016.csv
   35412 sources/epci/2017.csv
  604131 total

Uniformized sources will be published back on data.gouv.fr

Note: Sources are sorted on the siren field. This is very important because the processing relies on it.
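
Why the sort order matters can be illustrated with a minimal sketch, assuming the processing streams each yearly file and groups consecutive rows by siren. The column names `siren` and `insee`, the helper `members_by_siren` and the sample rows are hypothetical, not taken from the real files. `itertools.groupby` only merges adjacent rows, so an unsorted file would split one EPCI into several groups:

```python
import csv
import io
from itertools import groupby

# Hypothetical sample; the real files live in sources/epci/<year>.csv.
SAMPLE = """siren,insee
200000001,01001
200000001,01002
200000002,01003
"""


def members_by_siren(lines):
    """Yield (siren, [insee codes]) for rows already sorted on siren."""
    reader = csv.DictReader(lines)
    # groupby only groups *consecutive* rows sharing the same key.
    for siren, rows in groupby(reader, key=lambda row: row["siren"]):
        yield siren, [row["insee"] for row in rows]


groups = list(members_by_siren(io.StringIO(SAMPLE)))
```

With sorted input, each siren comes out exactly once and no intermediate dictionary of all EPCIs is needed.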

Processing

Processing takes about three and a half hours and represents 604112 calls to Towns.valid_at(), which explains the duration. Consequently, the intercommunalities processing has been made optional behind the --intercommunalities flag.
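
The call count can be cross-checked against the `wc -l` output above. This is a small sketch assuming each of the 19 yearly files carries exactly one header line (an assumption, although the numbers agree with it) and, for the per-call estimate, that the whole run time is spent in these calls:

```python
# Line counts reported by `wc -l sources/epci/????.csv`, 1999 to 2017.
LINE_COUNTS = [
    19129, 21348, 23498, 26871, 29755, 31429, 32309, 32924, 33414, 33639,
    34167, 34775, 35042, 35306, 36050, 36615, 36589, 35859, 35412,
]

total_lines = sum(LINE_COUNTS)
# Assumption: one header line per file, so one lookup per remaining row.
data_rows = total_lines - len(LINE_COUNTS)

# Rough average cost per lookup for a run of about 3.5 hours.
seconds_per_call = (3.5 * 3600) / data_rows
```

Under those assumptions each lookup costs roughly 21 ms on average, which is why the number of calls dominates the run time.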

The name format changes from one year to another and there are some typos, which is why we try to fix and normalize each name. The normalization process still needs some improvement on (among others):

Sometimes an acronym is provided in the name, in parentheses or brackets. We try to extract and keep it properly when possible. This seems to be valuable information because some people only know the EPCI by its acronym rather than its full name.
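
As an illustration only, a trailing acronym could be separated from the name with a regular expression. This is a minimal sketch and not the PR's `extract_names` code: the helper `split_acronym`, the pattern and the example names are all hypothetical.

```python
import re

# Trailing upper-case token in parentheses or square brackets,
# e.g. "Some Name (ABC)" or "Some Name [ABC]".
ACRONYM_RE = re.compile(
    r"^(?P<name>.*?)\s*[(\[](?P<acronym>[A-Z0-9][A-Z0-9.\-]{1,})[)\]]\s*$"
)


def split_acronym(raw):
    """Return (name, acronym); acronym is None when nothing matches."""
    cleaned = raw.strip()
    match = ACRONYM_RE.match(cleaned)
    if not match:
        return cleaned, None
    return match.group("name"), match.group("acronym")


with_parens = split_acronym("Communauté de Communes Exemple (CCE)")
with_brackets = split_acronym("Communauté de Communes Exemple [CCE]")
without = split_acronym("Syndicat (mixte) Exemple")
```

Lower-case or mid-name parentheses are deliberately left alone, since they are more likely part of the name than an acronym.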

Export

The export is fully documented in exports/epci, following the format of the other exports.

Extra additions

This PR also adds some logging in order to see the progress of processing. The logging is handled by click-log which provides sane defaults for click integration. This adds an extra -v/--verbosity LVL option where LVL is one of CRITICAL, ERROR, WARNING, INFO or DEBUG (case does not matter) and defaults to INFO.
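
The behaviour of the option (case-insensitive level name, INFO by default) can be mimicked with the standard library alone. The sketch below is a stand-in built on argparse, not click-log's API and not the PR's code; `parse_level` and `build_parser` are hypothetical names.

```python
import argparse
import logging

LEVELS = ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG")


def parse_level(value):
    """Map a level name to its logging constant, ignoring case."""
    name = value.upper()
    if name not in LEVELS:
        raise argparse.ArgumentTypeError(
            "LVL must be one of " + ", ".join(LEVELS)
        )
    return getattr(logging, name)


def build_parser():
    """Build a parser exposing the two flags discussed in this PR."""
    parser = argparse.ArgumentParser(prog="geohisto")
    parser.add_argument("--intercommunalities", action="store_true")
    parser.add_argument("-v", "--verbosity", metavar="LVL",
                        type=parse_level, default=logging.INFO)
    return parser


args = build_parser().parse_args(["--intercommunalities", "-v", "debug"])
logging.basicConfig(level=args.verbosity)
```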

Result

The last execution produced the following on a Core i7 with 16 GB of RAM:

$ time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Apply special cases
Compute ancestors
Update the parents for each town.
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
debug: Load intercommunalities from sources/epci/2004.csv
debug: Load intercommunalities from sources/epci/2005.csv
debug: Load intercommunalities from sources/epci/2006.csv
debug: Load intercommunalities from sources/epci/2007.csv
debug: Load intercommunalities from sources/epci/2008.csv
debug: Load intercommunalities from sources/epci/2009.csv
debug: Load intercommunalities from sources/epci/2010.csv
debug: Load intercommunalities from sources/epci/2011.csv
debug: Load intercommunalities from sources/epci/2012.csv
debug: Load intercommunalities from sources/epci/2013.csv
debug: Load intercommunalities from sources/epci/2014.csv
debug: Load intercommunalities from sources/epci/2015.csv
debug: Load intercommunalities from sources/epci/2016.csv
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing towns head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv

real    215m2,377s
user    212m53,826s
sys 0m15,816s

43244 exports/epci/epci.csv

Possible improvements

Note

Sorry for the added size to the repository, but I can't see any option other than providing the source and export files.

davidbgk commented 7 years ago

Sorry for the added size to the repository, but I can't see any option other than providing the source and export files.

Maybe at least split the PR with a first commit which only adds sources?

davidbgk commented 7 years ago

How can I review the EPCI.ods file? I don't get the necessity for it?

davidbgk commented 7 years ago

Can you try adding an lru_cache on the valid_at function to see if it reduces the time? (let's say with maxsize=None at first 😉 )

noirbizarre commented 7 years ago
  1. I will :+1:
  2. there is no change at all to the town generation; the diff is only messy because Collections and Items were factored out, but the methods are left unchanged
  3. loading the files does not take much time; it is the Town.valid_at() querying that takes a lot of time (as written in the PR). Splitting this into 2 steps means one of the following:

    • storing 604112 objects in memory plus the towns
    • writing and reloading the intermediate step (which means taking the same amount of time)

    Right now, the process avoids that by using a generator
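
The single-pass approach can be sketched as a chain of generators. This is only an illustration under assumptions: `load_rows`, `resolve`, the column names and `fake_lookup` (a stand-in for the town lookup) are hypothetical, not the PR's code.

```python
import csv
import io


def load_rows(files):
    """Lazily yield (year, row) pairs, one parsed CSV row at a time."""
    for year, handle in files:
        for row in csv.DictReader(handle):
            yield year, row


def resolve(rows, lookup):
    """Resolve each row as it streams by; no intermediate list is built."""
    for year, row in rows:
        yield year, row["siren"], lookup(row["insee"], year)


def fake_lookup(insee, year):
    """Hypothetical stand-in for the expensive town lookup."""
    return f"{insee}@{year}"


files = [
    (1999, io.StringIO("siren,insee\n200000001,01001\n")),
    (2000, io.StringIO("siren,insee\n200000001,01001\n200000001,01002\n")),
]
results = list(resolve(load_rows(files), fake_lookup))
```

Only the final `list(...)` materializes anything here; in a real run the consumer would write each result out as it arrives.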

Maybe at least split the PR with a first commit which only adds sources?

I don't see the point but I will :+1:

How can I review the EPCI.ods file? I don't get the necessity for it?

Fetch the branch and open it in LibreOffice or OpenOffice. Just try to reformat and serialize 19 CSV files into a common format (multiple times) and run some searches in the files, and you will see the necessity ;) To me it was a huge gain in time and comfort, but I can drop it from the PR

Can you try to add a lru_cache on valid_at

I don't think this will do anything more than fill the memory and introduce overhead: each town is looked up only once for a given year, because each town can only be in a single intercommunality at a time. :cry:
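
The argument can be checked on a toy case: when every (town, year) key is requested exactly once, a cache never gets a hit and only stores results. The function below is a hypothetical stand-in for the real lookup, and the town codes are made up.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=None)
def valid_at(depcom, year):
    """Stand-in for the expensive lookup; counts real executions."""
    global calls
    calls += 1
    return f"{depcom}@{year}"


# Each (town, year) pair appears exactly once, as argued above.
pairs = [(depcom, year)
         for year in (1999, 2000, 2001)
         for depcom in ("01001", "01002", "01003")]
for depcom, year in pairs:
    valid_at(depcom, year)

info = valid_at.cache_info()
```

Every call is a miss, so the cache grows to one entry per call without saving any work.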

davidbgk commented 7 years ago

each town is looked up only once for a given year, because each town can only be in a single intercommunality at a time.

Crap :(

What about iterating over towns and injecting those into the given EPCIs? If there are fewer of them, that might be faster.

noirbizarre commented 7 years ago

This is even worse: there are fewer lines in each intercommunality year than there are towns, but iterating over towns means iterating over every file (except the years after a given validity) once per town, which generates even more iterations (I've already examined this possibility) :(

noirbizarre commented 7 years ago

Last executions

Filter on year

$ time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv

real    258m12,440s
user    159m38,164s
sys 0m7,019s
43243 exports/epci/epci.csv

Sorts removed

$ time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv

real    154m38,616s
user    154m24,401s
sys 0m13,624s
40495 exports/epci/epci.csv

Notes

The resulting line count differs because of some additions in extract_names (which means more convergence).

Removing the extra sorts and applying some tuning gives a gain of 1h!

These changes don't break the tests nor introduce any changes in exports/towns/towns.csv

Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003. This town is:

I think this has something to do with https://github.com/etalab/geohisto/blob/master/geohisto/specials.py#L432-L459 which doesn't handle the 14697 INSEE code

noirbizarre commented 7 years ago

I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)

davidbgk commented 7 years ago

Before I forget, it deserves a 10.0.0 within the CHANGELOG 🎉

davidbgk commented 7 years ago

I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)

The problem with that file is that it cannot be reviewed and does not display diffs on update, so I'm in favor of having it as an external resource, but I'm still not sure about its usage. Is it useful for the input data sources or for dealing with the exported ones? Can I have a 5 min demo tomorrow?

davidbgk commented 7 years ago

Removing the extra sorts and applying some tuning gives a gain of 1h!

Not bad 😅

Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003.

Alright, this is probably a bug on the towns' side; cases handled as "specials" are manual fixes. Can you file an issue about that?

noirbizarre commented 7 years ago

Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003.

Alright, this is probably a bug on the towns' side; cases handled as "specials" are manual fixes. Can you file an issue about that?

Done: #54

Before I forget, it deserves a 10.0.0 within the CHANGELOG :tada:

Added in the last commit :+1:

I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)

The problem with that file is that it cannot be reviewed and does not display diffs on update, so I'm in favor of having it as an external resource, but I'm still not sure about its usage. Is it useful for the input data sources or for dealing with the exported ones? Can I have a 5 min demo tomorrow?

Will be published with the CSVs on Data.gouv.fr here (private right now) and removed from the PR. The macro has been split out and published as a gist.

I can do a five-minute demo, but there is not much to demonstrate. This allows one to:

davidbgk commented 7 years ago

My proposal for the sorting part:

diff --git a/geohisto/actions.py b/geohisto/actions.py
index 5ea7f4c..44cace2 100644
--- a/geohisto/actions.py
+++ b/geohisto/actions.py
@@ -4,6 +4,7 @@ Perform actions on towns given the modifications' types.
 The modifications come from the history of changes.
 """
 import logging
+from collections import OrderedDict

 from .constants import (
     CHANGE_COUNTY, CHANGE_COUNTY_CREATION, CHANGE_NAME, CHANGE_NAME_CREATION,
@@ -395,3 +396,9 @@ def compute(towns, history):
         except Exception as e:
             print(record)
             raise e
+
+    # Sort towns by id at the end of all computations, useful for tests.
+    tmp = [(id_, towns[id_]) for id_ in towns]
+    tmp.sort()
+    towns.clear()
+    towns.update(OrderedDict(tmp))
diff --git a/geohisto/models.py b/geohisto/models.py
index b10fc31..a8987ac 100644
--- a/geohisto/models.py
+++ b/geohisto/models.py
@@ -52,9 +52,9 @@ class CollectionMixin:
                                                 new_successor.id)
                 self.upsert(_item)

-    def filter(self, sort=True, **filters):
+    def filter(self, **filters):
         """
-        Return a (sorted) list of items with the given filters applied.
+        Return a list of items with the given filters applied.

         Useful to look up by `depcom`, `nccenr` and so on.
         """
@@ -62,8 +62,6 @@ class CollectionMixin:
                   for item in self.values()
                   for k, v in filters.items()
                   if getattr(item, k) == v]
-        if sort:
-            _items.sort()  # Useful for tests. (but very bad for performances)
         return _items

     def with_successors(self):
@@ -74,7 +72,7 @@ class CollectionMixin:
 class Towns(CollectionMixin, OrderedDict):
     def latest(self, depcom):
         """Get the most recent town for a given `depcom`."""
-        _towns = self.filter(depcom=depcom, sort=False)  # Sort once
+        _towns = self.filter(depcom=depcom)
         _towns.sort(key=lambda town: town.end_datetime, reverse=True)
         return _towns[0]

@@ -82,7 +80,7 @@ class Towns(CollectionMixin, OrderedDict):
         """Return a list of Towns existing at the given `valid_datetime`."""
         # Beware, ternary operator is tricky here, keep it explicit.
         if depcom:
-            _towns = self.filter(depcom=depcom, sort=False)
+            _towns = self.filter(depcom=depcom)
         else:
             _towns = self.values()
         return [town for town in _towns if town.valid_at(valid_datetime)]
@@ -249,7 +247,7 @@ class Intercommunalities(CollectionMixin, defaultdict):
         """Get the latest valid intercommunality for a given `siren`."""
         # No need to sort, there is theoricaly only one town
         # with a given siren at a time
-        _items = self.filter(siren=siren, end_date=END_DATE, sort=False)
+        _items = self.filter(siren=siren, end_date=END_DATE)
         return _items[0]

     def valid_at(self, valid_date, siren=None):
@@ -265,8 +263,7 @@ class Intercommunalities(CollectionMixin, defaultdict):

     @property
     def open_sirens(self):
-        return set(item.siren for item in self.filter(end_date=END_DATE,
-                                                      sort=False))
+        return set(item.siren for item in self.filter(end_date=END_DATE))

     def ends(self, siren, year, reason):
         intercommunality = self.latest(siren)
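
The "sort once at the end" idea from the diff above can be tried in isolation. This is a minimal sketch with a hypothetical helper name and toy data; it sorts on the key alone, which makes explicit that the stored values are never compared.

```python
from collections import OrderedDict


def sort_by_key_in_place(mapping):
    """Reorder an OrderedDict by key while keeping the same object."""
    ordered = sorted(mapping.items(), key=lambda pair: pair[0])
    mapping.clear()
    mapping.update(ordered)


towns = OrderedDict([("B", 2), ("A", 1), ("C", 3)])
alias = towns  # other references must still see the sorted content
sort_by_key_in_place(towns)
```

Clearing and refilling the same object, rather than rebinding the name, keeps every existing reference to the collection valid.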

noirbizarre commented 7 years ago

Diff implemented and improved (see last commits)

Tests duration

• Initially: 304.52 seconds
• Diff applied (sort once at actions.compute): 319.55 seconds
• Use generators: 224.33 seconds
• Tune actions.compute sort: 210.43 seconds

Full execution with changes

$ time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv

real    84m21,209s
user    83m18,115s
sys 0m6,150s
40495 exports/epci/epci.csv

Obviously, generators did the job!
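
The commits themselves are not shown in this thread, so the following is only a guess at the kind of change involved: a filter that yields matches lazily lets a caller stop at the first hit instead of building a full list. `Item`, `filter_items` and the sample values are hypothetical, and unlike the list comprehension in the diff above, this sketch requires every filter to match.

```python
from collections import namedtuple

# Hypothetical record type standing in for an intercommunality.
Item = namedtuple("Item", "siren end_date name")


def filter_items(items, **filters):
    """Lazily yield items whose attributes match every filter."""
    for item in items:
        if all(getattr(item, key) == value
               for key, value in filters.items()):
            yield item


items = [
    Item("1", 2005, "old"),
    Item("1", 9999, "current"),
    Item("2", 9999, "other"),
]

# Stops scanning as soon as the first match is found.
latest = next(filter_items(items, siren="1", end_date=9999), None)
still_open = {item.siren for item in filter_items(items, end_date=9999)}
```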

There seem to be some more lines in the exports, see the last commit diff. I think this fixed some edge cases (the few cases I checked are not false positives).

PS: if you have some additions, please provide an easily reusable format (commit, PR to my PR...), not a copy-pasted diff