Closed noirbizarre closed 7 years ago
Maybe at least split the PR with a first commit which only adds sources?
How can I review the EPCI.ods file? I don't get the necessity for it?
Can you try to add an lru_cache on the valid_at function to see if it reduces time? (let's say with maxsize=None at first 😉)
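For reference, a minimal sketch of what such a cache could look like. It does not use the real geohisto models: `TOWNS`, the sample codes and the standalone `valid_at` function are hypothetical stand-ins, and the only point is the decorator usage (the keyword is `maxsize`, and all arguments must be hashable).

```python
from datetime import datetime
from functools import lru_cache

# Hypothetical stand-in data: (depcom, start, end) validity intervals.
TOWNS = (
    ("01001", datetime(1943, 1, 1), datetime(2002, 12, 31)),
    ("01002", datetime(1943, 1, 1), datetime(9999, 12, 31)),
)


@lru_cache(maxsize=None)  # unbounded cache, keyed on (depcom, valid_datetime)
def valid_at(depcom, valid_datetime):
    """Return the intervals of `depcom` that contain `valid_datetime`."""
    return tuple(
        (code, start, end)
        for code, start, end in TOWNS
        if code == depcom and start <= valid_datetime <= end
    )
```

A second call with the same `(depcom, valid_datetime)` pair is served from the cache; `valid_at.cache_info()` reports how often that actually happens.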
Collections and Items but methods are left unchanged
Files loading does not take much time; it is the Town.valid_at() querying that takes a lot of time (as written in the PR). Splitting this in 2 steps means one of the following:
Right now, the process avoids that by using a generator.
Maybe at least split the PR with a first commit which only adds sources?
I don't see the point but I will :+1:
How can I review the EPCI.ods file? I don't get the necessity for it?
Fetch the branch and open it in LibreOffice or OpenOffice. Just try to reformat and serialize 19 CSV files using a common format (multiple times), and also perform some searches in the files, and you will see the necessity ;) To me it was a huge gain of time and comfort, but I can discard it from the PR.
Can you try to add a lru_cache on valid_at
I don't think this will do anything more than load the memory and introduce overhead: each town is looked up only once for a given year, because each town can only be in a single intercommunality at once. :cry:
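The argument can be checked on a toy case. This sketch assumes, as stated above, that every (town, year) pair is queried exactly once; `lookup` and the sample codes are hypothetical and stand in for the real query.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lookup(depcom, year):
    """Hypothetical stand-in for an expensive per-town, per-year query."""
    return (depcom, year)


# Each (depcom, year) pair is requested exactly once.
for year in (1999, 2000):
    for depcom in ("01001", "01002", "01003"):
        lookup(depcom, year)

info = lookup.cache_info()
print(info.hits, info.misses, info.currsize)
```

With no repeated key, the cache records only misses: it stores every result without ever serving one back, so it costs memory and bookkeeping for no gain.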
each town is looked up only once for a given year, because each town can only be in a single intercommunality at once.
Crap :(
What about iterating on towns and injecting those into the given EPCIs? If there are fewer of them, that might be faster.
This is even worse: there are fewer lines in each intercommunality year than towns, but iterating over towns means iterating over each file (except years after a given validity) as many times as there are towns, and this generates even more iterations (I've already examined this possibility) :(
$ time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv
real 258m12,440s
user 159m38,164s
sys 0m7,019s
43243 exports/epci/epci.csv
time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv
real 154m38,616s
user 154m24,401s
sys 0m13,624s
40495 exports/epci/epci.csv
Resulting line number differs because of some additions in extract_names (which means more convergence).
Removing the extra sorts and applying some tuning gives a gain of 1h!
These changes neither break the tests nor introduce any changes in exports/towns/towns.csv
Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003. This town is:
I think this has something to do with https://github.com/etalab/geohisto/blob/master/geohisto/specials.py#L432-L459 which doesn't handle the 14697 INSEE code
I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)
Before I forget, it deserves a 10.0.0 within the CHANGELOG 🎉
I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)
The problem with that file is that it can neither be reviewed nor display diffs on update, so I'm in favor of having it as an external resource, but I'm still not sure of the usage. Is it useful for input data sources or to deal with the exported ones? Can I have a 5 min demo tomorrow?
Removing the extra sorts and applying some tuning gives a gain of 1h!
Not bad 😅
Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003.
Alright, this is probably a bug on the towns' side; cases handled as "specials" are manual fixes. Can you file an issue about that?
Splitting on year introduces one missing town for INSEE code 14697 starting in year 2003.
Alright, this is probably a bug on the towns' side; cases handled as "specials" are manual fixes. Can you file an issue about that?
Done: #54
Before I forget, it deserves a 10.0.0 within the CHANGELOG :tada:
Added in the last commit :+1:
I can remove the EPCI.ods from the PR and only submit it on Data.gouv.fr (just an idea)
The problem with that file is that it can neither be reviewed nor display diffs on update, so I'm in favor of having it as an external resource, but I'm still not sure of the usage. Is it useful for input data sources or to deal with the exported ones? Can I have a 5 min demo tomorrow?
It will be published with the CSVs on Data.gouv.fr (private right now) and removed from the PR. The macro has been split out and published as a gist.
I can do a five-minute demo but there is not much to demonstrate. It allows one to:
My proposal for the sorting part:
diff --git a/geohisto/actions.py b/geohisto/actions.py
index 5ea7f4c..44cace2 100644
--- a/geohisto/actions.py
+++ b/geohisto/actions.py
@@ -4,6 +4,7 @@ Perform actions on towns given the modifications' types.
The modifications come from the history of changes.
"""
import logging
+from collections import OrderedDict
from .constants import (
CHANGE_COUNTY, CHANGE_COUNTY_CREATION, CHANGE_NAME, CHANGE_NAME_CREATION,
@@ -395,3 +396,9 @@ def compute(towns, history):
except Exception as e:
print(record)
raise e
+
+ # Sort towns by id at the end of all computations, useful for tests.
+ tmp = [(id_, towns[id_]) for id_ in towns]
+ tmp.sort()
+ towns.clear()
+ towns.update(OrderedDict(tmp))
diff --git a/geohisto/models.py b/geohisto/models.py
index b10fc31..a8987ac 100644
--- a/geohisto/models.py
+++ b/geohisto/models.py
@@ -52,9 +52,9 @@ class CollectionMixin:
new_successor.id)
self.upsert(_item)
- def filter(self, sort=True, **filters):
+ def filter(self, **filters):
"""
- Return a (sorted) list of items with the given filters applied.
+ Return a list of items with the given filters applied.
Useful to look up by `depcom`, `nccenr` and so on.
"""
@@ -62,8 +62,6 @@ class CollectionMixin:
for item in self.values()
for k, v in filters.items()
if getattr(item, k) == v]
- if sort:
- _items.sort() # Useful for tests. (but very bad for performances)
return _items
def with_successors(self):
@@ -74,7 +72,7 @@ class CollectionMixin:
class Towns(CollectionMixin, OrderedDict):
def latest(self, depcom):
"""Get the most recent town for a given `depcom`."""
- _towns = self.filter(depcom=depcom, sort=False) # Sort once
+ _towns = self.filter(depcom=depcom)
_towns.sort(key=lambda town: town.end_datetime, reverse=True)
return _towns[0]
@@ -82,7 +80,7 @@ class Towns(CollectionMixin, OrderedDict):
"""Return a list of Towns existing at the given `valid_datetime`."""
# Beware, ternary operator is tricky here, keep it explicit.
if depcom:
- _towns = self.filter(depcom=depcom, sort=False)
+ _towns = self.filter(depcom=depcom)
else:
_towns = self.values()
return [town for town in _towns if town.valid_at(valid_datetime)]
@@ -249,7 +247,7 @@ class Intercommunalities(CollectionMixin, defaultdict):
"""Get the latest valid intercommunality for a given `siren`."""
# No need to sort, there is theoricaly only one town
# with a given siren at a time
- _items = self.filter(siren=siren, end_date=END_DATE, sort=False)
+ _items = self.filter(siren=siren, end_date=END_DATE)
return _items[0]
def valid_at(self, valid_date, siren=None):
@@ -265,8 +263,7 @@ class Intercommunalities(CollectionMixin, defaultdict):
@property
def open_sirens(self):
- return set(item.siren for item in self.filter(end_date=END_DATE,
- sort=False))
+ return set(item.siren for item in self.filter(end_date=END_DATE))
def ends(self, siren, year, reason):
intercommunality = self.latest(siren)
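The idea behind the last `actions.py` hunk, reordering the mapping once instead of sorting on every `filter()` call, can be sketched on a plain `OrderedDict`. The helper name `sort_in_place` and the sample ids are hypothetical; unlike the diff, this version sorts on the key only, so the values never need to be comparable.

```python
from collections import OrderedDict


def sort_in_place(mapping):
    """Reorder an OrderedDict by key while keeping the same object."""
    # Sort on the key only, so values are never compared with each other.
    items = sorted(mapping.items(), key=lambda pair: pair[0])
    mapping.clear()
    mapping.update(items)
    return mapping


towns = OrderedDict([("01002@1943", "b"), ("01001@1943", "a"), ("01003@1943", "c")])
sort_in_place(towns)
print(list(towns))
```

Because the object is mutated rather than replaced, any code that already holds a reference to it sees the sorted order afterwards.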
Diff implemented and improved (see last commits)
Initially: 304.52 seconds
Diff applied (sort once at actions.compute): 319.55 seconds
Use generators: 224.33 seconds
Tune actions.compute sort: 210.43 seconds
time python -m geohisto --intercommunalities -v debug && wc -l exports/epci/epci.csv
Loading towns
Loading history
Loading populations
Loading counties
Computing history from actions
Applying special cases
Computing ancestors
Updating parents
Processing intercommunalities from sources/epci (1999-2017)
debug: Load intercommunalities from sources/epci/1999.csv
debug: Load intercommunalities from sources/epci/2000.csv
debug: Load intercommunalities from sources/epci/2001.csv
debug: Load intercommunalities from sources/epci/2002.csv
debug: Load intercommunalities from sources/epci/2003.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2003-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2004.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2004-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2005.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2005-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2006.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2006-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2007.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2007-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2008.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2008-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2009.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2009-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2010.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2010-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2011.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2011-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2012.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2012-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2013.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2013-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2014.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2014-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2015.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2015-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2016.csv
error: Failed for Communauté de Communes des Trois Rivières on 14697@2016-01-01T00:00:00
debug: Load intercommunalities from sources/epci/2017.csv
Writing towns file to exports/communes/communes.csv
Writing exports/communes/communes.csv head file to exports/communes/communes_head.csv
Writing intercommunalities file to exports/epci/epci.csv
Writing exports/epci/epci.csv head file to exports/epci/epci_head.csv
real 84m21,209s
user 83m18,115s
sys 0m6,150s
40495 exports/epci/epci.csv
Obviously, generators did the job!
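A rough sketch of the difference, not the code from the commits: `Item`, `filter_sorted` and `filter_lazy` are hypothetical names, and both variants use `all()` so that every filter must match, which is an assumption about the intended behaviour.

```python
from collections import namedtuple

Item = namedtuple("Item", "id depcom")


def filter_sorted(items, **filters):
    """Eager variant: build the full list, then sort it on every call."""
    found = [item for item in items
             if all(getattr(item, k) == v for k, v in filters.items())]
    found.sort()
    return found


def filter_lazy(items, **filters):
    """Lazy variant: yield matches one by one, without sorting."""
    for item in items:
        if all(getattr(item, k) == v for k, v in filters.items()):
            yield item


items = [
    Item("01002@1999", "01002"),
    Item("01001@2003", "01001"),
    Item("01001@1999", "01001"),
]
first = next(filter_lazy(items, depcom="01001"))  # stops at the first match
```

The lazy variant does no sorting and allocates no intermediate list, and a caller that only needs the first match never scans the rest of the collection.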
There seem to be some more lines in the exports, see the last commit diff. I think this fixed some edge cases (the few cases I checked are not false positives).
PS: if you have some additions, please provide an easily reusable format (commit, PR to my PR...), not a copy-pasted diff
This PR adds initial intercommunalities support.
This is a first iteration lacking:
sources/epci/fusions.csv completion
sources/epci/fusions.csv
pmun instead of the provided ptot
Sources
Sources are downloaded, extracted and reformatted from "Listes et compositions des EPCI à fiscalité propre publiées par la Direction Générale des collectivités locales".
There is both French and English documentation for those, but some sections still need to be written (there are some TODOs).
There is an intermediate EPCI.ods with an embedded Python macro to help with the formatting and the generation of the CSVs in a proper format.
Uniformized sources will be published back on data.gouv.fr
Note: Sources are sorted on the siren field. This is very important because the processing relies on it.
Processing
Processing takes about three and a half hours and represents 604112 calls to Towns.valid_at(), which explains the duration. Consequently, the intercommunalities processing has been made optional with the --intercommunalities flag.
Name formats change from one year to another and there are some typos, which is why we try to fix and normalize each name. The normalization process still needs some improvement on (among others):
Sometimes, an acronym is provided in the name, between parentheses or brackets. We try to properly extract and keep it when possible. This seems to be valuable information because some people only know the EPCI acronym instead of the full name.
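As an illustration only, acronym extraction of this kind could look like the sketch below. The pattern, the `split_acronym` helper and the sample name are hypothetical and are not the PR's actual implementation; the pattern only accepts an uppercase acronym in trailing parentheses or brackets.

```python
import re

# Hypothetical pattern: a trailing "(ACRONYM)" or "[ACRONYM]" made of
# uppercase letters, digits, spaces, dots, apostrophes or hyphens.
ACRONYM_RE = re.compile(r"\s*[\(\[]\s*([A-Z0-9][A-Z0-9 .'-]*)\s*[\)\]]\s*$")


def split_acronym(raw_name):
    """Return (name, acronym); acronym is None when none is found."""
    match = ACRONYM_RE.search(raw_name)
    if not match:
        return raw_name.strip(), None
    return raw_name[:match.start()].strip(), match.group(1).strip()


# Made-up name, used only to exercise the helper.
name, acronym = split_acronym("Communauté de communes de l'Exemple (CCE)")
```

A lowercase parenthetical is left in the name untouched, since it is more likely a comment than an acronym.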
Export
The export is fully documented in exports/epci, keeping the format of the other exports.
Extra additions
This PR also adds some logging in order to see the progress of the processing. The logging is handled by click-log, which provides sane defaults for click integration. This adds an extra -v/--verbosity LVL option where LVL is one of CRITICAL, ERROR, WARNING, INFO or DEBUG (case does not matter) and defaults to INFO.
Result
The last execution produced the following on a Core i7 with 16 GB of RAM:
Possible improvements
siren field before exporting
Note
Sorry for the added size to the repository but I can't see any other option than providing the source and export files