Open dpprdan opened 5 years ago
@KonradUdoHannes @AlexandraKapp Might be relevant for datenguidepy as well.
What is the source of regions.csv anyway? Is it possible to retrieve that information directly from the API as well?
What is the source of regions.csv anyway? Is it possible to retrieve that information directly from the API as well?
It's the same .csv file datenguidepy uses: https://github.com/CorrelAid/datenguide-python/tree/master/datenguidepy
regions.csv is a snapshot from the API that is simply stored for processing reasons. Essentially it can be obtained by querying all regions without any statistics and then analyzing the results with respect to the id codes. Most importantly it simplifies the parent behavior of the all regions query. The parent argument in the all regions query of the underlying API is a slight misnomer as it behaves not as parent but as ancestor in general yielding all lower hierarchy levels. The parent column in the regions.csv file only makes the connection to direct descendant, i.e. child regions, in order to enable a more granular selection of regions.
I'm also not sure the general "id" issues is something we want to address, but it might depend on whether I understand it correctly. In general the wrapper aims to supply full fidelity data from regionalstatistik.de, not doing any data cleaning on our side, leaving this to the user of the package. In combination with region definitions/ids not being constant throughout time, this leads to situations where one might find regions ids that have some historic data but no recent data anymore. When looking at regions in isolation without statistics and years one might therefore notice regions that appear several times, although I would expect that for a given statistic and year there is only data present for at most one of the occurrences.
with region definitions/ids not being constant throughout time, this leads to situations where one might find regions ids that have some historic data but no recent data anymore. When looking at regions in isolation without statistics and years one might therefore notice regions that appear several times, although I would expect that for a given statistic and year there is only data present for at most one of the occurrences.
Correct. E.g. you will find data for "03152" (Landkreis Göttingen) and "03156" (Landkreis Osterode am Harz) up until October 2016 and for "03159" (the new Landkreis Göttingen) as of November 2016.
[Der Landkreis Göttingen] entstand am 1. November 2016 durch die Fusion des bisherigen Landkreises Göttingen mit dem Landkreis Osterode am Harz. https://de.wikipedia.org/wiki/Landkreis_G%C3%B6ttingen
And I agree that it is essential that regions.csv contains all of these. Or more generally: The regions.csv/regions dataframe should contain all AGS/id
s which were in use since 1995 (?) until today and for which one can still retrieve data from regionalstatistik.de.
But a change in id
s due to a Gebietsänderung is not the issue here. The issue is that regionalstatistik.de dropped support for länderspezifische Gebietsschlüssel for Gemeinden in Gemeindeverbänden in Niedersachsen and Rheinland-Pfalz and that this change is not reflected in regions.csv. I.e. regions.csv contains outdated id
s that are not used by regionalstatistik.de anymore (i.e. you cannot retrieve any data with these anymore). In addition, regions.csv does not contain the correct AGS for these municipalities, which you now need to retrieve data. E.g. it does not contain "03353035" for Tostedt. That's why I think this needs to be addressed. (regionalstatistik.de dropping support for this is a good thing BTW, because the id
s they used were not used by anybody else AFAIK. Now they are using standard AGS). If I am not mistaken we are talking about >3900 incorrect id
s (all `id's with 7, 10 or 11 chars and those with 8 chars which have a parent with 11 chars).
regions.csv is a snapshot from the API that is simply stored for processing reasons. Essentially it can be obtained by querying all regions without any statistics and then analyzing the results with respect to the id codes.
I guess you are refering to https://github.com/CorrelAid/datenguide-python/blob/master/allRegionsAnalysis.ipynb, @KonradUdoHannes? Anyway, since the data stems from the datenguide API it has to be fixed there first. @crijke, did I assume correctly, that https://github.com/datenguide/datenguide-api/issues/11 is the corresponding issue? Does this also fix the issue for https://github.com/datenguide/genesapi-tabular?
Relatedly there is the issue of keeping the regions info up-to-date in the R and Python packages. Apart from more general changes like this one, there are also Gebietsänderungen happening two to three times a year on average, some of which also change the AGS for some municipalities, seldomly also counties (like LK Göttingen mentioned above). Those changes will come apparent in the data roughly half a year to one year after the official change, when Destatis publishes data for the first time with the new AGS/id
s.
Does anybody know a clever way to automatically keep the region info in the packages in sync, short of manually updating the packages? Could that info be made available on the API side more easily (i.e. faster) for example, so that the packages can retrieve a current copy at the start of the session, @crijke? Or am I overthinking this?
The python package has (so far undocumented) functionality to update the regions.csv file. The reason this does not happen automatically at the moment is that the run time is currently sub optimal and it takes about 2-3 minutes, which I don't want to bother the user with on every package load. I think there are straightforward ways to improve the run time and then maybe update the .csv on package load, but I just didn't have the chance to implement it yet. As far as I know the R package currently uses a regions.csv snapshot generated inside the python package. Improvements here would be a matter of re implementing the same functionality in R.
As for the datenguide/datenguide-api#11 issue I believe that is related buy might not be the only source of differences for regions. The other source will probably be synchronization between datenguide and regionalstatistik.de. The datenguide API provides only access to a snapshot of regionalstatistik.de. Consequently recent changes might not be reflected before the snapshot is updated again. Currently this happens regularly but not on a fixed schedule as far as I know.
I'm not sure how the relationship is to the gensisapi-tabular issue, but it is in part a reason we created the regions.csv that turned the API behavior, which interprets parent as ancestor into a 1-generation parent relationship. The hope is that this can be used to more robustly identify the regions one wants to query and then simply loop over their ids using the R or python package and simple single region requests. In particular one can worry about the intricacies of the regions before doing the actual data request, which might be beneficial for larger requests.
@dpprdan Thanks for all the feedback :+1:
The datenguide API provides only access to a snapshot of regionalstatistik.de.
The datenguide API should be updated nightly now: https://github.com/datenguide/datenguide-api/issues/107
it takes about 2-3 minutes
Is that mostly the response time from the API? That's what I was assuming when I wrote:
Could that info be made available on the API side more easily (i.e. faster)
Getting the direct parent should be pretty straightforward (now that regionalstatistik.de is not using these weird shortened Regionalschlüssel anymore), since it is encoded in the AGS. The only problem is the Statistische Regionen (AGS of length 3), which do not exist everywhere, so you'd have to use a conditional there.
There are errors in regions.csv/
dg_regions
with respect to the AGS/id
s of Gemeinden in Gemeindeverbänden in Lower Saxony and Rhineland-Palatinate if I'm not entirely mistaken.Regionaldatenbank Deutschland used to contain data for both Gemeindeverbände as well as Gemeinden in Lower Saxony and Rhineland-Palatinate.
dg_regions
contains the correspondingid
s:"03353406" is a (shortened) Regionalschlüssel of Samtgemeinde Tostedt, the Gemeindeverband, and "03353406035" is the (shortened) "Regionalschlüssel of Tostedt. ("shortened", because both are missing a "5" in sixth place, but this is how they were used in Regionaldatenbank Deutschland. I don't know/recall the reason for this).
However, these are not used by Regionaldatenbank Deutschland anymore, see the https://www.regionalstatistik.de/ homepage:
So unless these
id
s are somehow necessary for backward compatibility, I think they can be omitted fromdg_region
/regions.csv.More importantly, however, the correct AGS/
id
s are missing for Gemeinden in Gemeindeverbänden in the abovementioned Bundesländer. Tostedt's actual AGS is 03353035, Cochem's is 07135020, for example, see e.g. Destatis' Gemeindeverzeichnis. These are also the ones used by Regionalstatistik, see e.g. table 12411-01-01-5.I strongly suspect that this issue has to be fixed upstream as well. For example
returns
"region": null
on https://api-next.datengui.de/graphql, whereas the same query with id: "03353406" or "03353406035" returnsid
andname
(but noBEVSTD
data). The same is returned for Gemeinden in Gemeindeverbänden by the following query (the first six municipalities are not in Gemeindeverbänden, so their AGS is correct and hence data is returned).The corresponding query on tabular.genesapi.org returns the appropriate data, but shows the
id
s as labels for the Gemeinden in Gemeindeverbänden and not theirname
s.Querying for the region
id
directly returns data for 03353035 and not data for 03353406 or 03353406035.