CorrelAid / datenguideR

R wrapper for the datengui.de GraphQL API to easily access German regional statistics
GNU General Public License v3.0
26 stars 3 forks source link

Incorrect `id`s in `dg_regions` #18

Open dpprdan opened 5 years ago

dpprdan commented 5 years ago

There are errors in regions.csv/dg_regions with respect to the AGS/ids of Gemeinden in Gemeindeverbänden in Lower Saxony and Rhineland-Palatinate if I'm not entirely mistaken.

Regionaldatenbank Deutschland used to contain data for both Gemeindeverbände as well as Gemeinden in Lower Saxony and Rhineland-Palatinate. dg_regions contains the corresponding ids:

library(datenguideR)
get_region(name, "Tostedt")
#> # A tibble: 2 x 4
#>   id          name    level parent  
#>   <chr>       <chr>   <chr> <chr>   
#> 1 03353406    Tostedt lau   03353   
#> 2 03353406035 Tostedt lau   03353406
get_region(name, "Cochem")
#> # A tibble: 2 x 4
#>   id         name   level parent 
#>   <chr>      <chr>  <chr> <chr>  
#> 1 0713500020 Cochem lau   07135  
#> 2 0713501020 Cochem lau   0713501

"03353406" is a (shortened) Regionalschlüssel of Samtgemeinde Tostedt, the Gemeindeverband, and "03353406035" is the (shortened) "Regionalschlüssel of Tostedt. ("shortened", because both are missing a "5" in sixth place, but this is how they were used in Regionaldatenbank Deutschland. I don't know/recall the reason for this).

However, these are not used by Regionaldatenbank Deutschland anymore, see the https://www.regionalstatistik.de/ homepage:

Ab 01.02.2019 wird der gesamte Datenbestand der Regionaldatenbank Deutschland mit den korrekten AGS-Gebietsschlüsseln gemäß Gemeindeverzeichnis angeboten. Von den Änderungen sind alle Gemeindetabellen für die Länder Niedersachsen und Rheinland-Pfalz betroffen, welche bisher länderspezifische Gebietsschlüssel enthielten. Die Ebene der Gemeindeverbände (LAU 1) wird für beide Länder nicht mehr nachgewiesen, da diese Ebene nicht im AGS abgebildet wird.

So unless these ids are somehow necessary for backward compatibility, I think they can be omitted from dg_region/regions.csv.

More importantly, however, the correct AGS/ids are missing for Gemeinden in Gemeindeverbänden in the abovementioned Bundesländer. Tostedt's actual AGS is 03353035, Cochem's is 07135020, for example, see e.g. Destatis' Gemeindeverzeichnis. These are also the ones used by Regionalstatistik, see e.g. table 12411-01-01-5.

I strongly suspect that this issue has to be fixed upstream as well. For example

{
  region(id: "03353035") {
    id
    name
    BEVSTD {
      year
      value
    }
  }
}

returns "region": null on https://api-next.datengui.de/graphql, whereas the same query with id: "03353406" or "03353406035" returns id and name (but no BEVSTD data). The same is returned for Gemeinden in Gemeindeverbänden by the following query (the first six municipalities are not in Gemeindeverbänden, so their AGS is correct and hence data is returned).

{
  allRegions(itemsPerPage: 50) {
    page
    total
    itemsPerPage
    regions(lau:1, parent:"03353") {
      id
      name
      BEVSTD(year: 2017) {
        year
        value
      }
    }
  }
}

The corresponding query on tabular.genesapi.org returns the appropriate data, but shows the ids as labels for the Gemeinden in Gemeindeverbänden and not their names.

Querying for the region id directly returns data for 03353035 and not data for 03353406 or 03353406035.

lhehnke commented 5 years ago

@KonradUdoHannes @AlexandraKapp Might be relevant for datenguidepy as well.

dpprdan commented 5 years ago

What is the source of regions.csv anyway? Is it possible to retrieve that information directly from the API as well?

lhehnke commented 5 years ago

What is the source of regions.csv anyway? Is it possible to retrieve that information directly from the API as well?

It's the same .csv file datenguidepy uses: https://github.com/CorrelAid/datenguide-python/tree/master/datenguidepy

KonradUdoHannes commented 5 years ago

regions.csv is a snapshot from the API that is simply stored for processing reasons. Essentially it can be obtained by querying all regions without any statistics and then analyzing the results with respect to the id codes. Most importantly it simplifies the parent behavior of the all regions query. The parent argument in the all regions query of the underlying API is a slight misnomer as it behaves not as parent but as ancestor in general yielding all lower hierarchy levels. The parent column in the regions.csv file only makes the connection to direct descendant, i.e. child regions, in order to enable a more granular selection of regions.

I'm also not sure the general "id" issues is something we want to address, but it might depend on whether I understand it correctly. In general the wrapper aims to supply full fidelity data from regionalstatistik.de, not doing any data cleaning on our side, leaving this to the user of the package. In combination with region definitions/ids not being constant throughout time, this leads to situations where one might find regions ids that have some historic data but no recent data anymore. When looking at regions in isolation without statistics and years one might therefore notice regions that appear several times, although I would expect that for a given statistic and year there is only data present for at most one of the occurrences.

dpprdan commented 5 years ago

with region definitions/ids not being constant throughout time, this leads to situations where one might find regions ids that have some historic data but no recent data anymore. When looking at regions in isolation without statistics and years one might therefore notice regions that appear several times, although I would expect that for a given statistic and year there is only data present for at most one of the occurrences.

Correct. E.g. you will find data for "03152" (Landkreis Göttingen) and "03156" (Landkreis Osterode am Harz) up until October 2016 and for "03159" (the new Landkreis Göttingen) as of November 2016.

[Der Landkreis Göttingen] entstand am 1. November 2016 durch die Fusion des bisherigen Landkreises Göttingen mit dem Landkreis Osterode am Harz. https://de.wikipedia.org/wiki/Landkreis_G%C3%B6ttingen

And I agree that it is essential that regions.csv contains all of these. Or more generally: The regions.csv/regions dataframe should contain all AGS/ids which were in use since 1995 (?) until today and for which one can still retrieve data from regionalstatistik.de.

But a change in ids due to a Gebietsänderung is not the issue here. The issue is that regionalstatistik.de dropped support for länderspezifische Gebietsschlüssel for Gemeinden in Gemeindeverbänden in Niedersachsen and Rheinland-Pfalz and that this change is not reflected in regions.csv. I.e. regions.csv contains outdated ids that are not used by regionalstatistik.de anymore (i.e. you cannot retrieve any data with these anymore). In addition, regions.csv does not contain the correct AGS for these municipalities, which you now need to retrieve data. E.g. it does not contain "03353035" for Tostedt. That's why I think this needs to be addressed. (regionalstatistik.de dropping support for this is a good thing BTW, because the ids they used were not used by anybody else AFAIK. Now they are using standard AGS). If I am not mistaken we are talking about >3900 incorrect ids (all `id's with 7, 10 or 11 chars and those with 8 chars which have a parent with 11 chars).

regions.csv is a snapshot from the API that is simply stored for processing reasons. Essentially it can be obtained by querying all regions without any statistics and then analyzing the results with respect to the id codes.

I guess you are refering to https://github.com/CorrelAid/datenguide-python/blob/master/allRegionsAnalysis.ipynb, @KonradUdoHannes? Anyway, since the data stems from the datenguide API it has to be fixed there first. @crijke, did I assume correctly, that https://github.com/datenguide/datenguide-api/issues/11 is the corresponding issue? Does this also fix the issue for https://github.com/datenguide/genesapi-tabular?

dpprdan commented 5 years ago

Relatedly there is the issue of keeping the regions info up-to-date in the R and Python packages. Apart from more general changes like this one, there are also Gebietsänderungen happening two to three times a year on average, some of which also change the AGS for some municipalities, seldomly also counties (like LK Göttingen mentioned above). Those changes will come apparent in the data roughly half a year to one year after the official change, when Destatis publishes data for the first time with the new AGS/ids.

Does anybody know a clever way to automatically keep the region info in the packages in sync, short of manually updating the packages? Could that info be made available on the API side more easily (i.e. faster) for example, so that the packages can retrieve a current copy at the start of the session, @crijke? Or am I overthinking this?

KonradUdoHannes commented 5 years ago

The python package has (so far undocumented) functionality to update the regions.csv file. The reason this does not happen automatically at the moment is that the run time is currently sub optimal and it takes about 2-3 minutes, which I don't want to bother the user with on every package load. I think there are straightforward ways to improve the run time and then maybe update the .csv on package load, but I just didn't have the chance to implement it yet. As far as I know the R package currently uses a regions.csv snapshot generated inside the python package. Improvements here would be a matter of re implementing the same functionality in R.

As for the datenguide/datenguide-api#11 issue I believe that is related buy might not be the only source of differences for regions. The other source will probably be synchronization between datenguide and regionalstatistik.de. The datenguide API provides only access to a snapshot of regionalstatistik.de. Consequently recent changes might not be reflected before the snapshot is updated again. Currently this happens regularly but not on a fixed schedule as far as I know.

I'm not sure how the relationship is to the gensisapi-tabular issue, but it is in part a reason we created the regions.csv that turned the API behavior, which interprets parent as ancestor into a 1-generation parent relationship. The hope is that this can be used to more robustly identify the regions one wants to query and then simply loop over their ids using the R or python package and simple single region requests. In particular one can worry about the intricacies of the regions before doing the actual data request, which might be beneficial for larger requests.

@dpprdan Thanks for all the feedback :+1:

dpprdan commented 5 years ago

The datenguide API provides only access to a snapshot of regionalstatistik.de.

The datenguide API should be updated nightly now: https://github.com/datenguide/datenguide-api/issues/107

it takes about 2-3 minutes

Is that mostly the response time from the API? That's what I was assuming when I wrote:

Could that info be made available on the API side more easily (i.e. faster)

Getting the direct parent should be pretty straightforward (now that regionalstatistik.de is not using these weird shortened Regionalschlüssel anymore), since it is encoded in the AGS. The only problem is the Statistische Regionen (AGS of length 3), which do not exist everywhere, so you'd have to use a conditional there.