HertieDataScience / SyllabusAndLectures

Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
MIT License
37 stars 60 forks source link

Using DeStatis dynamically #52

Closed LisaKatharina closed 8 years ago

LisaKatharina commented 8 years ago

Dear all,

Christopher and me are trying to work with DeStatis (Federal Statistical Office) data and have troubles to dynamically load it. To get to a specific data set you have to click through several pages and at the end there is a button for each of the file formats (see here e.g.: https://www-genesis.destatis.de/genesis/online/data;jsessionid=89DA4117413DDF218439C1FCC916F34B.tomcat_GO_2_1?operation=abruftabelleBearbeiten&levelindex=2&levelid=1445934305668&auswahloperation=abruftabelleAuspraegungAuswaehlen&auswahlverzeichnis=ordnungsstruktur&auswahlziel=werteabruf&selectionname=13231-0001&auswahltext=&werteabruf=Werteabruf). There is no link to the respective button. DeStatis offers an API, but it only is available to registered users which comes with a fee of at leat 50 euros/year (https://www-genesis.destatis.de/genesis/online;jsessionid=89DA4117413DDF218439C1FCC916F34B.tomcat_GO_2_1?Menu=Webservice). Does anyone of you have the same problem? I guess if they are offering an API, retrieving the data "manually" is not that easy... Any help is much appreciated!

All the best Lisa

christophergandrud commented 8 years ago

It's really unfortunate that a public agency like DeStatis charges for the use of its API.

I'm not sure if you'll be able to get around this, but as a first attempt I right-clicked on the CSV icon on the DeStatis link above and went to View Source. This is what I found:

destatis_post

You can see that clicking the link actually makes a POST request to their RestAPI (see last lecture). The URL following action contains the parameters of that POST. Take a look at a discussion in another thread (https://github.com/HertieDataScience/SyllabusAndLectures/issues/43) for links to possible ways to interact with this API.

Granted, I don't know if it will be possible to actually do this or if DeStatis make it impossible without paying. Worse comes to worse for this class, download the data manually and fully document the process in your Repo. (Also, advocate that DeStatis adopt public data best practices :smile:)

LarsMehwald commented 8 years ago

Dear Lisa, Daniel and I encountered the same problem. After working on it, the following works for us and data about German graduates from the site https://www.regionalstatistik.de (quite similar to DeStatis):

Retrieving the URL

1) Inspecting the CSV image 2) Manually extract the URL that is contained in form 3) Inserting the URL into the browser 4) Waiting until the browser is redirected to another URL (in order to retrieve the file) and displays the SAVE AS option 5) Again, extract manually this new (redirected) URL from the browser (this URL can be used in R)

R code

URL <- "https://www.regionalstatistik.de/genesis/online/data/192-71-4.csv;jsessionid=F5CCA2AABEA06725D3A3C95CEC32EAE1?operation=ergebnistabelleDownload&levelindex=3&levelid=1445953287057&option=csv&doDownload=csv&contenttype=csv" Graduates <- read.csv(file = URL, sep=";", na.strings=c("-", "."), header = FALSE, skip = 10, nrows = 524, col.names = c("Year",

some more names

                                "GraduatesWithHochschulreifeDegreeFemale")
                  )

The options used help to format this untidy CSV into a good data frame: Skip leaves out the first rows (weird header), nrows limits the number of rows read in (and thereby getting rid of the untidy part in the end of the document), header and col.names formats the column names correctly.

A problem however remains: the link is only valid for a certain period of time (I ran it once and it worked, I ran it shortly afterwards and the link was not valid anymore). So fully documented, but not replicable..

I hope this helped.

christophergandrud commented 8 years ago

Thanks @LarsMehwald, good documentation. My guess about the time limit has to do with the jsessionid. They probably give each access to the site an ID that only works once or for a limited period of time. To get a persistent ID you likely have to pay for access to the API.