covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License
364 stars, 179 forks

Scraper for DEU (Germany) regions #79

Closed: lazd closed this issue 4 years ago

lazd commented 4 years ago

We need regional data for DEU.

hyperknot commented 4 years ago

datawrapper.de writes at https://blog.datawrapper.de/coronaviruscharts/#considerations:

For the state map of Germany, we use numbers from the Robert Koch Institute. This source is official, but updated slowly in comparison to e.g. this map by ZEIT Online. The exact German locations of coronavirus cases came from René Engmann and his website coronavirus.jetzt. He and his team collected the data in this GitHub repo and were amazingly quick at updating the data about Germany until they stopped doing so on March 11th. We still left the outdated maps in here.

jgehrcke commented 4 years ago

I also found that zeit.de is currently the best source in Germany. I am using it in https://github.com/jgehrcke/covid-19-analysis/blob/master/screenshot.png

I think they (zeit.de) collaborate with individual state ministries and therefore the numbers are pretty official. They are however much more real-time than the numbers published by Robert Koch Institute (the RKI needs about a day before they process and publish).

jgehrcke commented 4 years ago

I am going to help with that @lazd -- my email address is jgehrcke@googlemail.com

jgehrcke commented 4 years ago

Unfortunately, I could not reach the people behind https://www.coronavirus.jetzt/. Right now zeit.de reports a larger count than coronavirus.jetzt. Given that zeit.de is one of the largest quality newspapers in Germany, I will proceed under the assumption that zeit.de is one of the best data sources to bet on for the coming hours, days, and weeks.

jgehrcke commented 4 years ago

Here we go, a preview:

$ curl https://covid19-germany.appspot.com/now 2> /dev/null | jq
{
  "last_update_from_source_iso8601": "2020-03-17T20:03:41+00:00",
  "source": "zeit.de",
  "total_cases_confirmed_until_now": 9293
}

I built https://covid19-germany.appspot.com/now and plan to maintain it. Goals: data quality, data freshness, and a stable interface for scrapers such as yours here!

Right now the source is, as argued above, zeit.de. Zeit.de gets their data from the individual health ministries in the federal states of Germany.

This is served by Google App Engine. That is, I expect great availability. Code can be found here: https://github.com/jgehrcke/covid-19-germany-gae


Edit: For the historical data / time evolution for Germany, I think https://github.com/CSSEGISandData/COVID-19 still works well.

--

Edit 2: working towards an HTTP interface for a scraper for this project.

--

Edit 3: current HTTP API response example:

$ curl https://covid19-germany.appspot.com/now 2> /dev/null | jq
{
  "current_totals": {
    "cases": 9348,
    "deaths": 25,
    "recovered": 72,
    "tested": "unknown"
  },
  "meta": {
    "contact": "Dr. Jan-Philip Gehrcke, jgehrcke@googlemail.com",
    "source": "zeit.de (aggregates data from individual ministries of health in Germany)",
    "time_source_last_consulted_iso8601": "2020-03-18T00:11:24+00:00",
    "time_source_last_updated_iso8601": "2020-03-17T21:22:00+01:00"
  }
}

Working on a PR now for adding a scraper for that.
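
A scraper consuming the response shown above might parse it roughly as follows. This is only a sketch: the function name `parse_now` and the derived field `source_age_seconds` are illustrative assumptions, not part of the API or of the scraper PR, and the sample payload is the response above with the contact field trimmed.

```python
import json
from datetime import datetime

# Sample payload copied from the response above (contact field omitted).
SAMPLE = """
{
  "current_totals": {"cases": 9348, "deaths": 25, "recovered": 72, "tested": "unknown"},
  "meta": {
    "source": "zeit.de (aggregates data from individual ministries of health in Germany)",
    "time_source_last_consulted_iso8601": "2020-03-18T00:11:24+00:00",
    "time_source_last_updated_iso8601": "2020-03-17T21:22:00+01:00"
  }
}
"""


def parse_now(payload: str) -> dict:
    """Extract counts and timestamps from a /now response (hypothetical helper)."""
    doc = json.loads(payload)
    totals = doc["current_totals"]
    meta = doc["meta"]

    def count(value):
        # The response uses the string "unknown" where no figure is available.
        return value if isinstance(value, int) else None

    # Both timestamps carry a UTC offset, so subtracting them is well defined.
    consulted = datetime.fromisoformat(meta["time_source_last_consulted_iso8601"])
    updated = datetime.fromisoformat(meta["time_source_last_updated_iso8601"])

    return {
        "cases": count(totals["cases"]),
        "deaths": count(totals["deaths"]),
        "recovered": count(totals["recovered"]),
        "tested": count(totals["tested"]),
        # How old the source data was when the API last consulted it.
        "source_age_seconds": (consulted - updated).total_seconds(),
    }


if __name__ == "__main__":
    print(parse_now(SAMPLE))
```

Mapping "unknown" to `None` keeps the count fields numeric-or-missing, so downstream code does not have to special-case a string.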

hyperknot commented 4 years ago

@jgehrcke can you get their county level data? As is visible on this page: https://www.coronavirus.jetzt/karten/deutschland/

jgehrcke commented 4 years ago

@hyperknot I am working on that. There is no official, credible, non-PDF data source at the county level.

The data on the site you have linked is out of date. I tried to contact the people behind that yesterday and the day before, without success. I consider this project dead (I think coronavirus.jetzt has never been serious in the first place, at least in terms of data).

Germany's data processing is stone-age.

The data comes from the individual local health offices (Gesundheitsämter), of which we have hundreds. Individual data points are then aggregated by authorities at the county level. This is where we need to hook in. I am on that, and I will expose time-resolved county-level data via https://covid19-germany.appspot.com, too.

Until then, the country-level data is good enough, isn't it? See https://github.com/lazd/coronadatascraper/pull/117

hyperknot commented 4 years ago

@jgehrcke of course, thanks a lot for the effort! I didn't think it'd be so difficult.

jgehrcke commented 4 years ago

Announcing the HTTP API that provides time series data for individual German states: https://gehrcke.de/2020/03/covid-19-http-api-german-states-timeseries/

Wikunia commented 4 years ago

Data per county is available here: https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/917fc37a709542548cc3be077a786c17_0

Wikunia commented 4 years ago

This data is only a snapshot of the current state, so I don't know how to create a time series from it.

hyperknot commented 4 years ago

All our sources are only snapshots; the scraper takes care of saving the history.
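
The idea of turning repeated snapshots into a time series can be sketched as below. This is not the project's actual implementation; the function names, the region codes used as keys, and the case numbers are all made up for illustration.

```python
from collections import defaultdict


def record_snapshot(history, date, snapshot):
    """Merge one day's snapshot into the per-location history.

    history:  dict mapping location -> {date: cases}
    date:     ISO date string, e.g. "2020-03-17"
    snapshot: dict mapping location -> cases as reported on that date
    """
    for location, cases in snapshot.items():
        # Re-recording the same date simply overwrites the earlier value.
        history[location][date] = cases
    return history


def as_series(history, location):
    """Return (date, cases) pairs in chronological order."""
    # ISO date strings sort chronologically when sorted as plain text.
    return sorted(history[location].items())


if __name__ == "__main__":
    history = defaultdict(dict)
    # Hypothetical numbers, purely for illustration.
    record_snapshot(history, "2020-03-18", {"DE-BB": 12, "DE-BE": 30})
    record_snapshot(history, "2020-03-17", {"DE-BB": 10, "DE-BE": 20})
    print(as_series(history, "DE-BB"))
```

Keying each location's history by date means a source that only ever publishes "now" still yields a series, as long as the scraper runs at least once per day.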

Wikunia commented 4 years ago

As there is already a scraper for DEU, what is the best way to add this? By adding another folder in scrapers?

hyperknot commented 4 years ago

I don't know, we should ask @lazd about that.

jgehrcke commented 4 years ago

@Wikunia Hey, I am only now seeing that our work overlapped a little bit here (I did have this in progress, but first worked on a consolidated, fresh data source itself, as I wrote above... whoopsie). Thanks for your help!

Two remarks about the data set that we want to use / ingest here:

Wikunia commented 4 years ago

Hi @jgehrcke, yes, I just found the data and created a scraper from it without thinking too much about it :smile: I think this community effort is awesome, but it creates overlapping work like this, where one can't say precisely which option is better. Maybe some people want the RKI data as a reliable source; it lags behind a bit, but in a few weeks that won't matter anymore. The county level might be interesting for some, but the general international public probably doesn't care. It would be nice to have all datasets included, perhaps with an API where such criteria can be specified. Maybe @lazd has some comments on this.

jgehrcke commented 4 years ago

Okay, thanks @Wikunia.

If in a couple of weeks it turns out that, because of certain developments, we'd like to switch the data source, then we can certainly do that. I am not married to maintaining a data set :D, there are nicer things in life. For the next few days, though, to keep the momentum (and to support it with quality data!), let's keep using the fresher data as published by the individual state health ministries (Gesundheitsministerien). Is that okay with you? Also, in terms of willingness and ability to curate things, I'd prefer to continue using https://github.com/jgehrcke/covid-19-germany-gae for now.

dadosch commented 4 years ago

Please see https://github.com/corona-zahlen-landkreis/corona_landkreis_fallzahlen_scraping/tree/master/landkreise/data. We scrape the counties' websites to get more up-to-date data per county, or at finer levels. The identifier used is the official municipality key (Amtlicher Gemeindeschlüssel): https://de.wikipedia.org/wiki/Amtlicher_Gemeindeschl%C3%BCssel
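
For readers unfamiliar with the Amtlicher Gemeindeschlüssel (AGS): it is an eight-digit key whose first two digits identify the federal state, the third the government district, the fourth and fifth the county, and the last three the municipality. A small sketch of splitting such a key is below; the function name and the returned field names are assumptions for illustration and do not come from the linked repository.

```python
def split_ags(ags: str) -> dict:
    """Split an 8-digit AGS into its hierarchical parts (hypothetical helper).

    The key must be handled as a string: leading zeros are significant
    and would be lost if it were parsed as an integer.
    """
    if len(ags) != 8 or not (ags.isascii() and ags.isdigit()):
        raise ValueError(f"expected an 8-digit AGS string, got {ags!r}")
    return {
        "state": ags[:2],          # federal state
        "district": ags[:3],       # state + government district
        "county": ags[:5],         # five-digit county key
        "municipality": ags,       # full eight-digit key
    }


if __name__ == "__main__":
    print(split_ags("09162000"))
```

Because the first five digits form the county key, data keyed by full AGS can be rolled up to county or state level by grouping on a prefix.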

martiL commented 4 years ago

@lazd Is there a data scraper for German regions? I see an open pull request, but I don't see it in the data.

jgehrcke commented 4 years ago

@lazd Is there a data scraper for German regions? I see an open pull request, but I don't see it in the data.

Should be a matter of hours now. We iterated on the implementation quite a bit. https://github.com/lazd/coronadatascraper/pull/196

Edit: merged! :rocket:

martiL commented 4 years ago

@jgehrcke I am so grateful for everything you do! We are trying to create a landing page and API to make it easier for people to consume and contribute data to this amazing project :-)

https://corona-api-landingpage.netlify.com/

martiL commented 4 years ago

[Screenshot, 2020-03-24 14:11:31]

@jgehrcke @lazd The coordinates for the German region BB seem to be wrong. You can reproduce it here: https://corona-api-dashboard.netlify.com/
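
The thread does not say what exactly was wrong with the coordinates, but errors of this kind can often be caught with a simple bounds check. The sketch below assumes an approximate bounding box for Germany and also tests for swapped latitude and longitude, a common slip since formats such as GeoJSON order coordinates as longitude, latitude. The function name, the box values, and the test point are illustrative assumptions.

```python
# Approximate bounding box for Germany in decimal degrees (assumed values).
LAT_MIN, LAT_MAX = 47.2, 55.1
LON_MIN, LON_MAX = 5.8, 15.1


def check_coordinates(lat: float, lon: float) -> str:
    """Classify a coordinate pair as "ok", "swapped", or "out_of_bounds"."""
    if LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX:
        return "ok"
    # The same pair with the roles reversed falls inside the box:
    # latitude and longitude were most likely swapped.
    if LAT_MIN <= lon <= LAT_MAX and LON_MIN <= lat <= LON_MAX:
        return "swapped"
    return "out_of_bounds"


if __name__ == "__main__":
    # Rough, illustrative point inside the box, then the same point swapped.
    print(check_coordinates(52.4, 13.1))
    print(check_coordinates(13.1, 52.4))
```

A check like this could run over every region's coordinates before publishing, so a misplaced marker is flagged before it shows up on a map.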

jgehrcke commented 4 years ago

@martiL woof! I tried to double-check things and already knew that this is probably the most error-prone part :-) Fix here: https://github.com/lazd/coronadatascraper/pull/304

jgehrcke commented 4 years ago

@lazd @qgolsteyn I think we should close this issue for now ('done'), and create more fine-grained issues from here as they come up!

qgolsteyn commented 4 years ago

Sounds good!