ManuelB / covid-19-vis

This repository contains data visualizations based on RKI and DIVI using kepler.gl
Apache License 2.0

Use an HTML parser instead of Puppeteer to crawl data from the DIVI register #2

Closed kommander closed 4 years ago

kommander commented 4 years ago

Is your feature request related to a problem? Please describe. Puppeteer adds a lot of overhead when the scraper runs as a periodic crawling service.

https://github.com/ManuelB/covid-19-vis/tree/gh-pages/germany/divi-intensivregister-scrapper

Describe the solution you'd like: Use something like a Node HTML parser.
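
A minimal sketch of the idea, assuming the page serves its data as static HTML (if the data is only rendered client-side by JavaScript, a plain parser would not see it). It uses Python's standard-library `html.parser` rather than the Node parser suggested above. The table markup and column meanings in `SAMPLE_HTML` are hypothetical and do not reflect the actual DIVI page.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._cell = None   # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Hospital</th><th>ICU low care</th><th>ICU high care</th></tr>
  <tr><td>Example Klinik A</td><td>available</td><td>limited</td></tr>
  <tr><td>Example Klinik B</td><td>occupied</td><td>available</td></tr>
</table>
"""
```

Calling `parse_table(SAMPLE_HTML)` yields one list per table row, header included. No headless browser is started, which is the overhead this request is about.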

ManuelB commented 4 years ago

I currently copy the data manually from the Vega source of the "Kartenansicht" (map view) at https://www.divi.de/register/kartenansicht. The data behind that visualization is a lot more detailed. The simulation from yesterday can be seen here:

https://kepler.gl/demo?mapUrl=https://raw.githubusercontent.com/ManuelB/covid-19-vis/gh-pages/simulation/2020-03-28_Landkreise_Intensivbetten_Strong_Mitigation-1-month-keplergl-cache.json

I did not work on automation yet because the sources are changing too fast.

I will close this issue, but feel free to send me a pull request with an alternative way of scraping and I will integrate it.

kommander commented 4 years ago

Thanks for the fast answer. I am currently considering providing this as an API service while using it myself, and I am looking for someone who may have already done it before I sink work into it. This is not a critique, just a call for help. Leaving the issue open might help to find others. Thanks a lot for providing this!

EDIT:

"I did not work on automation yet because the sources are changing too fast."

That's why I want to optimise it, so that it can run frequently for updates. I have also contacted DIVI to ask whether I can get the raw data and scale it via an API on our servers.

ManuelB commented 4 years ago

@kommander you are doing a fantastic job!

ManuelB commented 4 years ago

@kommander I just figured out that the Kartenansicht does not contain the data anymore :-( I will try to contact DIVI and tell them that this data is crucial for decision makers.

kommander commented 4 years ago

That's what I was afraid of: too much traffic for them. If you can get them to give us the raw data via a private API, I can make it public via our API within a short time.

ManuelB commented 4 years ago

@kommander I don't think that it is a traffic problem. I think it is a data privacy issue.

kommander commented 4 years ago

They've been publishing it before, where does the sudden privacy concern come from? People avoiding areas with high infection rates should be good? Maybe something I am not seeing...

EDIT: Kartenansicht still works for me btw. EDIT2: view-source:https://diviexchange.z6.web.core.windows.net/report.html is still there.

ManuelB commented 4 years ago

@kommander yes, it is still there and it still contains all hospitals, but the fields "ICU low care free", "ICU high care free" and some others are no longer available at the hospital level.

I would also guess it was published by accident. Nevertheless, I already copied it and published yesterday's data in a machine-readable format in this Git repository:

https://github.com/ManuelB/covid-19-vis/blob/gh-pages/germany/divi-kartenansicht/DiviKartenansicht.csv
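
For anyone consuming such a file, here is a short sketch of reading a CSV with Python's standard `csv` module and checking which columns still carry data. The column names and values in `SAMPLE_CSV` are placeholders, since the real header of `DiviKartenansicht.csv` is not shown in this thread.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def columns_with_data(rows):
    """Return column names that hold a non-empty value in at least one row."""
    found = []
    for row in rows:
        for name, value in row.items():
            # DictReader stores surplus cells under the key None; skip those.
            if name is None or not isinstance(value, str):
                continue
            if value.strip() and name not in found:
                found.append(name)
    return found


# Placeholder data: the column names are hypothetical.
SAMPLE_CSV = (
    "hospital,icu_low_care,icu_high_care\n"
    "Example Klinik A,available,\n"
    "Example Klinik B,occupied,\n"
)
```

With this sample, `columns_with_data(load_rows(SAMPLE_CSV))` reports `hospital` and `icu_low_care` but not `icu_high_care`, which is the kind of check that shows when a field has been withdrawn from the source.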

People could simply forecast the figures from generally available data.

kommander commented 4 years ago

@ManuelB I see... 🤔 Any public statement already? I can't find any. Are they afraid that people will go to hospitals with free ICUs/ECMOs more often?

ManuelB commented 4 years ago

I don't think that I will get an answer on a Sunday. I tried to contact multiple organizations (Bundeswehr, Bundesministerium für Gesundheit, Robert-Koch-Institut, Statistisches Bundesamt). I would guess it is currently very difficult for them to distinguish between serious requests and fake news.

ManuelB commented 4 years ago

I am also currently trying to get an educated answer from the alumni mailing list of the "Deutsche Schülerakademie" to the question of whether it is dangerous to publish such data.

kommander commented 4 years ago

Yes, they are probably getting hundreds of inquiries. I wrote a kind email and am waiting patiently for a reply, trying not to bother them by phone. Meanwhile I am trying to find people who have contacts. I am now scraping what's there with an HTML parser anyway and will provide an API for it...

kommander commented 4 years ago

@ManuelB Here is the API. It is updated once an hour, so the API always serves the most recent data: https://warte.app/api/divi/icu-facilities
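
A minimal sketch of consuming the endpoint from Python, assuming it returns JSON. The response format is not described in this thread, so the code only decodes the body and makes no assumption about its fields. The `opener` parameter is a hypothetical hook added here so the function can be exercised without network access.

```python
import json
from urllib.request import urlopen

API_URL = "https://warte.app/api/divi/icu-facilities"


def fetch_icu_facilities(url=API_URL, opener=urlopen, timeout=10):
    """Fetch `url` and decode the response body as JSON.

    Assumes a UTF-8 encoded JSON response; raises ValueError
    (json.JSONDecodeError) if the body is not valid JSON.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Since the data is refreshed once an hour, polling more often than that should not yield newer results.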