ghostleyjim closed this issue 4 years ago
Hi Jim, this is quite an issue :(
From now on, the main source of information is the daily PDF report. This PDF is full of tables, and its structure changes from day to day. I'm now thinking about generating new datasets based on these PDFs, but I need help from you folks.
I'm thinking about a dataset with
After making those new datasets, we have to update the plot code @japhir.
Please be aware that their CSV headings are mixed up:

    Gemnr;Gemeente;BevAant;Aantal;Aantal per 100.000 inwoners
    796;'s-Hertogenbosch;67;155113;43.2

The BevAant (population) and Aantal (count) headings are swapped: 67 is the case count and 155113 is the population.
I 'fixed' this in our parser by doing:
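    // The "Aantal" and "BevAant" headings may be swapped in the source CSV.
    let count = getValue(row["Aantal"]);
    const p100k = getValue(row["Aantal per 100.000 inwoners"]);
    const bew = getValue(row["BevAant"]);
    if (p100k === 0) {
      // No cases per 100,000 inhabitants means the count must be zero.
      count = 0;
    } else if (bew < count) {
      // The case count is always below the population, so take the smaller value.
      count = bew;
    }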
The file I have for the 30th has the correct heading/data pairing, so this is not really a fix, but it works.
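The workaround above depends on the count being smaller than the population. A minimal Python sketch of an alternative check follows, assuming the "Aantal per 100.000 inwoners" column itself is reliable: compute the rate for both possible column assignments and keep the one that reproduces the published rate. The function name `resolve_count` and the `tolerance` parameter are hypothetical, and this is not the parser's actual code.

```python
import csv
import io


def resolve_count(row, tolerance=0.1):
    """Return (count, population) for one row of the RIVM CSV.

    Assumption: the "Aantal per 100.000 inwoners" column is correct, so the
    assignment of the two ambiguous columns that reproduces it is the right one.
    """
    a = float(row["Aantal"])
    b = float(row["BevAant"])
    rate = float(row["Aantal per 100.000 inwoners"])

    if rate == 0:
        # No reported cases: the larger value must be the population.
        return 0, int(max(a, b))

    # Try both (count, population) assignments and keep the one whose
    # computed rate matches the published rate within the tolerance.
    for count, population in ((a, b), (b, a)):
        if population > 0 and abs(count / population * 100_000 - rate) <= tolerance:
            return int(count), int(population)

    raise ValueError(f"Cannot reconcile columns for row: {row}")


# The sample row quoted above, with its swapped headings.
sample = (
    "Gemnr;Gemeente;BevAant;Aantal;Aantal per 100.000 inwoners\n"
    "796;'s-Hertogenbosch;67;155113;43.2\n"
)
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
print(resolve_count(rows[0]))  # → (67, 155113)
```

Because the check uses the rate rather than the heading names, it gives the same result for files with correct headings and files with swapped ones.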
This is nuts, why are they suddenly switching to only publishing hospitalised cases?
I'm receiving a lot of emails on this issue at the moment. I will start working on this (around 8 pm CEST). I will add new datasets and a plan to maintain this repo. Please drop ideas in this issue.
I put the question on Twitter; let's see if their social media team does something with it... If you need help with anything, let me know. However, I am not a professional software developer (a Discord Corona bot was my first Python project), so don't expect very fancy code from me. I will keep looking for another source to pull the information from; as soon as I find something, you will be the first to hear.
PyPDF2 looks promising for extracting content from PDFs. I'll check right now whether it's feasible for the RIVM reports.
EDIT: textract (https://textract.readthedocs.io/en/latest/) works for extraction, but it looks like the layout of the PDF is not entirely consistent between days (e.g. extra newlines). Scraping is going to be a challenge. I'm not sure whether it would be possible to extract information from the figures.
Here's a start: https://github.com/J535D165/CoronaWatchNL/blob/master/parse_pdf_report.py
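One way to cope with the extra newlines mentioned above is to collapse all whitespace before matching. The sketch below assumes the report text has already been extracted into a string (for example with textract). The helper names, labels, and numbers are made up for illustration and are not taken from parse_pdf_report.py or from a real report.

```python
import re


def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def find_number_after(label, text):
    """Return the first integer that follows `label`, or None if absent.

    Dutch reports may use "." as a thousands separator, so "1.234" -> 1234.
    """
    match = re.search(re.escape(label) + r"\D*?(\d[\d.]*)", normalize(text))
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))


# Made-up snippets imitating two days with different line breaks.
day_one = "Ziekenhuisopname 4.712\nOverleden 1.039"
day_two = "Ziekenhuisopname\n\n4.712\nOverleden\n1.039"

for text in (day_one, day_two):
    print(find_number_after("Ziekenhuisopname", text),
          find_number_after("Overleden", text))
```

Both snippets yield the same pair of numbers. This only handles whitespace differences; tables whose columns move or get renamed would still need per-layout handling.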
Shall we email databronnencovid19@rivm.nl or communicatieloket@rivm.nl? But not 50 people at the same time... Does one of you want to send it?
Already did yesterday morning:
From: David Stotijn dstotijn@gmail.com
To: databronnencovid19@rivm.nl
Date: Mar 30, 2020, 9:56 AM
Subject: Historical (time series) COVID-19 data
Mailed-by: gmail.com

Dear editors,
Following the publication of https://www.databronnencovid19.nl/, I would like to put the following to you:
The information available for download at https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus contains cumulative data on reported cases, per municipality. The daily "Epidemiologische situatie COVID-19 [datum].pdf" files additionally report: reported deceased patients, age distribution, sex distribution, and underlying conditions. The data in these PDFs is presented in tables, but because the layout of the PDFs is updated from time to time, it is difficult to "scrape" the data automatically.
My question: could RIVM make a data source available with the historical data as "time series" data points? That is: data from 27 February onwards, per day, retroactively corrected where applicable, with the number of cases (broken down by municipality, age group, sex), deceased patients (broken down by age group, sex, comorbidity), and hospital admissions (broken down by age group, sex, comorbidity).
Such a data source (e.g. an index pointing to daily CSV files) would be fantastic as a source for statistical analysis and visualisation, for data scientists and journalists as well as citizens with an interest in Dutch COVID-19 data. An example of such a government initiative in Italy: https://github.com/pcm-dpc/COVID-19.
Thank you in advance for your response.
Kind regards,
David Stotijn
Didn't get a response yet 😢
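Until such a time series exists, daily cumulative snapshots could be combined by hand. The Python sketch below assumes each day's file has already been parsed into a mapping from municipality to cumulative count. The function name `daily_increments` and all numbers are made up for illustration.

```python
from datetime import date


def daily_increments(snapshots):
    """Turn cumulative per-day counts into long-format time series rows.

    `snapshots` maps a date to {municipality: cumulative count}.
    Returns (date, municipality, cumulative, increase) tuples. The increase
    is None when no earlier value exists; a negative increase signals a
    retroactive correction in the source data.
    """
    rows = []
    previous = {}
    for day in sorted(snapshots):
        for name, total in sorted(snapshots[day].items()):
            increase = total - previous[name] if name in previous else None
            rows.append((day, name, total, increase))
        # Remember the latest known value for every municipality.
        previous = {**previous, **snapshots[day]}
    return rows


# Made-up numbers, for illustration only.
snapshots = {
    date(2020, 3, 29): {"Utrecht": 10, "Tilburg": 20},
    date(2020, 3, 30): {"Utrecht": 14, "Tilburg": 19},
}
for row in daily_increments(snapshots):
    print(row)
```

Keeping the cumulative value next to the increase makes retroactive corrections visible instead of silently hiding them.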
@lkleuver They published a new version of the data with corrected column names (19:20).
@ghostleyjim @dstotijn We have been in contact with RIVM. It hasn't been very satisfying so far. We offered them a team of data experts from Utrecht University to set up a data repository like the one in Italy. I will keep you updated.
I published 3 new datasets. One with the case counts per province (complete time-series), one with the hospitalized patients per municipality and one with age groups. What should we add next?
Please see the Readme for the overview.
Let's see what happens tomorrow at 2 p.m. ... Hopefully, all of your emails will result in clean and standardized datasets (not published in PDF format).
Stay safe.
I see that you moved the case counts per municipality into the Inactive category. Is this a temporary move while you are discussing the situation with RIVM? Or did they confirm that they are not planning to share these numbers anymore?
Not sure yet. I expect them to discontinue those numbers. There are some arguments for doing so, but let's see what happens today.
Through my wife, I have the phone number of RIVM's strategy officer. I obviously won't share it here, but @J535D165, are you interested in calling or messaging them?
Just dropping a note here that you guys are on top of things and I just would like to thank you for that :)
@lkleuver Please send me an email.
Thanks for publishing these! I was making daily clips/GIFs of the number of confirmed cases per municipality (absolute and per 100,000 inhabitants). Let's hope they standardize the publishing. At least it's not as bad as the UK, which publishes per health board, while neither geographic files of the health boards nor the number of people living in each board are publicly available.
Hi Jim, this is quite an issue :(
From now on, the main source of information is the daily PDF report. This PDF is full of tables, and its structure changes from day to day. I'm now thinking about generating new datasets based on these PDFs, but I need help from you folks.
I'm thinking about a dataset with
- Case counts per province.
- Hospitalized patients for each province.
- Age distributions
- ... (suggestions?)
Thanks for all the information that you have published so far. Unfortunately, RIVM replaced confirmed cases per municipality with hospitalized cases, even though their epidemiological report shows they have confirmed, hospitalized, and deceased counts per municipality.
Still, you can't do anything about that; we can only hope they will publish this information as well.
One suggestion for your dataset: fatalities and their distribution by underlying condition.
Update:
I will try to get in contact with RIVM again. In the meantime, Utrecht University has allocated some time to maintain this repo.
@J535D165 Have you tried to request the data via https://data.overheid.nl/? I have a contact there I can check with. There are already a few groups/people scraping for data when it is available in a database somewhere.
I hope RIVM will listen to you and publish the CSV file with #infected/municipality again, but I have low expectations. I also emailed them once to ask if they could deliver their data in a more 'modern' way (e.g. like Italy does), and after 5 days(!) I got an answer saying that they think it's already quite nice that they offer a daily CSV file. But they changed the format of this CSV quite often, and have now even changed the content, without thinking about possible 'automatic consumption' of their data. They (of course) still have the #infected/municipality data (see Map 2 in their PDF on https://www.rivm.nl/actuele-informatie-over-coronavirus/data); they just don't share it in CSV format anymore.
Regarding PDF readers (to extract data from the RIVM PDF), I have worked with, and have good experience with, this free tool: pdftoexcel
And maybe you could also contact @datadista, maker of this COVID19-in-Spain repo: https://github.com/datadista/datasets/tree/master/COVID%2019
I also used their dataset to make some Power BI reports: covid-19-analysis-of-number-of-deaths and coronavirus-covid-19-in-spain-power-bi
@J535D165 I sent you an email last week; did it arrive? Sorry for asking via this channel, but it feels like it ended up in your spam folder. [plz delete]
Hi all. RIVM made the disease counts available again. It will take some time to integrate this again in the repository. I will add it to the raw_data folder first.
Finally, the data is there (almost complete...). Check out https://github.com/J535D165/CoronaWatchNL/pull/119
Hello Jonathan,
Quick heads up: as you will see on the RIVM website, they changed the data from confirmed cases to the number of people hospitalized per municipality.
Kind regards, Jim