covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Add scraper for Hungary #406

Closed jzohrab closed 4 years ago

jzohrab commented 4 years ago

Original issue https://github.com/covidatlas/coronadatascraper/issues/660, transferred here on Saturday Apr 04, 2020 at 16:41 GMT


Location name

Hungary

Source URL

https://docs.google.com/spreadsheets/d/1e4VEZL1xvsALoOIq9V2SQuICeQrT5MtWfBm32ad7i8Q/edit#gid=311133316

Notes/comments

Entered daily from the official government website, which uses images.

jzohrab commented 4 years ago

(Transferred comment)

Where's the official government site? Who is maintaining the spreadsheet?

jzohrab commented 4 years ago

(Transferred comment)

The government website is https://koronavirus.gov.hu/

The country numbers are 100% mirrored in JHU, so I believe it makes no sense to scrape this source only for the country-level data.

What is not in JHU are the county case numbers, but they are published on the website as an image, so this project cannot scrape them until they are published in a machine-readable form. That spreadsheet could be a workaround for the image numbers, but it is only reliable as long as its owner keeps entering the numbers.
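For reference, a minimal sketch of how the spreadsheet workaround could be read, in Python. It assumes the sheet is shared publicly and relies on the usual Google Sheets `export?format=csv` URL pattern; `sheet_csv_url` and `parse_rows` are hypothetical helper names, and the column layout of the sheet is not known here, so rows are simply keyed by whatever the header row contains.

```python
import csv
import io
import re
from urllib.parse import parse_qs, urlparse


def sheet_csv_url(edit_url):
    """Turn a Google Sheets edit URL into a CSV export URL for the same tab."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", edit_url)
    if match is None:
        raise ValueError("not a Google Sheets URL: " + edit_url)
    sheet_id = match.group(1)
    # The tab id lives in the URL fragment (#gid=...); fall back to the first tab.
    gid = parse_qs(urlparse(edit_url).fragment).get("gid", ["0"])[0]
    return (
        "https://docs.google.com/spreadsheets/d/"
        + sheet_id
        + "/export?format=csv&gid="
        + gid
    )


def parse_rows(csv_text):
    """Parse downloaded CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    source = (
        "https://docs.google.com/spreadsheets/d/"
        "1e4VEZL1xvsALoOIq9V2SQuICeQrT5MtWfBm32ad7i8Q/edit#gid=311133316"
    )
    # Prints the export URL; fetching it is left to whatever HTTP client is in use.
    print(sheet_csv_url(source))
```

This only removes the parsing problem, not the reliability one: the data is still only as current as the last manual entry.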

jzohrab commented 4 years ago

(Transferred comment)

Out of curiosity, has anyone tried running something like this through Tesseract?

jzohrab commented 4 years ago

(Transferred comment)

This might be quite doable with Tesseract, I believe, but the project's policy so far is not to do OCR.

jzohrab commented 4 years ago

(Transferred comment)

Where are the policies published?

jzohrab commented 4 years ago

(Transferred comment)

Just informally on the Slack channel. There was a scraper that tried to use Tesseract, and I think it's still under review.

If you feel like writing a Tesseract sample script for that image and submitting a PR, it might be a good reference for other scrapers. But I cannot promise it will get merged; that depends on the team.
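As a starting point, such a sample script might look like the sketch below, in Python. It is not the project's method and makes several assumptions: that `pytesseract`, Pillow, and Tesseract's Hungarian language data (`hun`) are installed, and that the OCR output has one "county name, then count" pair per line, which has not been checked against the actual image. `ocr_image` and `parse_county_lines` are hypothetical names.

```python
import re

# Assumed line shape: a non-numeric county name, whitespace, then an integer count.
LINE_RE = re.compile(r"^(?P<name>\D+?)\s+(?P<count>\d+)$")


def ocr_image(path, lang="hun"):
    """Run Tesseract on an image file and return the recognized text."""
    # Imported lazily so the parser below can be used without OCR dependencies.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang=lang)


def parse_county_lines(ocr_text):
    """Extract {county name: case count} from OCR text, skipping other lines."""
    counts = {}
    for raw_line in ocr_text.splitlines():
        match = LINE_RE.match(raw_line.strip())
        if match:
            counts[match.group("name")] = int(match.group("count"))
    return counts
```

Usage would be along the lines of `parse_county_lines(ocr_image("counties.png"))`, where `counties.png` is a placeholder for a downloaded copy of the image. OCR misreads are silent, so any real scraper would probably want a sanity check on the result, such as comparing the county sum against the published national total.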