covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License
363 stars 179 forks

Harris county tx #1036

Closed sglyon closed 4 years ago

sglyon commented 4 years ago

Summary

Adds hospital and ICU bed usage for Harris County, TX.

This probably needs a bit more work, but I wanted to touch base with the maintainers to see if I'm headed in the right direction.

What it does:

Things I'm a little uncertain on are:

Thanks!

jzohrab commented 4 years ago

Hi @sglyon - thanks for the PR! Have been busy over here so haven't had time to get to it.

The callback stuff looks to be mostly ok. We're moving to a new repo (Li) in this same org, which has a completely different crawl and scrape model.

Questions and Answers:

Is adding this additional exported function to the fetchlib ok?

It should be, but the fetch library rarely changes, so I'll need to check!

How could/should we handle cache for this scraper? I'm not sure on the details of how caching is handled here. For now I am completely ignoring it.

Yes, good question. Ultimately your page does const data = await page.waitForXPath("//div[@aria-label='Grid']").then(getDataFromPivotTable);, but this is totally different from our usual methods. Tough stuff!
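One way to picture the caching question is a wrapper that only runs the browser callback on a cache miss. This is a minimal sketch, not how this project's cache works: the in-memory Map, the `fetchWithCache` and `cacheKey` names, and the URL-plus-date key are all assumptions made for illustration.

```javascript
// Hypothetical in-memory cache. The project's real cache is not shown in
// this thread, so this Map is only a stand-in for illustration.
const cache = new Map();

// Build a cache key from the scrape date and the URL (an assumed scheme).
function cacheKey(url, date) {
  return `${date}|${url}`;
}

// Run `extract` (for example, a headless-browser callback that reads the
// pivot table) only when no result is cached for this URL and date.
async function fetchWithCache(url, date, extract) {
  const key = cacheKey(url, date);
  if (cache.has(key)) {
    return cache.get(key);
  }
  const data = await extract(url);
  cache.set(key, data);
  return data;
}
```

With a wrapper like this, a second call for the same URL and date returns the stored result without opening the page again.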

I only return those two current hospital bed usage data points; I don't have more fundamental results like cases, deaths, reported, etc. Those come from the aggregate-level TX scraper. Is there a strategy for merging multiple scraper outputs so we have all the info?

Normally, we actually fetch things within one scraper and combine them in there. Ideally, the data would be normalized when inserted into the db or document, and then some other process would join them ... but that's not how it's done (yet)!
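As a rough illustration of combining results inside one scraper, the sketch below merges records from two sources by county. The `mergeByCounty` name, the `county` join key, and any field names passed in are hypothetical and not taken from this project's code.

```javascript
// Merge records from two sources that describe the same locations.
// The `county` join key is an assumption made for this sketch.
function mergeByCounty(primary, secondary) {
  const byCounty = new Map();
  for (const record of [...primary, ...secondary]) {
    const existing = byCounty.get(record.county) || {};
    // Later sources add fields, and overwrite any they share with earlier ones.
    byCounty.set(record.county, { ...existing, ...record });
  }
  return [...byCounty.values()];
}
```

Here `primary` would hold the state-level results (cases, deaths) and `secondary` the county-level bed usage, so each county ends up with a single record carrying both sets of fields.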

Great work, this is a non-trivial problem.

jzohrab commented 4 years ago

I opened https://github.com/valorumdata/coronadatascraper/pull/1 into this branch from my repo. It has some substantial revisions, but it respects caching and the other conventions we must follow. Take a look and let me know, thank you!

jzohrab commented 4 years ago

Closing in favor of https://github.com/covidatlas/coronadatascraper/pull/1040, which updates the work you started to follow this project's conventions.

Thank you for the PR! I hope 1040 gets merged soon so we can see how it behaves in prod ... at the moment I'm not sure how it will work.

sglyon commented 4 years ago

Thanks @jzohrab -- hopefully future contributions won't require so much hands-on help from the core team!