CAVaccineInventory / vaccine-feed-ingest

Tools to download and aggregate feeds of vaccination clinic location information in the United States.
MIT License

AL/Jefferson some entries do not have names #717

Closed MoralCode closed 3 years ago

MoralCode commented 3 years ago

The source PDF seems to suggest that each name and link applies to all the addresses listed beneath it.

[Screenshot: Screenshot_20210605_231807]

However, this is not how the PDF is being parsed. Expanding the full sample from #697 shows that there are entries without names, which causes apps such as Velma that display these names to show nothing.

CC @adityasharad

adityasharad commented 3 years ago

Thanks for bringing this up. This is a known limitation of the parser for the AL/Jefferson data (see discussion in https://github.com/CAVaccineInventory/vaccine-feed-ingest/pull/659). Unfortunately, last time I checked, the name headers you see in the Jefferson County PDF are images rather than text, so a text-based parser doesn't actually know what the names are.

I agree this is not ideal. Do you happen to know if the provider name information is available elsewhere, or have a recommendation for the most appropriate output format even if we don't know the names? Would it be better to produce an empty string rather than a missing name?

MoralCode commented 3 years ago

@adityasharad It looks like they have a page of testing sites in ArcGIS: https://data-jeffco-al.opendata.arcgis.com/pages/covid-19-in-jefferson-county-alabama-testing-locations

The information needed to set up scraping for this is available at https://services7.arcgis.com/4RQmZZ0yaZkGR1zy/ArcGIS/rest/services/COVID19_testsites_READ_ONLY/FeatureServer
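As a rough sketch of what scraping that service could start from, the snippet below builds a query URL using the standard ArcGIS REST `query` operation. The layer index `0` and the choice to request all fields are assumptions, not something confirmed for this service, and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Service URL taken from the comment above.
SERVICE_URL = (
    "https://services7.arcgis.com/4RQmZZ0yaZkGR1zy/ArcGIS/rest/services/"
    "COVID19_testsites_READ_ONLY/FeatureServer"
)


def build_query_url(service_url: str, layer: int = 0) -> str:
    """Build a URL that asks a feature layer for all rows and fields as JSON.

    The layer index is an assumption; check the service page for the real one.
    """
    params = {
        "where": "1=1",    # match every feature
        "outFields": "*",  # return all attribute fields
        "f": "json",       # response format
    }
    return f"{service_url}/{layer}/query?{urlencode(params)}"


query_url = build_query_url(SERVICE_URL)
```

The resulting URL can be fetched with any HTTP client. The attribute names in the response would need to be checked before relying on them.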

Since it seems like there are at least a few addresses that match between the PDF and this list, it might be worth grabbing this list of testing sites and using it to look up the name of a location for each address scraped from the PDF.

I don't know if it will cover 100% of the addresses, but it might get closer.
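A minimal sketch of that lookup idea, assuming both sources can be reduced to simple records with `name` and `address` keys (hypothetical field names, and the sample addresses are placeholders rather than real data). Addresses are compared after light normalization, so this only catches near-exact matches.

```python
import re


def normalize_address(address: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", address.lower())
    return " ".join(cleaned.split())


def build_name_lookup(sites):
    """Map normalized address -> name for the sites that have names."""
    return {normalize_address(s["address"]): s["name"] for s in sites}


def fill_missing_names(locations, lookup):
    """Return copies of the locations, filling in names found in the lookup."""
    filled = []
    for loc in locations:
        loc = dict(loc)  # avoid mutating the caller's records
        if not loc.get("name"):
            loc["name"] = lookup.get(normalize_address(loc["address"]))
        filled.append(loc)
    return filled


# Placeholder data standing in for the ArcGIS list and the parsed PDF.
testing_sites = [{"name": "Example Clinic", "address": "123 Example St., Suite 4"}]
pdf_locations = [
    {"name": None, "address": "123 example st suite 4"},
    {"name": None, "address": "999 Unknown Rd"},
]

result = fill_missing_names(pdf_locations, build_name_lookup(testing_sites))
```

Addresses that differ in abbreviations ("St" vs "Street") would still miss, so a real version would likely need a stronger normalizer or fuzzy matching.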

Also, it may be preferable to go to all the websites listed in the PDF and add scrapers for the providers' websites themselves. This might be a better way to cover all the places in the PDF and pick up new ones (while possibly removing the need to scrape the PDF).

Do you know if the PDF changes often enough to warrant scraping it? Or would it be better to have someone from web banking (or some kind of script like visualping.io) keep an eye on it and let someone know if it changes, so that scrapers can be added for the new providers' sites?
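For the "some kind of script" option, one simple approach is to compare a hash of the downloaded PDF against the hash from the previous check. This is only a sketch: the function names are hypothetical, and how the bytes are fetched and where the previous fingerprint is stored are left out.

```python
import hashlib
from typing import Optional


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def has_changed(data: bytes, previous_fingerprint: Optional[str]) -> bool:
    """Report a change if there is no previous fingerprint or the digests differ."""
    if previous_fingerprint is None:
        return True
    return content_fingerprint(data) != previous_fingerprint
```

A byte-level comparison flags any re-export of the PDF, even when the listed sites are identical, so a person would still need to look at what actually changed.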

MoralCode commented 3 years ago

Status of scrapers for location sources in the PDF:

adityasharad commented 3 years ago

> Do you know if the PDF changes often enough to warrant scraping it?

Most recent updates were: 1 June, 25 May, 9 April. From a quick glance the site info within doesn't change very much.

> Also, it may be preferable to go to all the websites listed in the PDF and add scrapers for the providers' websites themselves.

Good point. I observe there's already some overlap with the ArcGIS data you're getting for AL, which has complete information for some of the well-known providers here. So where there is overlap, either relying on the ArcGIS pipeline or consuming from the provider sites directly sounds reasonable. I think that consuming the PDF is still useful for the lesser-known providers, because from some provider websites (e.g. Christ Health Center) it is not always obvious which of their sites are giving vaccines. And some just don't have a website at all, but it might still be useful to provide the phone numbers and addresses?

As a simpler approach, what do you think about hardcoding a few more mappings from URL to provider name in the normalizer? I've already got a few for the well-known providers, and if the Jefferson providers don't change much then this might do the trick.
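To make the hardcoded-mapping idea concrete, here is a minimal sketch. It is not the existing normalizer code: the mapping name, the helper, and the `example.com` entries are all hypothetical placeholders, and keying on the hostname is just one possible choice.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical entries; real ones would come from the links in the PDF.
PROVIDER_NAMES_BY_HOST = {
    "provider-a.example.com": "Example Provider A",
    "provider-b.example.com": "Example Provider B",
}


def provider_name_for_url(url: Optional[str]) -> Optional[str]:
    """Look up a provider name from the hostname of a scraped link."""
    if not url:
        return None
    # Links in the PDF may lack a scheme, so add one before parsing.
    parsed = urlparse(url if "://" in url else f"https://{url}")
    host = parsed.hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return PROVIDER_NAMES_BY_HOST.get(host)
```

Keying on the hostname rather than the full URL means a changed page path would not break the lookup, at the cost of not distinguishing providers that share a domain.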

MoralCode commented 3 years ago

> As a simpler approach, what do you think about hardcoding a few more mappings from URL to provider name in the normalizer?

I'm not sure. I think it's probably not a major problem, since it seems like the lack of names may have already been worked around at a later stage of the pipeline. I'm just a volunteer contributor like yourself, so in all likelihood this issue might be an error on my part and may not be the correct solution to this problem.

eliblock commented 3 years ago

Closing all issues: https://blog.vaccinateca.com/vaccinate-the-states-is-winding-down