biglocalnews / civic-scraper

Tools for downloading agendas, minutes and other documents produced by local government
https://civic-scraper.readthedocs.io

Are there CivicPlus sites that run on non-CivicPlus domains? #82

Open zstumgoren opened 3 years ago

zstumgoren commented 3 years ago

Our list of ~1,500 known CivicPlus sites largely runs on subdomains of civicplus.com.

For example:

https://nm-lascruces.civicplus.com/AgendaCenter/

However, there appears to be at least one (and possibly others) that are only accessible via non-CivicPlus domains (presumably on a domain the government agency set up or manages itself).

Napa County is one known example:

# Broken CivicPlus subdomain
https://napa-county.civicplus.com/AgendaCenter

# Working AgendaCenter location
https://www.countyofnapa.org/AgendaCenter

This issue first cropped up in #63 and affects #80

DiPierro commented 3 years ago

While https://napa-county.civicplus.com/AgendaCenter is not valid, https://ca-napacounty.civicplus.com/AgendaCenter -- which follows the same general formula as other counties' civicplus.com domains -- is live. https://napa-county.civicplus.com/AgendaCenter appears to be a typo.

I've spent about an hour checking whether any other CivicPlus sites with .gov or .org URLs fail to correspond to a URL of the form stateabbreviation-agencyname.civicplus.com/AgendaCenter and have yet to find an example. Here are two pairs of sites that demonstrate this point:

# Valid
https://www.ks25jd.org/agendacenter

# Also valid
https://ks-25thjudicialdistrict.civicplus.com/agendacenter

# Valid
https://www.chickasha.org/AgendaCenter

# But also valid
https://ok-chickasha.civicplus.com/AgendaCenter

However, I can't definitively prove that this is always true. A more comprehensive fix would be to have more robust site detection capability (not to be confused with the method discussed in #69).

At present, our method of identifying Agenda Center sites involves manually searching an online subdomain enumeration tool. We could develop a way to programmatically identify websites built using CivicPlus's Agenda Center product. More generally, in the future, we may want to automatically detect websites built using other meeting software, e.g., Legistar.

The best solution I can think of is to write a script that uses both the Google Custom Search API and subdomain-enumeration libraries. The Google API could be used to retrieve, for example, the first 1,000 or so results for the searches site:.gov/AgendaCenter, site:.com/AgendaCenter, and site:.org/AgendaCenter. The enumeration libraries would search for all civicplus.com subdomains.
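As a rough illustration of the search half of that idea, here's a minimal sketch against the Google Custom Search JSON API using requests. The API key and search engine ID are placeholders you'd need to supply, and the query strings mirror the `site:` searches proposed above; this is a sketch of the approach, not a finished discovery tool.

```python
"""Sketch: find candidate Agenda Center pages via the Google
Custom Search JSON API. API_KEY and SEARCH_ENGINE_ID (cx) are
placeholders; the queries follow the site: patterns proposed above."""
import requests

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def search_agenda_centers(api_key, cx, query, start=1):
    """Fetch one page (up to 10 results) of Custom Search results."""
    resp = requests.get(
        CSE_ENDPOINT,
        params={"key": api_key, "cx": cx, "q": query, "start": start},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


def extract_links(payload):
    """Pull the result URLs out of a Custom Search API response."""
    return [item["link"] for item in payload.get("items", [])]


# Example queries, one per TLD of interest:
QUERIES = ["site:.gov/AgendaCenter", "site:.com/AgendaCenter", "site:.org/AgendaCenter"]
```

Each call returns at most one page of results, so a real script would page through `start` values and dedupe the collected links before comparing them against the known civicplus.com subdomain list.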

zstumgoren commented 3 years ago

@DiPierro Thanks for digging into this! This sounds like good news -- i.e., it appears we can generally assume that CivicPlus sites have a working subdomain. It may be that our initial site discovery methodology, which you describe, unearthed URLs that are no longer valid, so it may simply be a matter of identifying and updating the canonical URLs for problematic sites in our canonical list of known CivicPlus sites.

That list includes a lot of http URLs rather than https URLs. The former often seem to redirect to the latter, which can significantly slow down or outright break the scraping process. In the few cases I've tested, using the https version of the site seems to fix the slowness/breakage, although the Napa County case is one where I didn't realize the site also had a working, standard URL that follows the expected pattern of https://<place>-<agencyname>.civicplus.com/AgendaCenter (nice find on that!).

I think we can address this as a mixed task -- part coding and part research. We should be able to easily write a script that steps through all URLs and tests http sites for redirects and/or equivalent https URLs. The requests library has support for checking redirect status and seems like the simplest initial approach. That should let us flag for additional research any URLs that do indeed redirect to https or return 404s on https.
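The redirect-checking step could look something like the sketch below, using the requests library's redirect history. The helper names and CSV layout are illustrative, not part of the existing codebase.

```python
"""Sketch: audit http URLs from the canonical site list for redirects
and https equivalents, using requests' redirect history."""
import requests


def to_https(url):
    """Rewrite an http:// URL to its https:// equivalent."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url


def check_url(url, timeout=30):
    """Return (final_url, status_code, redirect_chain) for a URL.

    resp.history holds the intermediate redirect responses, so a
    non-empty chain flags a URL whose canonical entry should change.
    """
    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    chain = [r.url for r in resp.history]
    return resp.url, resp.status_code, chain
```

Looping `check_url` over every URL in the canonical list (and `to_https(url)` for the http entries) would let us write out a CSV of final URLs, status codes, and redirect chains, flagging for manual research anything that redirects or returns a 404 on the https version.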

That process should help us figure out if all CivicPlus sites that we're aware of have standard subdomains on CivicPlus and help us decide what, if any, changes are needed to address the "unique name" issue described in #80.

@DiPierro Do you want to take on that scripting/research as part of the aw-scripts library? Alternatively, we can flag this as a "help wanted" issue to see if we can find volunteers to take a stab.

DiPierro commented 3 years ago

Hi @zstumgoren - would you mind flagging this scripting/research task as "help wanted" for now? I'm not certain how much time I'll have in the coming week or so. The task strikes me as a good fit for other volunteers should they have interest, and I wouldn't want to delay. Thank you.

DiPierro commented 3 years ago

@zstumgoren I've started stepping through our list of CivicPlus domains using a modified version of generate_civicplus_sites.py so that we know we're using a clean list of domains. The script produces a csv that includes these fields:

- status_code
- history
- alias

Can you think of other fields I should be tracking? Should I separately pass each domain into civic-scraper to see if there are any problems?

DiPierro commented 3 years ago

Here's a csv merging the public list of URLs with the status_code, history, and alias fields described above:

https://docs.google.com/spreadsheets/d/19t6vnl514kUyoSHKq3rMVA8y3O_hQ6KXk-HiUBB78xo/edit?usp=sharing