zstumgoren opened this issue 3 years ago
While https://napa-county.civicplus.com/AgendaCenter is not valid, https://ca-napacounty.civicplus.com/AgendaCenter -- which follows the same general formula as other counties with civicplus.com domains -- is live. The https://napa-county.civicplus.com/AgendaCenter URL appears to be a typo.
I've spent about an hour checking whether any other CivicPlus sites with .gov or .org URLs do not correspond to a URL of the form stateabbreviation-agencyname.civicplus.com/AgendaCenter and have yet to find an example. Here are two websites that demonstrate this point:
# Valid
https://www.ks25jd.org/agendacenter
# Also valid
https://ks-25thjudicialdistrict.civicplus.com/agendacenter
# Valid
https://www.chickasha.org/AgendaCenter
# But also valid
https://ok-chickasha.civicplus.com/AgendaCenter
However, I can't definitively prove that this is always true. A more comprehensive fix would be to have more robust site detection capability (not to be confused with the method discussed in #69).
At present, our method of identifying Agenda Center sites involves manually searching an online subdomain enumeration tool. We could develop a way to programmatically identify websites built using CivicPlus's Agenda Center product. More generally, in the future, we may want to automatically detect websites built using other meeting software, e.g., Legistar.
The best solution I can think of is to write a script that uses both the Google Custom Search API and subdomain enumeration libraries. The Google API could be used to retrieve, for example, the first 1,000 or so results for the searches `site:.gov/AgendaCenter`, `site:.com/AgendaCenter`, and `site:.org/AgendaCenter`. The enumeration libraries would simply search for all civicplus.com subdomains.
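As a rough sketch of the search half of that idea, something like the following could page through Custom Search results. The API key and search engine ID are placeholders, and the `search_agenda_center`/`page_starts` names are hypothetical, not existing project code:

```python
import requests

API_KEY = "YOUR_API_KEY"      # placeholder: a Google Custom Search API key
CX = "YOUR_SEARCH_ENGINE_ID"  # placeholder: a Programmable Search Engine ID

def page_starts(pages):
    """1-based start indexes for each page of 10 results."""
    return list(range(1, pages * 10, 10))

def search_agenda_center(query, pages=3):
    """Collect result links for a query such as 'site:.gov/AgendaCenter'."""
    urls = []
    for start in page_starts(pages):
        resp = requests.get(
            "https://www.googleapis.com/customsearch/v1",
            params={"key": API_KEY, "cx": CX, "q": query, "start": start},
            timeout=10,
        )
        resp.raise_for_status()
        items = resp.json().get("items", [])
        urls.extend(item["link"] for item in items)
        if len(items) < 10:  # last page reached
            break
    return urls
```

Note the JSON API returns results 10 at a time, so pulling 1,000 results would mean stepping `start` through many pages (and staying within the API's quota/result limits).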
@DiPierro Thanks for digging into this! This sounds like good news -- i.e., it appears we can generally assume that CivicPlus sites have a working subdomain. It may be that our initial site discovery methodology, which you describe, unearthed URLs that are no longer valid, so it may simply be a matter of identifying and updating the canonical URLs for problematic sites in our canonical list of known CivicPlus sites. That list includes a lot of `http` URLs rather than `https` URLs. The former often seem to redirect to the latter, which can significantly slow down or outright break the scraping process. In the few cases I've tested, using the `https` version of the site seems to fix the slowness/breakage, although the Napa County case is one where I didn't realize the site also had a working, standard URL that follows the expected pattern of https://<place>-<agencyname>.civicplus.com/AgendaCenter (nice find on that!).
I think we can address this as a mixed task -- part coding and part research. We should be able to easily write a script that steps through all URLs and tests `http` sites for redirects and/or equivalent `https` URLs. The `requests` library has support for checking redirect status and seems like the simplest initial approach. That should let us flag for additional research any URLs that do indeed redirect to `https` or return 404s on `https`.
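A minimal sketch of that check, assuming the `requests` library (the `to_https` and `check_redirect` names are mine, not existing project code):

```python
import requests

def to_https(url):
    """Rewrite an http:// URL to https:// (unchanged if already https)."""
    return "https://" + url[len("http://"):] if url.startswith("http://") else url

def check_redirect(url, timeout=10):
    """Fetch a URL and report its final location and redirect chain."""
    resp = requests.get(url, allow_redirects=True, timeout=timeout)
    # Response.history holds the intermediate responses for any redirects
    hops = [(r.status_code, r.url) for r in resp.history]
    return {"final_url": resp.url, "status_code": resp.status_code, "history": hops}
```

A URL that redirects would show a non-empty `history`, flagging it for a closer look; a 404 on the `to_https` variant would flag it for research instead.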
That process should help us figure out if all CivicPlus sites that we're aware of have standard subdomains on CivicPlus and help us decide what, if any, changes are needed to address the "unique name" issue described in #80.
@DiPierro Do you want to take on that scripting/research as part of the aw-scripts library? Alternatively, we can flag this as a "help wanted" issue to see if we can find volunteers to take a stab.
Hi @zstumgoren - would you mind flagging this scripting/research task as "help wanted" for now? I'm not certain how much time I'll have in the coming week or so. The task strikes me as a good fit for other volunteers should they have interest, and I wouldn't want to delay. Thank you.
@zstumgoren I've started stepping through our list of CivicPlus domains using a modified version of generate_civicplus_sites.py so that we know we're working from a clean list of domains. The script produces a csv that includes these fields:

- `status_code`: the status code returned by `requests.get(URL, allow_redirects=True)`
- `history`: the `history` attribute of the response object; blank if no redirects occurred, or a 302 status code if a redirect happened

Can you think of other fields I should be tracking? Should I separately pass each domain into civic-scraper to see if there are any problems?
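For what it's worth, a sketch of how those fields might be written out with the `csv` module (the field names mirror the ones above; `audit_domains` and `format_history` are hypothetical names, not the actual script):

```python
import csv
import requests

def format_history(codes):
    """Join redirect status codes into a single csv cell; empty if no redirects."""
    return ";".join(str(c) for c in codes)

def audit_domains(urls, out_path="civicplus_audit.csv"):
    """Write one csv row per URL with its status_code and redirect history."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "status_code", "history"])
        writer.writeheader()
        for url in urls:
            try:
                resp = requests.get(url, allow_redirects=True, timeout=10)
                row = {
                    "url": url,
                    "status_code": resp.status_code,
                    "history": format_history(r.status_code for r in resp.history),
                }
            except requests.RequestException as exc:
                # Record failures rather than crash mid-run
                row = {"url": url, "status_code": "error", "history": str(exc)}
            writer.writerow(row)
```

Recording errors as rows (rather than raising) keeps a single run from dying partway through a ~1,500-domain list.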
Here's a csv merging the public list of URLs with the status_code, history, and alias fields described above:
https://docs.google.com/spreadsheets/d/19t6vnl514kUyoSHKq3rMVA8y3O_hQ6KXk-HiUBB78xo/edit?usp=sharing
Our list of ~1,500 known CivicPlus sites largely runs on subdomains of civicplus.com.
For example:
However, there appears to be at least one (and possibly others) that are only accessible via non-CivicPlus domains (presumably on a domain the government agency set up or manages itself).
Napa County is one known example:
This issue first cropped up in #63 and affects #80