biglocalnews / civic-scraper

Tools for downloading agendas, minutes and other documents produced by local government
https://civic-scraper.readthedocs.io

Question about plans for data storage + extra event metadata #151

Closed evamaxfield closed 2 years ago

evamaxfield commented 2 years ago

Hello! πŸ‘‹

I am the project lead for Council Data Project (CDP). It seems like civic-scraper would be incredibly useful in our own work, as well as acting as a quick way to spin up new CDP instances.

Currently we have instances for a handful of municipalities, including Seattle and King County.

Seattle and King County both operate against Legistar, which I see you have a scraper for. As we plan on scaling CDP, it would be great to adopt shared scraping functions (unless, of course, we have two different end goals).

So my questions:

  1. How much data do you plan for these scrapers to retrieve? Specifically, CDP scrapers require, at a bare minimum, a video_uri, a session_datetime, and a body name for storage (a minimal sketch of that payload is shown after this list), but they optionally collect voting information, legislative items, person (councilmember, etc.) information, legislative sponsorship history, elected seat and committee role info, etc. Is that in scope for civic-scraper?
  2. Is there currently a plan for the storage of data collected by civic scraper? Our ingestion / scraper model is different from our data storage model and I am curious if you all have made a plan for that.
  3. Do your scrapers cover only past / archived meetings, or do they also include planned meetings?
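
For reference, that bare-minimum payload looks roughly like this. This is only a sketch assuming cdp-backend's ingestion models keep this shape; the body name, datetime, and video URI are placeholders, not real data:

```python
from datetime import datetime

from cdp_backend.pipeline.ingestion_models import Body, EventIngestionModel, Session

# Bare-minimum event: one body name plus one session with a datetime and video URI.
# The values here are placeholders.
minimal_event = EventIngestionModel(
    body=Body(name="City Council"),
    sessions=[
        Session(
            session_datetime=datetime(2021, 6, 16, 14, 0),
            video_uri="https://example.com/city-council-2021-06-16.mp4",
            session_index=0,
        )
    ],
)
```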

Really exciting project and I am willing to bet we will have lots of places for collaboration.

zstumgoren commented 2 years ago

Great to meet you @JacksonMaxfield and thanks for reaching out! Definitely seems like there's potential for collaboration between our projects.

At the moment, our scrapers primarily focus on supporting the ability to gather metadata and documents about local government meetings from Legistar, CivicPlus, and a few other platforms used by agencies. Specifically, we try to capture metadata such as the:

* name of the entity (e.g. City Council, Planning and Zoning Commission, etc)

* the time of the meeting

* documents associated with meetings such as agendas, minutes, agenda packets

And of course, the framework provides the ability to download those document assets.
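
Roughly, working with one of the existing scrapers (CivicPlus here) looks something like the sketch below; the URL, date range, and output paths are just examples:

```python
from civic_scraper.platforms import CivicPlusSite

# Point the scraper at an agency's CivicPlus Agenda Center (example URL).
site = CivicPlusSite("http://nc-nashcounty.civicplus.com/AgendaCenter")

# Gather asset metadata for a date range, then export it and download the files
# to local directories (paths here are just examples).
assets_metadata = site.scrape(start_date="2020-05-03", end_date="2020-05-06")
assets_metadata.to_csv("/tmp/civic-scraper")
assets_metadata.download("/tmp/civic-scraper/assets")
```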

We've discussed the idea of gathering meeting recordings (e.g. video or audio) where available as well, and wherever any of our scrapers don't currently support gathering those resources we'd definitely be open to updates that add that functionality. We don't have any immediate plans to scrape structured vote information from minutes or capture that structured data when made readily available by sites such as Legistar. But I believe DataMade (the consultant helping us build out the civic-scraper framework) and the wider OpenCivicData initiative have created a scraping framework that can get that information where available from Legistar sites such as NYC.

With respect to data storage, scaling out the scraping operation and providing a human-friendly way to access the data and docs, we have a few big picture plans:

  1. We're planning to use a Python library and cloud platform called Prefect, in tandem with Kubernetes (GKE), to scale the scraping operation for agencies covered by our existing and future scrapers. We've prototyped that system, and my colleague @palewire has even done a video presentation and a few blog posts detailing some of the technical aspects (a rough sketch of this kind of flow appears after this list). I'm now part-way through our first ETL scraper to begin gathering metadata and docs, starting with the CivicPlus platform. Next up will be Legistar, Granicus, CivicClerk and PrimeGov. We're hoping to have the CivicPlus scraper running within the next few weeks, focusing on Atlanta, Philly/PA and the SF Bay Area as the initial coverage areas where we have existing news partnerships. Once CivicPlus is operational, we'll rinse and repeat that process with the other scrapers we have so far.
  2. We're about to begin work on a new web platform that provides reporters, academics and others with the ability to easily search the documents we've gathered and to subscribe to alerts when keyword searches on topics of interest turn up matches in new documents. It's similar in concept to DocumentCloud (where we're also planning to store documents, since that tool is familiar to the journalism community), but with a bigger focus on automated alerts, so that journalists can discover newsworthy items when new files appear. To that end, we are indeed planning to harvest agendas for future meetings. We're initially expecting to check daily for new documents for meetings taking place in the upcoming 2 months or more, though we expect that time window will vary by site and by how diligent agencies are about posting documents for future meetings. Our hope is to release the new platform by the end of summer.
  3. We plan to grant "raw" access to the corpus of documents we gather to data journalists, researchers, and other folks who want to perform entity extraction or other computational projects. We haven't thought through the precise data storage architecture and mechanism(s) for how we'll provide that access quite yet, but it's definitely on the menu.
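
Here's that rough sketch for item 1, using Prefect's 1.x task/flow API on top of the CivicPlus scraper. The site list, date range, and output path are placeholders, not the production configuration:

```python
from prefect import Flow, task, unmapped

from civic_scraper.platforms import CivicPlusSite

# Placeholder list of agency sites; a real deployment would load these from a registry.
SITES = [
    "http://nc-nashcounty.civicplus.com/AgendaCenter",
]

@task
def scrape_site(url: str, start_date: str, end_date: str):
    """Scrape one CivicPlus site and download its document assets."""
    site = CivicPlusSite(url)
    assets = site.scrape(start_date=start_date, end_date=end_date)
    assets.download("/tmp/civic-scraper/assets")
    return assets

with Flow("civicplus-etl") as flow:
    # Mapping over the site list lets Prefect fan the scrapes out in parallel,
    # e.g. across Kubernetes (GKE) workers.
    scrape_site.map(SITES, unmapped("2021-06-01"), unmapped("2021-06-30"))

if __name__ == "__main__":
    flow.run()
```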

That's the big picture status of our current efforts. We're definitely open to collaborating on the open source civic-scraper framework. We're also expecting to convene a group of stakeholders to advise on designs for the new web platform, so you'd be welcome to participate in that as well once we ramp up on that process.

Let us know if you'd like to hop on a video chat in the next few weeks and we can discuss the roadmap and possible collaboration areas in more detail (feel free to DM me on Twitter). Meantime, thanks again for reaching out!

zstumgoren commented 2 years ago

@JacksonMaxfield Sorry, I should also mention that we haven't fully sorted through the precise data storage architecture for the ETL processes and related web platform. While the plans aren't finalized, we're expecting to build a system backed by Django, Postgres and Elasticsearch for the web platform. That work has not yet started so we don't have schemas or code to share quite yet. On the ETL front, we'll use GCP Cloud Storage to store raw file assets and likely a NoSQL solution to store document metadata given the variation in available data across agencies. We'll then pump some standardized subset of that metadata and the related file assets into the system backing our web platform. We'd definitely love to hear any advice you have on those fronts. Hit me up on Twitter and we can find a time to discuss in more detail. Thanks!
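
As a very rough sketch of that ETL storage idea (not a finalized design, and the bucket and collection names below are made up): raw files land in Cloud Storage and their metadata lands in a NoSQL store such as Firestore.

```python
from google.cloud import firestore, storage

def store_asset(local_path: str, metadata: dict) -> None:
    """Upload one scraped file to Cloud Storage and record its metadata in Firestore.

    The bucket and collection names are hypothetical placeholders.
    """
    storage_client = storage.Client()
    bucket = storage_client.bucket("civic-scraper-raw-assets")
    blob = bucket.blob(metadata["asset_name"])
    blob.upload_from_filename(local_path)

    metadata["gcs_uri"] = f"gs://{bucket.name}/{blob.name}"
    firestore.Client().collection("asset_metadata").document(metadata["asset_name"]).set(metadata)
```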

evamaxfield commented 2 years ago

Thanks for the quick response @zstumgoren!

> At the moment, our scrapers primarily focus on supporting the ability to gather metadata and documents about local government meetings from Legistar, CivicPlus, and a few other platforms used by agencies. Specifically, we try to capture metadata such as the:
>
> * name of the entity (e.g. City Council, Planning and Zoning Commission, etc)
>
> * the time of the meeting
>
> * documents associated with meetings such as agendas, minutes, agenda packets
>
> And of course, the framework provides the ability to download those document assets.
>
> We've discussed the idea of gathering meeting recordings (e.g. video or audio) where available as well, and wherever any of our scrapers don't currently support gathering those resources we'd definitely be open to updates that add that functionality. ...

Makes sense! Sounds like maybe we can use civic-scraper to get an "initial scraper" up and running for new CDP instances, and then, if the maintainer wants to add the vote information later, they can do so as well (cc @dphoria -- keep in memory as "try except scraper generation for user in auto-deployment πŸ˜‚").

**Data storage and scaling**

We use Prefect as well, but we overload GitHub Actions to act as our processor, since our deployment model is a bit different. The only time I run the pipelines locally / with larger date ranges is when I am backfilling data for large time periods. Even then, the GitHub Actions UI makes for a pleasant backfill experience too. It really just gives us parallel processing and DAG documentation 🀷.
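
For backfills, the general shape is just splitting the requested date range into small windows and running the pipeline once per window. A minimal sketch of that idea is below; run_pipeline is a stand-in, not a real CDP function:

```python
from datetime import date, timedelta

def run_pipeline(from_dt: date, to_dt: date) -> None:
    """Stand-in for an actual event-gather pipeline invocation."""
    print(f"Gathering events from {from_dt} to {to_dt}")

def date_windows(start: date, end: date, days: int = 7):
    """Yield (start, end) windows so a long backfill runs as many small jobs."""
    current = start
    while current < end:
        window_end = min(current + timedelta(days=days), end)
        yield current, window_end
        current = window_end

# A workflow_dispatch run (or a local backfill) would pass the full range,
# and each window becomes one pipeline run that can execute in parallel.
for window_start, window_end in date_windows(date(2021, 1, 1), date(2021, 6, 30)):
    run_pipeline(from_dt=window_start, to_dt=window_end)
```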

> We're about to begin work on a new web platform that provides reporters, academics and others with the ability to easily search the documents we've gathered and to subscribe to alerts when keyword searches on topics of interest turn up matches in new documents. ...

Search is an interesting one. Doing it cheaply with very large meeting transcripts is hard (Elasticsearch and others have limits on how large a single document is allowed to be, to my knowledge anyway). We have a system in place that works for now, but it's something I keep having to patch every once in a while. We are also starting on a notifications feature right now!
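
A sketch of the general chunk-then-index idea (the index name and segment size here are arbitrary, and this isn't our exact code):

```python
from elasticsearch import Elasticsearch, helpers

def index_transcript(
    es: Elasticsearch, meeting_id: str, transcript: str, chunk_chars: int = 10_000
) -> None:
    """Split one long transcript into segments so no single document hits size limits."""
    segments = [
        transcript[i : i + chunk_chars]
        for i in range(0, len(transcript), chunk_chars)
    ]
    actions = (
        {
            "_index": "transcript-segments",  # hypothetical index name
            "_id": f"{meeting_id}-{n}",
            "_source": {"meeting_id": meeting_id, "segment": n, "text": segment},
        }
        for n, segment in enumerate(segments)
    )
    helpers.bulk(es, actions)
```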

> We plan to grant "raw" access to the corpus of documents we gather to data journalists, researchers, and other folks who want to perform entity extraction or other computational projects. We haven't thought through the precise data storage architecture and mechanism(s) for how we'll provide that access quite yet, but it's definitely on the menu.

cdp-data is our initial attempt at this sort of thing (a higher-level API than direct database access, anyway). I wanted to see what your plans were for access and analysis.


Really sounds like we should probably schedule a meeting to chat. I'll reach out via Twitter haha

evamaxfield commented 2 years ago

Going to close this! It was nice chatting with you!! Similar to the "random ideas" doc you shared, we are compiling a list of "known municipal meeting datasets / companies." We're planning on using it to potentially get a grant for us all to combine our datasets somehow πŸ€·β€β™€οΈ

Any updates on your side? Any massive pipeline deployments + website to explore?

Regardless of the above, do you have an idea of how many municipalities you will be targeting initially?