biglocalnews / civic-scraper

Tools for downloading agendas, minutes and other documents produced by local government
https://civic-scraper.readthedocs.io

Create Legistar scraper #9

Closed zstumgoren closed 2 years ago

zstumgoren commented 4 years ago

Create a LegistarSite scraper class that generates a CSV of values for each site. See description in ticket #8 for example format of expected output.

If possible, this scraper should use requests and bs4 instead of Selenium
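For illustration, a rough requests + bs4 starting point might look like the sketch below. The example URL and the table class name are assumptions based on how Legistar calendar pages are typically rendered, not something tested against a specific site:

```python
# Sketch only: fetch a Legistar calendar page and dump the rows of the
# meetings table. The URL and the "rgMasterTable" class are assumptions
# about typical Legistar markup and may need adjusting per site.
import requests
from bs4 import BeautifulSoup

url = "https://sanjose.legistar.com/Calendar.aspx"  # example/assumed site URL
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", class_="rgMasterTable")  # assumed Telerik grid class
rows = table.find_all("tr") if table else []
for row in rows:
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        print(cells)
```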

chris-stock commented 4 years ago

We should distinguish what we mean by scraping here (i.e. what the scope of the standalone package will be). I propose: scraping means a function that takes the Legistar url and returns a data frame of all the meetings listed at that url, and any links to documents in those meetings. But downloading those documents would be outside the scope of scraping.

I guess this entails specifying a signature that we want our scraping packages to adhere to, and possibly a standardized output data type, such as an abstraction for a "scraped site" or a "meeting entry" that any scraped website could be converted to. Then downstream processing can be agnostic of the format of the website.
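As a purely hypothetical sketch of what that could look like:

```python
# Hypothetical shape for the standardized output discussed above: a scrape()
# function that takes a site URL and returns "meeting entry" records, each
# carrying links to its documents. Names and fields here are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MeetingEntry:
    body: str                 # e.g. "City Council"
    date: str                 # meeting date as displayed on the site
    agenda_url: str = ""      # link to the agenda document, if any
    minutes_url: str = ""     # link to the minutes document, if any
    other_links: List[str] = field(default_factory=list)

def scrape(site_url: str) -> List[MeetingEntry]:
    """Return all meetings listed at site_url; downloading them is out of scope."""
    raise NotImplementedError
```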

DiPierro commented 4 years ago

Yes, that division of functions seems reasonable to me.

What should the download package be? downloader.py? And what would it do besides downloading? (For example, would it figure out whether a file has already been downloaded and not download it twice if so?)


chris-stock commented 4 years ago

Right now, downloading is handled in document_manager.py, which indeed checks whether the file already exists in S3 before adding it. The same file also handles adding documents to our DynamoDB table. The reason to keep that logic separate from the scraper is that we have our own backend for interfacing with S3 and DynamoDB, but other users of the scraper might want to do something else with the website data, like save the files locally to their computer. Issue biglocalnews/agendawatch#7 is a good place to talk about design and development of the downloading code.
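For anyone following along, the existence check is conceptually just a HEAD request against S3 before uploading. A stripped-down illustration (the bucket and key naming here are hypothetical, not what document_manager.py actually uses):

```python
# Illustration of the "skip if already stored" check described above.
# Bucket name and key layout are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def already_stored(bucket: str, key: str) -> bool:
    """Return True if an object with this key already exists in the bucket."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise
```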

zstumgoren commented 4 years ago

I generally think of scraping as the process of acquiring one or more file artifacts, at minimum, and storing them in a standard location with a standardized name. That definition of "scraping" can be (and often is) expanded to also include extracting information from a file and transforming it (for example, OCR'ing a scanned image or converting a CSV's name and headers to standard formats).

Part of the goal for Big Local is to create generally useful open source code for journalists and the wider public. So in terms of writing scrapers, it would be great to build generic Python packages that can be used by anyone to acquire documents from Legistar, CivicPlus and any other platforms we discover along the way. This would likely entail creating a CLI-based tool that can be installed by standard packaging tools (pip, pipenv, etc.), while providing basic scraper classes that others can use in their own Python-based document pipelines.
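To make the shape of that more concrete, here is a very rough sketch of a CLI entry point wrapping a reusable scraper base class; all names are placeholders, not a proposed API:

```python
# Rough sketch of a pip-installable CLI wrapping reusable scraper classes.
# Class and function names are placeholders, not the package's actual API.
import argparse

class SiteScraper:
    """Base class that platform-specific scrapers (Legistar, CivicPlus, ...) subclass."""
    def __init__(self, url: str):
        self.url = url

    def scrape(self):
        raise NotImplementedError

class LegistarSite(SiteScraper):
    def scrape(self):
        # Placeholder: a real implementation would parse the Legistar calendar.
        return []

def main():
    parser = argparse.ArgumentParser(description="Scrape a civic meetings site")
    parser.add_argument("url", help="URL of the site to scrape")
    args = parser.parse_args()
    # A real package would pick the right scraper class based on the URL/platform.
    for row in LegistarSite(args.url).scrape():
        print(row)

if __name__ == "__main__":
    main()
```

Such a module could then be exposed via a console_scripts entry point so it installs cleanly with pip.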

Our own data gathering operation would run these scrapers, of course, but the scrapers themselves would not include code highly specific to our own project. This way, we can scratch our own itch while building something useful for others to use and potentially contribute to.

chris-stock commented 4 years ago

I made an issue to discuss the interface/specs of a scraper package: https://github.com/biglocalnews/legistar-scraper/issues/2

Let's continue the discussion about specs there.

zstumgoren commented 4 years ago

@chris-stock @DiPierro Heads up that I transferred this ticket and revamped the title and description to reflect our recent conversations about expected output (per #8).

I've been experimenting with a non-Selenium approach and it's looking like it may be doable. I'm not all the way there yet, but I can post partially finished code to provide a sense of how it might be done.

zstumgoren commented 2 years ago

Legistar added by DataMade