Closed chris-stock closed 4 years ago
Some questions before we do this:
Does it make sense to scrap this scraper and write a new one in Python? I'm wondering whether it's possible to import a package written in R into a Python script, and, if so, whether it's wise.
If we deem it okay to continue using R, I can separate the scraping functionality from the downloading functionality so that they are separate packages. If not, I can write two new packages for scraping and downloading in Python. Sound good?
On the first question, I'd be curious to get @zstumgoren's input. If it's easy and sufficient to use R, there will be ways to make that work. But we should be clear on the functionality we demand of the scraper. Pulling links off the first page is different from pulling all links available, which is what the Legistar scraper does. There are edge cases (if we haven't run the scraper in a while, or if many documents are posted in one day) where pulling links from only the first page may cause us to miss documents we want. And pulling from the first page alone isn't enough if we want to build out a historical repository of agendas and minutes. With these considerations in mind, it might be better to build a scraper that pulls all historical documents, which may affect the choice of programming language depending on the availability of packages.
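To make that difference concrete, a backfill-capable scraper typically loops over result pages until none remain, rather than stopping at page one. A minimal sketch in Python, where `fetch_page` is a hypothetical stub standing in for real page requests (nothing here is taken from the Legistar scraper):

```python
def fetch_page(page_number):
    """Hypothetical fetcher: returns the document links on one results
    page, or an empty list once we've paged past the end of the site."""
    # Stub data simulating a three-page site.
    pages = {1: ["doc_a", "doc_b"], 2: ["doc_c"], 3: ["doc_d"]}
    return pages.get(page_number, [])


def scrape_all_links():
    """Collect links from every page, not just the first, so a long gap
    between runs (or a burst of postings) doesn't drop documents."""
    links = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        links.extend(batch)
        page += 1
    return links
```

The same loop can be bounded by a date range instead of exhausting every page, which is the backfill behavior discussed below.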
On the second question, I wouldn't worry about downloading, since we have code to download the documents, add them to our database, extract text, and vectorize. What's more important is that we collectively agree on a standardized format for scraper results that can then be fed to our downloading code.
For your reference, here's an example output from the Legistar scraper. This is fed directly (as a Pandas dataframe) to our downloader code.
```
,city,date,committee,doc_format,url,doc_type
0,Hayward,2019-12-16,Hayward Youth Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=750204&GUID=F1EFFE96-1D3B-4CF5-A4B6-046D41BC402A,Agenda
1,Hayward,2019-12-12,Planning Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=729839&GUID=315AE98C-D729-4CCE-B4AB-0680E98AE87C,Agenda
2,Hayward,2019-12-12,Personnel Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=714329&GUID=9FFE26A5-7AFB-43FB-90CE-61C5BA123DE9,Agenda
3,Hayward,2019-12-09,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=734306&GUID=854CF140-88DD-4163-B107-D06F449EAAE0,Agenda
4,Hayward,2019-12-05,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=682107&GUID=BE980F9A-0F52-4883-A139-A2A5DAFF7108,Agenda
5,Hayward,2019-12-05,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743352&GUID=C676DE80-DF5E-4C8B-A385-DD4BFA4F0DEE,Agenda
6,Hayward,2019-12-04,Council Budget and Finance Committee,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=718571&GUID=53E8F44A-4144-401A-8469-61870F530B91,Agenda
7,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=736075&GUID=DD500914-EA51-4C20-AFE7-DCDAB8947DEA,Agenda
8,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743832&GUID=DC31DF76-0219-498D-8512-24CCA95C76C4,Agenda
```
To be clear, I'm not saying that the format above is the right one to adopt going forward - but we do need a standardized interface between our various scrapers and the downloader, and this is what we currently use.
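As a sanity check on that interface, here's a minimal stdlib-only sketch that parses a row in the format above and verifies the expected columns are present (the downloader itself consumes this as a pandas DataFrame; the inline sample is trimmed from the table above):

```python
import csv
import io

# Columns from the sample scraper output above.
EXPECTED_COLUMNS = ["city", "date", "committee", "doc_format", "url", "doc_type"]

# Inline sample standing in for a real scraper output file; the leading
# empty field is the unnamed row-index column pandas writes by default.
sample_csv = (
    ",city,date,committee,doc_format,url,doc_type\n"
    "0,Hayward,2019-12-16,Hayward Youth Commission,pdf,"
    "https://hayward.legistar.com/View.ashx?M=A&ID=750204"
    "&GUID=F1EFFE96-1D3B-4CF5-A4B6-046D41BC402A,Agenda\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

# Skip the unnamed index column when comparing against the schema.
assert reader.fieldnames[1:] == EXPECTED_COLUMNS
```

A check like this could live at the boundary between any scraper and the downloader, so a scraper that drifts from the agreed schema fails loudly instead of feeding bad rows downstream.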
@DiPierro @chris-stock Along the lines of what Chris was saying, I think we will need to write the scraper in a way that makes it flexible enough to scrape date ranges, backfill when necessary, etc. I think we can sit down and make a plan for this together. Porting the code to Python shouldn't be a big lift, and will set us up nicely for contributions from the rest of the BigLocal team, which is more widely comfortable with Python, so it would enable us to better support the code over the long term.
+1 to the idea of creating a data format that our scrapers can produce. That sets a clear notion of what they should be outputting. We may also want to discuss what types of functionality our scrapers should support, along the lines of what @chris-stock mentioned (ability to backfill, etc.), though we can treat that as a separate question/ticket.
I made an issue to discuss the interface/specs of a scraper package: biglocalnews/legistar-scraper#2
Let's continue the discussion about specs there.
Quick update here. I've written a lambda function to call the scraper in the following way. In particular, the `CivicScraper` class has a method `scrape_to_csv` which writes the metadata from the site to a local .csv file. The .csv should look something like the format I pasted above.
Let me know how this needs to be tweaked, if at all. (Note, I haven't included the full lambda function below, just the part that calls the scraper.)
```python
from datetime import datetime


class CivicScraper(object):

    def __init__(self, *args, **kwargs):
        pass

    def scrape_to_csv(self, output_path):
        """
        output_path is the local path to write the document list to as .csv
        returns 'success' if all goes well, otherwise some form of error message
        """
        return 'success'


def download_document_list(site_id, endpoint, scraper_type, **scraper_args):
    """
    site_id:str is the unique identifier of the site to scrape
    endpoint:str is the URL to point the scraper to
    scraper_type:str is the type of scraper to call ('legistar' or 'civicplus')
    **scraper_args may contain additional filters such as start_date, end_date, etc.
    """
    # make result dict
    results = {'timestamp': datetime.utcnow().isoformat()}

    # initialize scraper
    try:
        scraper = CivicScraper(endpoint, scraper_type, **scraper_args)
    except Exception as e:
        results['status'] = 'err_could_not_initialize_scraper: {}'.format(e)
        return results

    # download document list to a local path derived from the site id
    local_path = '/tmp/{}.csv'.format(site_id)
    try:
        download_status = scraper.scrape_to_csv(local_path)
    except Exception as e:
        results['status'] = 'err_could_not_invoke_scraper: {}'.format(e)
        return results

    # report any errors that occurred during scraping
    if download_status != 'success':
        results['status'] = 'err_during_scraping: {}'.format(download_status)
        return results

    results['status'] = 'success'
    return results
```
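The catch-and-return-status pattern in that function can be exercised on its own. `FailingScraper` below is a hypothetical stand-in that always raises, purely to show how an exception becomes a prefixed status string in the result dict:

```python
from datetime import datetime


class FailingScraper:
    """Hypothetical scraper that always raises, to exercise the error path."""
    def __init__(self, endpoint):
        raise ValueError("bad endpoint: {}".format(endpoint))


def init_scraper(endpoint):
    """Mirror the lambda's pattern: timestamped result dict, try/except,
    and a status string prefixed with an error code on failure."""
    results = {"timestamp": datetime.utcnow().isoformat()}
    try:
        FailingScraper(endpoint)
        results["status"] = "success"
    except Exception as e:
        results["status"] = "err_could_not_initialize_scraper: {}".format(e)
    return results


out = init_scraper("http://example.invalid")
```

One design note: because the lambda reports failures through the returned status string rather than raising, whatever invokes it needs to check the `err_` prefix itself; callers that only check for exceptions would silently miss scraper failures.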
Seems to be redundant with #8
Make the CivicPlus scraper available as an importable package.