Closed chris-stock closed 4 years ago
Some questions before we do this:
Does it make sense to scrap this scraper and write a new one in Python? I'm wondering whether it's possible to import a package written in R into a Python script, and, if so, whether it's wise.
If we deem it okay to continue using R, I can separate the scraping functionality from the downloading functionality so that they are separate packages. If not, I can write two new packages for scraping and downloading in Python. Sound good?
On the first question, I'd be curious to get @zstumgoren's input. If it's easy and sufficient to use R, there will be ways to make that work. But we should be clear on the functionality we demand of the scraper. Pulling links off the first page is different from pulling all links available, which is what the Legistar scraper does. There are edge cases (if we haven't run the scraper in a while, or if many documents are posted in one day) where pulling links from only the first page may cause us to miss documents we want. And pulling from the first page alone isn't enough if we want to build out a historical repository of agendas and minutes. With these considerations in mind, it might be better to build a scraper that pulls all historical documents, which may affect the choice of programming language depending on the availability of packages.
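To make that difference concrete, a backfill-capable scraper typically loops over result pages until none remain, rather than stopping at page one. A minimal sketch in Python, where `fetch_page` is a hypothetical stub standing in for real page requests (nothing here is taken from the Legistar scraper):

```python
def fetch_page(page_number):
    """Hypothetical fetcher: returns the document links on one results
    page, or an empty list once we've paged past the end of the site."""
    # Stub data simulating a three-page site.
    pages = {1: ["doc_a", "doc_b"], 2: ["doc_c"], 3: ["doc_d"]}
    return pages.get(page_number, [])


def scrape_all_links():
    """Collect links from every page, not just the first, so a long gap
    between runs (or a burst of postings) doesn't drop documents."""
    links = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        links.extend(batch)
        page += 1
    return links
```

The same loop can be bounded by a date range instead of exhausting every page, which is the backfill behavior discussed below.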
On the second question, I wouldn't worry about downloading, since we have code to download the documents, add them to our database, extract text, and vectorize. What's more important is that we collectively agree on a standardized format for scraper results that can then be fed to our downloading code.
For your reference, here's an example output from the Legistar scraper. This is fed directly (as a Pandas dataframe) to our downloader code.
```
,city,date,committee,doc_format,url,doc_type
0,Hayward,2019-12-16,Hayward Youth Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=750204&GUID=F1EFFE96-1D3B-4CF5-A4B6-046D41BC402A,Agenda
1,Hayward,2019-12-12,Planning Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=729839&GUID=315AE98C-D729-4CCE-B4AB-0680E98AE87C,Agenda
2,Hayward,2019-12-12,Personnel Commission,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=714329&GUID=9FFE26A5-7AFB-43FB-90CE-61C5BA123DE9,Agenda
3,Hayward,2019-12-09,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=734306&GUID=854CF140-88DD-4163-B107-D06F449EAAE0,Agenda
4,Hayward,2019-12-05,Homelessness-Housing Task Force,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=682107&GUID=BE980F9A-0F52-4883-A139-A2A5DAFF7108,Agenda
5,Hayward,2019-12-05,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743352&GUID=C676DE80-DF5E-4C8B-A385-DD4BFA4F0DEE,Agenda
6,Hayward,2019-12-04,Council Budget and Finance Committee,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=718571&GUID=53E8F44A-4144-401A-8469-61870F530B91,Agenda
7,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=736075&GUID=DD500914-EA51-4C20-AFE7-DCDAB8947DEA,Agenda
8,Hayward,2019-12-03,City Council,pdf,https://hayward.legistar.com/View.ashx?M=A&ID=743832&GUID=DC31DF76-0219-498D-8512-24CCA95C76C4,Agenda
```
To be clear, I'm not saying that the format above is the right one to adopt going forward - but we do need a standardized interface between our various scrapers and the downloader, and this is what we currently use.
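As a sanity check on that interface, here's a minimal stdlib-only sketch that parses a row in the format above and verifies the expected columns are present (the downloader itself consumes this as a pandas DataFrame; the inline sample is trimmed from the table above):

```python
import csv
import io

# Columns from the sample scraper output above.
EXPECTED_COLUMNS = ["city", "date", "committee", "doc_format", "url", "doc_type"]

# Inline sample standing in for a real scraper output file; the leading
# empty field is the unnamed row-index column pandas writes by default.
sample_csv = (
    ",city,date,committee,doc_format,url,doc_type\n"
    "0,Hayward,2019-12-16,Hayward Youth Commission,pdf,"
    "https://hayward.legistar.com/View.ashx?M=A&ID=750204"
    "&GUID=F1EFFE96-1D3B-4CF5-A4B6-046D41BC402A,Agenda\n"
)

reader = csv.DictReader(io.StringIO(sample_csv))
rows = list(reader)

# Skip the unnamed index column when comparing against the schema.
assert reader.fieldnames[1:] == EXPECTED_COLUMNS
```

A check like this could live at the boundary between any scraper and the downloader, so a scraper that drifts from the agreed schema fails loudly instead of feeding bad rows downstream.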
@DiPierro @chris-stock Along the lines of what Chris was saying, I think we will need to write the scraper in a way that makes it flexible enough to scrape date ranges, backfill when necessary, etc. I think we can sit down and make a plan for this together. Porting the code to Python shouldn't be a big lift, and will set us up nicely for contributions from the rest of the BigLocal team, which is more widely comfortable with Python, so it would enable us to better support the code over the long term.
+1 to the idea of creating a data format that our scrapers can produce. That sets a clear notion of what they should be outputting. We may also want to discuss what types of functionality our scrapers should support, along the lines of what @chris-stock mentioned (ability to backfill, etc.), though we can treat that as a separate question/ticket.
I made an issue to discuss the interface/specs of a scraper package: biglocalnews/legistar-scraper#2
Let's continue the discussion about specs there.
Quick update here. I've written a lambda function to call the scraper in the following way. In particular, the `CivicScraper` class has a method `scrape_to_csv` which writes the metadata from the site to a local .csv file. The .csv should look something like the format I pasted above.
Let me know how this needs to be tweaked, if at all. (Note, I haven't included the full lambda function below, just the part that calls the scraper.)
```python
from datetime import datetime


class CivicScraper(object):

    def __init__(self, *args, **kwargs):
        pass

    def scrape_to_csv(self, output_path):
        """
        output_path is the local path to write the document list to as .csv
        returns 'success' if all goes well, otherwise some form of error message
        """
        return 'success'


def download_document_list(site_id, endpoint, scraper_type, **scraper_args):
    """
    site_id:str is the unique identifier of the site to scrape
    endpoint:str is the URL to point the scraper to
    scraper_type:str is the type of scraper to call ('legistar' or 'civicplus')
    **scraper_args may contain additional filters such as start_date, end_date, etc.
    """
    # make result dict
    results = {'timestamp': datetime.utcnow().isoformat()}

    # initialize scraper
    try:
        scraper = CivicScraper(endpoint, scraper_type, **scraper_args)
    except Exception as e:
        results['status'] = 'err_could_not_initialize_scraper: {}'.format(e)
        return results

    # download document list to a local path derived from the site id
    local_path = '/tmp/{}.csv'.format(site_id)
    try:
        download_status = scraper.scrape_to_csv(local_path)
    except Exception as e:
        results['status'] = 'err_could_not_invoke_scraper: {}'.format(e)
        return results

    # report any errors that occurred during scraping
    if download_status != 'success':
        results['status'] = 'err_during_scraping: {}'.format(download_status)
        return results

    results['status'] = 'success'
    return results
```
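The catch-and-return-status pattern in that function can be exercised on its own. `FailingScraper` below is a hypothetical stand-in that always raises, purely to show how an exception becomes a prefixed status string in the result dict:

```python
from datetime import datetime


class FailingScraper:
    """Hypothetical scraper that always raises, to exercise the error path."""
    def __init__(self, endpoint):
        raise ValueError("bad endpoint: {}".format(endpoint))


def init_scraper(endpoint):
    """Mirror the lambda's pattern: timestamped result dict, try/except,
    and a status string prefixed with an error code on failure."""
    results = {"timestamp": datetime.utcnow().isoformat()}
    try:
        FailingScraper(endpoint)
        results["status"] = "success"
    except Exception as e:
        results["status"] = "err_could_not_initialize_scraper: {}".format(e)
    return results


out = init_scraper("http://example.invalid")
```

One design note: because the lambda reports failures through the returned status string rather than raising, whatever invokes it needs to check the `err_` prefix itself; callers that only check for exceptions would silently miss scraper failures.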
Seems to be redundant with #8
Make the CivicPlus scraper available as an importable package.