chihacknight / chicago-elections-archive

Explore precinct-level results from historical Chicago elections
https://chicagoelectionsarchive.org/
MIT License

update Makefile and scrape_table.py to pull from updated data source #1

Open derekeder opened 2 weeks ago

derekeder commented 2 weeks ago

The current Makefile produces the following error:

Traceback (most recent call last):
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 69, in <module>
    main()
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 56, in main
    candidates = get_candidates(soup)
  File "/home/user/Documents/chicago-elections-archive/scripts/scrape_table.py", line 9, in get_candidates
    headers = [v.get_text().strip() for v in soup.find("table").find_all("b")]
AttributeError: 'NoneType' object has no attribute 'find_all'

Looks like chicagoelections.gov has updated the pages that produce these results. https://chicagoelections.gov/en/election-results-specifics.asp now returns a 404. The new page is https://chicagoelections.gov/elections/results/

yashBhosale commented 2 weeks ago

Some color here: the new flow is that https://chicagoelections.gov/elections/results/ provides a menu of election links. From there you go into an individual election, which is just https://chicagoelections.gov/elections/results/{election_id}, and that gives you a menu where you pick a race/ward/precinct. It then pulls up a table of the election data along with a link to download that data as an Excel file. Incidentally, you can also get the Excel file directly through https://chicagoelections.gov/elections/results/{election_id}?contest={contest_id}&ward={ward_number}&precinct={precinct_number}
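
For illustration, here's a minimal sketch of hitting that last URL with requests. The IDs are placeholders, and I'm assuming the endpoint serves the Excel file directly on a GET (untested):

import requests

# Hypothetical IDs for illustration; real values come from the menus above
election_id = 1234
contest_id = 567
ward_number = 1
precinct_number = 1

url = (
    f"https://chicagoelections.gov/elections/results/{election_id}"
    f"?contest={contest_id}&ward={ward_number}&precinct={precinct_number}"
)

# Save the Excel file the endpoint returns
response = requests.get(url)
with open("results.xlsx", "wb") as f:
    f.write(response.content)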

This puts me between a rock and a hard place: either use Selenium to keep leveraging the existing BeautifulSoup code, or figure out how to fetch and process the Excel file as ergonomically as possible under the circumstances.

derekeder commented 2 weeks ago

@yashBhosale thanks for digging into this. I think if we can get the data in Excel, let's go for it. Even if it requires a significant rewrite, the new code will be much less brittle and likely a lot simpler.

yashBhosale commented 2 weeks ago

Yeah, that's what I was leaning towards as well. A secondary goal (or maybe an umbrella goal) will be to retire the Makefile altogether, since it's only used for building the data.

derekeder commented 2 weeks ago

Cool. The Makefile may still be useful if we want to chain some commands together, but we'll see.

For the Excel data, we can make use of csvkit.
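
For example, a minimal sketch with csvkit's in2csv (filename hypothetical), which converts an Excel sheet to CSV on the command line:

in2csv results.xlsx > results.csv

From there the rest of the pipeline could stay CSV-based.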

nofurtherinformation commented 6 days ago

Hey, I was digging into this a bit and found some insights. The Excel file the site outputs seems to be malformed in some way; neither pandas.read_excel nor xlrd can read it, whether from disk or from the raw bytes returned by the server.

An alternative approach could be to just use the HTML returned by the Drupal endpoint, and it doesn't require Selenium. Here's a naive, minimal example:

import requests
from bs4 import BeautifulSoup

election_id = 205
# Drupal URL that returns the HTML table
url = f"https://chicagoelections.gov/elections/results/{election_id}?ajax_form=1&_wrapper_format=drupal_ajax"

# Form data for the POST request
form_data = {
    'election_id': election_id,
    'form_id': 'election_results_form',
}

# Send the POST request
response = requests.post(url, data=form_data)

# The endpoint returns a list of objects; the last one holds the table HTML
html_table = response.json()[-1]['data']

# Parse the HTML table
table = BeautifulSoup(html_table, 'html.parser')

# TODO: Clean up table, make into Pandas DataFrame

This looks something like:

...
<thead>
<tr>
<th>Registered Voters</th>
<th>Ballots Cast</th>
<th>Turnout</th>
</tr>
</thead>
<tbody>
<tr>
<td>1,494,199</td>
<td>452,529</td>
<td>30.29%</td>
</tr>
...
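
Picking up the TODO above, one option for that last step might be pandas.read_html, which parses <table> markup directly. A rough sketch, untested against the real payload:

import pandas as pd
from io import StringIO

# read_html returns a list of DataFrames, one per <table> found,
# so take the first
df = pd.read_html(StringIO(html_table))[0]

# read_html strips thousands separators like "1,494,199" by default;
# percentages such as "30.29%" still arrive as strings and need cleanup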