Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Sf Court Scraper - V1 #201

Closed DouglasKrouth closed 1 year ago

DouglasKrouth commented 1 year ago

Implemented scraper to grab case number, title by date for request "#179 Big Local News : San Francisco Court Scraper".

thejqs commented 1 year ago

Hi, @DouglasKrouth! Thank you muchly for this. Before I dig into this too deeply, did you have a chance to see the note I left on the gist you sent containing your original work plan?

DouglasKrouth commented 1 year ago

Hello Jacob! I did review the comments that you left in the gist and I set up the scraper with the following process.

  1. Open a Selenium instance so the user can complete the CAPTCHA. This generates a session ID/session token.
  2. Once the token is obtained, use the requests module to make GET requests for specific record dates using the webapp URL. This was set up instead of "selenium-ing" our way through the application, which would have been a fragile/tedious solution, as you pointed out.
  3. If there is an issue with the session, a new CAPTCHA will pop up that the user can complete to generate a new session value.

Happy to set up a time if you'd like to review this over a call; otherwise, I followed the guidelines that you laid out in the Gist comment :)
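The three-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: `BASE_URL`, the query parameter names, `SessionExpired`, `fetch`, and `solve_captcha` are all hypothetical names introduced here. In a real run, `solve_captcha` would presumably wrap the Selenium step and `fetch` would wrap a `requests.get` call; both are passed in as plain callables so the retry logic can be read and exercised on its own.

```python
from datetime import date
from typing import Callable, Tuple
from urllib.parse import urlencode

# Hypothetical placeholder: the real webapp URL is not given in this thread.
BASE_URL = "https://example.invalid/calendar"


class SessionExpired(Exception):
    """Raised by a fetcher when the site rejects the current session (assumed signal)."""


def build_url(day: date, session_id: str) -> str:
    # The parameter names "date" and "SessionID" are assumptions for illustration.
    return BASE_URL + "?" + urlencode({"date": day.isoformat(), "SessionID": session_id})


def fetch_day(
    day: date,
    session_id: str,
    fetch: Callable[[str], str],
    solve_captcha: Callable[[], str],
    max_refreshes: int = 1,
) -> Tuple[str, str]:
    """Fetch one day's records, refreshing the session if it is rejected.

    Returns the response body and the session ID that finally worked, so the
    caller can reuse that ID for the next date.
    """
    refreshes = 0
    while True:
        try:
            # Step 2: a plain GET against the webapp URL for a specific date.
            return fetch(build_url(day, session_id)), session_id
        except SessionExpired:
            if refreshes >= max_refreshes:
                raise
            # Step 3: ask the user to complete a new CAPTCHA for a fresh session.
            session_id = solve_captcha()
            refreshes += 1
```

Returning the working session ID alongside the body lets a loop over many dates keep reusing one session until the site rejects it, which keeps the number of CAPTCHA prompts low.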

thejqs commented 1 year ago

@DouglasKrouth, wonderful. I'm at a conference and will be back Tuesday -- is there a time maybe Wednesday or Thursday that works well for you?

DouglasKrouth commented 1 year ago

@thejqs I'd be available either Wednesday or Thursday evening after 5 PM CST, if that works for you. I'm also available on Discord ( Douglas Krouth#6003 ) if you'd like to chat directly or exchange contact info.

thejqs commented 1 year ago

Great, @DouglasKrouth. Let's shoot for Wednesday evening and I'll find you on Discord to coordinate.

josh-chamberlain commented 1 year ago

@thejqs I'm in favor of including this standalone scraper in our own repo, so we can retain that value in addition to retrofitting it to BLN's structure. Thoughts?

thejqs commented 1 year ago

I feel like that's asking a lot of @DouglasKrouth, because it needs to be quite different to conform to what Big Local wants it to be, and will not have access to any utilities or other orchestration or tools in their system if a copy lives with us. If he's up for reopening this, I'm happy to help figure out a way to make it work.

DouglasKrouth commented 1 year ago

@thejqs I am totally open to working with both repos (PDAP and BLN) if there's benefit in having the scraper made accessible across the two groups. The initial reason I gave Josh for reopening this issue was to provide this scraper in its current form, as it was moderately "complete" regarding the functionality I set up. That being said, the last thing I want to do is create an extra piece of maintenance on the PDAP repo solely to persist a tool (the SF scraper) that would only be used by BLN/BLN stakeholders going forward.

Open to any suggestions on this. I had just reached out to @josh-chamberlain to double-check whether there'd be a value add to the PDAP repo, since this standalone version of the SF court scraper (without CAPTCHA automation or BLN formatting) was already set up.