Open dvdokkum opened 8 months ago
To be honest, I don't have a lot of time to look into this, so I'm going to have to just point you in the right direction rather than work on it :/ Sorry about that.
We might need to add a way to add fake user agent stuff to the requests call: https://stackoverflow.com/a/27652558
Seems like there is a similar bug here: https://stackoverflow.com/a/56946001
@evamaxfield I totally get that and no worries, direction pointing is also helpful!
I just tried to hit the video URL (https://archive-video.granicus.com/chapelhill/chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4) locally with a simple requests.get call, with and without a user agent specified, and it didn't seem to make a difference: I got 200 responses on both.
At this point I'm wondering: maybe it isn't a UA issue, but something to do with this urllib3 warning that is also appearing in the logs:
InsecureRequestWarning: Unverified HTTPS request is being made to host 'archive-video.granicus.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
I'm assuming the chapelhill granicus/legistar setup is pretty standard, so I'm also wondering why this isn't an issue on other instances. Perhaps I set something up incorrectly in GCP...?
Hmmm interesting...
Here is the relevant snippet of code: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/utils/file_utils.py#L233
The InsecureRequestWarning is definitely from our verify=False usage. Iirc, we added this because we noticed that a lot of gov websites still weren't on https, and requests + the http protocol was trying to do some fancy magic for us when really we just wanted to say "look, if it's this URL, just give it a try".
However, I think at this point that single line of code, to my knowledge, only affects a single instance (Seattle). It might be worth a try to add the user agent stuff to that requests call and also remove the verify=False.
Worst case, I wonder if we can add a further try and except within that block that first tries with verify=True (or just no-parameter / default) and then an except that tries with verify=False.
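A minimal sketch of that try/except idea, assuming the download goes through requests. The function name get_with_verify_fallback, the DEFAULT_USER_AGENT string, and the timeout value are hypothetical and not taken from cdp-backend; only the verify and headers parameters are real requests options.

```python
import requests
from requests.exceptions import SSLError

# Hypothetical value; any descriptive User-Agent string could be used here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; cdp-backend-sketch)"


def get_with_verify_fallback(url, session=None, user_agent=DEFAULT_USER_AGENT, timeout=30):
    """Try a verified request first; retry unverified only on an SSL error."""
    if session is None:
        session = requests.Session()
    headers = {"User-Agent": user_agent}
    try:
        # No verify argument: requests verifies certificates by default.
        return session.get(url, headers=headers, stream=True, timeout=timeout)
    except SSLError:
        # Fallback for hosts with broken certificates. This second call is
        # the one that would emit urllib3's InsecureRequestWarning.
        return session.get(
            url, headers=headers, stream=True, timeout=timeout, verify=False
        )
```

Note that this only falls back on certificate failures. A 403 comes back as a normal response rather than an exception, so it would pass straight through and still need to be checked by the caller.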
I was about to say that the only other instance that may know anything about this is Asheville (they are the only active instance to my knowledge) but I just checked and they download their videos from youtube.
On the only other instance thing: most CDP instances have been shut down. The only one I plan to leave running is Seattle. We just don't have the funding for them anymore + my primary research interests aren't in civic stuff anymore, so it's just hard to justify the expenses every month. Because most instances are shut down, this issue may have been going unnoticed for a while.
So I modified the logic on our instance to send a UA and use verify=True and I'm still getting a 403 on the request. Locally, I've tried a bunch of different things to replicate the 403 issue and I am unable to. Sending a UA in the header, using verify=True/False, etc. just doesn't seem to make a difference and I'm always able to get a 200. I have to assume at this point that it is an environment issue with GCP/Granicus, but I'm unsure how to debug something like that.
Granicus is publicly accessible, unless they're blocking the GCP IP range (which would be surprising)? The event-gather workflow/runner is clearly able to hit the public internet since it is downloading a bunch of deps...
Maybe the 403 is a red herring and there's something else going on, but generally unsure how to proceed with debugging.
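One way to get more out of the failing run might be to log what the 403 response actually contains, since the response headers and the start of the body sometimes show which layer refused the request. A rough sketch, assuming a requests-style response object (status_code, reason, headers, text); summarize_response is a hypothetical helper, not something in cdp-backend:

```python
def summarize_response(response, body_chars=500):
    """Collect the parts of an HTTP response that may help explain a 403."""
    return {
        "status": response.status_code,
        "reason": getattr(response, "reason", None),
        # Copy into a plain dict so the summary is easy to print or log.
        "headers": dict(response.headers),
        # Error bodies are usually short; keep only the first few hundred
        # characters so the logs stay readable.
        "body_start": response.text[:body_chars],
    }
```

Printing summarize_response(resp) in the workflow run and again from a local machine would give two summaries to compare side by side. Whether that reveals anything about the GCP vs. local difference is a guess.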
Sorry for the delay. You can try running the pipeline locally on your own machine as well to see if you get more insight that way.
The whole pipeline is just a python bin script after you install the package and if you have all of your GCP keys set up, everything should work. Minus any weird GPU issues and stuff.
Might be the best way to debug
Describe the Bug
When running the process-events job in the Event Gather workflow, there's a 403 error when trying to copy the video file.
Expected Behavior
The video file is publicly accessible (link in the logs below) so I would expect the script to not hit a 403.
Reproduction
This is for a cookie cutter CDP instance for the Legistar chapelhill client. Link to the failed run logs: https://github.com/triangleblogblog/cdp-ch/actions/runs/8516893019/job/23326788209
Error logs: