CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

403 error on resource_copy #244

Open dvdokkum opened 8 months ago

dvdokkum commented 8 months ago

Describe the Bug

When running process-events job in the Event Gather workflow, there's a 403 error when trying to copy the video file.

Expected Behavior

The video file is publicly accessible (link in the logs below) so I would expect the script to not hit a 403.

Reproduction

This is for a cookie cutter CDP instance for the Legistar chapelhill client. Link to the failed run logs: https://github.com/triangleblogblog/cdp-ch/actions/runs/8516893019/job/23326788209

Error logs:

```
[INFO: file_utils: 203 2024-04-02 03:30:10,750] Beginning resource copy from: https://archive-video.granicus.com/chapelhill/chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4
/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/urllib3/connectionpool.py:1061: InsecureRequestWarning: Unverified HTTPS request is being made to host 'archive-video.granicus.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
[ERROR: file_utils: 263 2024-04-02 03:30:10,836] Something went wrong during resource copy. Attempted copy from: 'https://archive-video.granicus.com/chapelhill/chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4', resulted in error.
[ERROR: task_runner: 910 2024-04-02 03:30:10,836] Task 'resource_copy_task': Exception encountered during task execution!
Traceback (most recent call last):
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/prefect/utilities/executors.py", line 468, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/cdp_backend/pipeline/event_gather_pipeline.py", line 272, in resource_copy_task
    return file_utils.resource_copy(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/cdp_backend/utils/file_utils.py", line 267, in resource_copy
    raise e
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/cdp_backend/utils/file_utils.py", line 246, in resource_copy
    response.raise_for_status()
  File "/__w/_tool/Python/3.11.8/x64/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://archive-video.granicus.com/chapelhill/chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4
[INFO: task_runner: 335 2024-04-02 03:30:10,844] Task 'resource_copy_task': Finished task run for task with final state: 'Retrying'
```

### Environment


-   cdp-backend Version: latest
evamaxfield commented 8 months ago

I don't have a lot of time to look into this, to be honest, so I'm going to point you in the right direction rather than work on it myself :/ sorry about that.

We might need a way to pass a fake user agent to the requests call: https://stackoverflow.com/a/27652558

Seems like there is a similar bug here: https://stackoverflow.com/a/56946001
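Something like this is what I mean (the header value and function name here are just for illustration, not anything in the codebase):

```python
import requests

# Browser-like User-Agent; some servers reject the default
# "python-requests/x.y" client string with a 403.
FAKE_UA_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}

def fetch_with_fake_ua(url):
    # stream=True so large video files aren't buffered into memory
    response = requests.get(url, headers=FAKE_UA_HEADERS, stream=True)
    response.raise_for_status()
    return response
```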

dvdokkum commented 8 months ago

@evamaxfield I totally get that and no worries, direction pointing is also helpful!

I just tried to hit the video URL (https://archive-video.granicus.com/chapelhill/chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4) locally with a simple requests.get call, with and without a user agent specified, and it didn't seem to make a difference: I got 200 responses both ways.
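For reference, you can compare what each variant actually sends without even hitting the server, since requests only injects its default `python-requests/<version>` User-Agent when the session prepares the request:

```python
import requests

URL = (
    "https://archive-video.granicus.com/chapelhill/"
    "chapelhill_bf9e87e2-e776-11ee-98bb-0050569183fa.mp4"
)

session = requests.Session()

# What a bare requests.get(URL) would send over the wire
default_request = session.prepare_request(requests.Request("GET", URL))
print(default_request.headers["User-Agent"])  # python-requests/<version>

# The same request with an explicit User-Agent override
ua_request = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": "Mozilla/5.0"})
)
print(ua_request.headers["User-Agent"])  # Mozilla/5.0
```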

At this point I'm wondering:

  1. maybe the user-agent issue only shows up when making the request from the VM
  2. maybe it isn't a UA issue at all, but something to do with this urllib3 warning that is also appearing in the logs:

    InsecureRequestWarning: Unverified HTTPS request is being made to host 'archive-video.granicus.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings

I'm assuming the chapelhill granicus/legistar setup is pretty standard, so also wondering why this isn't an issue on other instances. Perhaps I set something up incorrectly in GCP...?

evamaxfield commented 8 months ago

Hmmm interesting...

Here is the relevant snippet of code: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/utils/file_utils.py#L233

The InsecureRequestWarning is definitely from our verify=False usage. IIRC, we added this because we noticed that a lot of gov websites still weren't on HTTPS, and requests + the HTTP protocol was trying to do some fancy magic for us when really we just wanted to say "look, if it's this URL, just give it a try".

However, I think at this point that single line of code, to my knowledge, only affects a single instance (Seattle). It might be worth trying to add the user agent stuff to that requests call and also removing the verify=False.

Worst case, I wonder if we can add a further try/except within that block that first tries with verify=True (or just the no-parameter default) and then, in the except, retries with verify=False.
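That worst-case fallback could look something like this (a sketch with names of my own choosing, not the actual file_utils code):

```python
import requests

def get_with_tls_fallback(url, headers=None, timeout=30):
    """Try a normal verified HTTPS request first; only fall back to
    verify=False when certificate verification itself is the failure."""
    try:
        return requests.get(url, headers=headers, stream=True, timeout=timeout)
    except requests.exceptions.SSLError:
        # Some gov sites still serve broken cert chains; last resort only.
        return requests.get(
            url, headers=headers, stream=True, timeout=timeout, verify=False
        )
```

That way the InsecureRequestWarning (and any insecure request) only happens for hosts that actually fail verification, instead of for every copy.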

I was about to say that the only other instance that may know anything about this is Asheville (they are the only active instance to my knowledge) but I just checked and they download their videos from youtube.


On the only-other-instance thing: most CDP instances have been shut down. The only one I plan to leave running is Seattle. We just don't have the funding for them anymore, and my primary research interests aren't in civic stuff anymore, so it's hard to justify the expenses every month. Because most instances are shut down, this issue may have gone unnoticed for a while.

dvdokkum commented 7 months ago

So I modified the logic on our instance to send a UA and use verify=True, and I'm still getting a 403 on the request. Locally, I've tried a bunch of different things to replicate the 403 and I am unable to: sending a UA in the header, using verify=True/False, etc. just doesn't seem to make a difference, and I always get a 200. At this point I have to assume it is an environment issue with GCP/Granicus, but I'm unsure how to debug something like that.

Granicus is publicly accessible, unless they're blocking the GCP IP range (which would be surprising)? The event-gather workflow/runner is clearly able to reach the public internet, since it downloads a bunch of deps...

Maybe the 403 is a red herring and there's something else going on, but generally unsure how to proceed with debugging.
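One idea I may try: log the full response from inside the runner and diff it against a local run, since block pages usually identify themselves in the `Server` header or the body. A hypothetical helper (name and fields are mine):

```python
import requests

def summarize_response(resp):
    """Collect the fields worth diffing between the GitHub Actions
    runner and a local machine for an environment-specific 403."""
    return {
        "status": resp.status_code,
        "final_url": resp.url,                 # did a redirect change it?
        "server": resp.headers.get("Server"),  # CDN / WAF often named here
        "headers": dict(resp.headers),
        "body_preview": resp.text[:500],       # block pages often say why
    }
```

Printing `summarize_response(requests.get(url))` in a workflow step and comparing it to the local output should show whether Granicus (or a CDN in front of it) is treating the runner's IP differently.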

evamaxfield commented 7 months ago

Sorry for the delay. You can try running the pipeline locally on your own machine as well, to see if you get more insight that way.

The whole pipeline is just a Python bin script once you install the package, and if you have all of your GCP keys set up, everything should work, minus any weird GPU issues and such.

That might be the best way to debug.