Promptly-Technologies-LLC / rss-fetch-action

GitHub Action to scrape an RSS feed for display on a GitHub Pages website
MIT License

Error 403 when scraping Substack from an Actions runner #56

Open chriscarrollsmith opened 10 months ago

chriscarrollsmith commented 10 months ago

Scraping Substack with extractus works from a home PC but fails from an Actions runner. For reasons I don't fully understand, Substack began returning Error 403: Forbidden to the runner at 7 PM EST on January 15, 2023. Here is a reproducible example workflow:

name: Fetch RSS Feed

on:
  push:
    branches:
      - main

jobs:
  fetch-rss:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Fetch RSS Feed
      uses: Promptly-Technologies-LLC/rss-fetch-action@v2
      with:
        feed_url: https://knowledgeworkersguide.substack.com/feed
        file_path: ./feed.json
        parser_options: "{\"useISODateFormat\": false, \"getExtraEntryFields\": \"(feedEntry) => { return { 'content:encoded': feedEntry['content:encoded'] || '' }; }\"}"
        fetch_options: "{}"
        remove_published: true

    - name: Commit and push changes to repository
      uses: stefanzweifel/git-auto-commit-action@v4
      with:
        commit_message: 'Update RSS feed'
        file_pattern: '*.json'

I have tried adding custom headers, but without success.
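For illustration, a custom-header attempt could look like the step below. This is only a sketch: it assumes the action parses fetch_options as JSON and forwards it to the underlying fetch call, and the header values are hypothetical examples rather than the ones actually tried.

```yaml
    - name: Fetch RSS Feed
      uses: Promptly-Technologies-LLC/rss-fetch-action@v2
      with:
        feed_url: https://knowledgeworkersguide.substack.com/feed
        file_path: ./feed.json
        # Assumption: fetch_options is parsed as JSON and passed through to fetch.
        # The User-Agent and Accept values are illustrative, browser-like examples.
        fetch_options: "{\"headers\": {\"User-Agent\": \"Mozilla/5.0 (X11; Linux x86_64)\", \"Accept\": \"application/rss+xml, application/xml\"}}"
```

If a request with browser-like headers still returns 403 from the runner but succeeds from a home PC, the headers alone are probably not what Substack is filtering on.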

chriscarrollsmith commented 10 months ago

Note that this is not specific to extractus. Version 1 of rss-fetch-action, which used isomorphic-fetch instead, fails in the same way:

      - name: Fetch RSS Feed
        uses: Promptly-Technologies-LLC/rss-fetch-action@v1
        with:
          feed_url: https://babafaqirchand.substack.com/feed
          file_path: ./src/components/ui/RssFeed.json
          remove_last_build_date: true

I have also tried a Windows runner instead of an Ubuntu runner, but got the same Error 403.

chriscarrollsmith commented 10 months ago

Honestly, it seems like Substack may have specifically blocked GitHub Actions runners for some reason. I'm not sure why they would do this (maybe IP concerns about Substack content appearing on GitHub, or abusive high-frequency requests?) or how they would go about it (some kind of CORS/IP blocking?), but it's my current best guess.
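One way to test this guess independently of the action is to request the feed with plain curl from a runner and look at the status code and response headers. The workflow below is a minimal sketch: the workflow name, job name, and manual trigger are assumptions chosen for illustration, and only the feed URL comes from the example above.

```yaml
name: Diagnose Substack 403

on:
  workflow_dispatch:

jobs:
  probe-feed:
    runs-on: ubuntu-latest

    steps:
    - name: Print HTTP status and response headers
      run: |
        # -s hides the progress meter, -D - writes response headers to stdout,
        # -o /dev/null discards the body, -w prints the final status code.
        # Without -f, curl exits 0 even on a 403, so the step itself will not fail.
        curl -s -D - -o /dev/null -w 'HTTP status: %{http_code}\n' \
          https://knowledgeworkersguide.substack.com/feed
```

If curl also gets a 403 on the runner while the same command returns 200 from a home connection, that would point toward filtering by network origin rather than anything in the action or its fetch library.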