Automate checking for new builds with GitHub action/similar

Currently the process of checking for new builds is somewhat of a 'manual assisted' process of browsing to the ChatGPT site, letting the chatgpt-web-app-script-update-notifier user script check if any of the script files had changed, then potentially reacting to that notification with more manual steps.

You can see the full manual steps outlined on this issue:

https://github.com/0xdevalias/chatgpt-source-watch/issues/7

But the core of the start of them are summarised below:

At a very high level, off the top of my head, my current process is basically:
Load ChatGPT and let my userscript check/notify me if there are any new script files

https://github.com/0xdevalias/userscripts/tree/main/userscripts/chatgpt-web-app-script-update-notifier

If there are new scripts, use the 'Copy ChatGPT Script data to clipboard' menu option on Tampermonkey
Run the following script to get a filtered list of the JSON (with dates) and list of URLs to be downloaded:
pbpaste | ./scripts/filter-urls-not-in-changelog.js --json-with-urls
etc
Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/7#issue-2152094993

Because the notifier currently only runs when the ChatGPT app is accessed, it is both easy to miss updates (eg. if updates happen but the ChatGPT app isn't accessed), and easy to get distracted from the task that ChatGPT was originally being opened for by the fact that there is a new update (leading to a tantalising procrastination/avoidance 'treat' when the task at hand brings less dopamine)

The proposed solution would be to use GitHub actions or similar to schedule an 'update check' to happen at a regular interval (eg. once per hour). The following are some notes I made in an intial prompt to ChatGPT for exploring/implementing this:

https://chat.openai.com/c/fc2d3037-968d-451f-b590-179ed6e824e0 (🔒 private chat)

Can you plan out and create a github action that will:

- run on a schedule (eg. every 1hr)
- check the HTML on a specified webpage and extract some .js script URLs related to a bundled webpack/next.js app
- check (against a cache or similar? not sure of the best way to implement this on github actions) if those URLs have been previously recorded
- if they are new URLs, notify the user and/or kick off further processing (this will probably involve executing one or more scripts that will then download/process the URLs)

That describes the most basic features this should be able to handle (off the top of my head), but the ideal plan is that the solution will be expandable to be able to handle and automate more of the process in future. Some ideas for future features would be:

being able to open a Pull Request for each new build, that contains the downloaded files, and the results of various scripts being run on them. This PR would also serve as an interface to prompt the user with any manual actions that are required of them, and some 'bot commands'/workflow for finalising the updates to the CHANGELOG/etc (eg. rebase the PR)
etc

https://github.com/actions/cache
- Cache dependencies and build outputs in GitHub Actions
- https://github.com/actions/cache/blob/main/tips-and-workarounds.md#update-a-cache
- Update a cache A cache today is immutable and cannot be updated. But some use cases require the cache to be saved even though there was a "hit" during restore. To do so, use a key which is unique for every run and use restore-keys to restore the nearest cache.
- https://github.com/actions/cache/blob/main/caching-strategies.md
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows
- Caching dependencies to speed up workflows To make your workflows faster and more efficient, you can create and use caches for dependencies and other commonly reused files.
- To cache dependencies for a job, you can use GitHub's cache action. The action creates and restores a cache identified by a unique key. Alternatively, if you are caching the package managers listed below, using their respective setup-* actions requires minimal configuration and will create and restore dependency caches for you.
- Warning: Be mindful of the following when using caching with GitHub Actions: Anyone with read access can create a pull request on a repository and access the contents of a cache. Forks of a repository can also create pull requests on the base branch and access caches on the base branch.
- https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#comparing-artifacts-and-dependency-caching
- Comparing artifacts and dependency caching Artifacts and caching are similar because they provide the ability to store files on GitHub, but each feature offers different use cases and cannot be used interchangeably.
  - Use caching when you want to reuse files that don't change often between jobs or workflow runs, such as build dependencies from a package management system.
  - Use artifacts when you want to save files produced by a job to view after a workflow run has ended, such as built binaries or build logs.
- https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts
  - Storing workflow data as artifacts Artifacts allow you to share data between jobs in a workflow and store data once that workflow has completed.
  - GitHub provides two actions that you can use to upload and download build artifacts. For more information, see the upload-artifact and download-artifact actions.
    
    To share data between jobs:
    - Uploading files: Give the uploaded file a name and upload the data before the job ends.
    - Downloading files: You can only download artifacts that were uploaded during the same workflow run. When you download a file, you can reference it by name.
- https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#restrictions-for-accessing-a-cache
Restrictions for accessing a cache Access restrictions provide cache isolation and security by creating a logical boundary between different branches or tags. Workflow runs can restore caches created in either the current branch or the default branch (usually main). If a workflow run is triggered for a pull request, it can also restore caches created in the base branch, including base branches of forked repositories. For example, if the branch feature-b has the base branch feature-a, a workflow run triggered on a pull request would have access to caches created in the default main branch, the base feature-a branch, and the current feature-b branch.

Workflow runs cannot restore caches created for child branches or sibling branches. For example, a cache created for the child feature-b branch would not be accessible to a workflow run triggered on the parent main branch. Similarly, a cache created for the feature-a branch with the base main would not be accessible to its sibling feature-c branch with the base main. Workflow runs also cannot restore caches created for different tag names. For example, a cache created for the tag release-a with the base main would not be accessible to a workflow run triggered for the tag release-b with the base main.

When a cache is created by a workflow run triggered on a pull request, the cache is created for the merge ref (refs/pull/.../merge). Because of this, the cache will have a limited scope and can only be restored by re-runs of the pull request. It cannot be restored by the base branch or other pull requests targeting that base branch.

Multiple workflow runs in a repository can share caches. A cache created for a branch in a workflow run can be accessed and restored from another workflow run for the same repository and branch.

I did consider using a file within the repo for the 'previous URLs', but I don't want to clutter the commit history with this, and I also want the solution to be able to still work effectively even if it opened multiple Pull Requests (one for each new build), that weren't merged until some future point in time.

I'm not very familiar with GitHub actions, but if turns out that they're annoying to work with, do you think it could also be viable to use something like cron locally that checks for new files and then uses the GitHub CLI to open the PRs etc. ?

I did consider using a file within the repo for the 'previous URLs', but I don't want to clutter the commit history with this, and I also want the solution to be able to still work effectively even if it opened multiple Pull Requests (one for each new build), that weren't merged until some future point in time.

Would you still say that the commit history would be too cluttered when using a local file if it only gets updated as part of the same commit as the main source change? If the file is modified locally each time, then could it be agnostic to which PRs have been merged?

I'm not very familiar with GitHub actions, but if turns out that they're annoying to work with, do you think it could also be viable to use something like cron locally that checks for new files and then uses the GitHub CLI to open the PRs etc. ?

That would also be possible; but GitHub actions are pretty nice/easy to work with all in all. For a scheduled action I believe it even can use basically the cron scheduling syntax for setting the timing of it.

Would you still say that the commit history would be too cluttered when using a local file if it only gets updated as part of the same commit as the main source change?

If it was updated along with the main source change, then I don't believe it would satisfy the need for being able to work with multiple PR's open at once; as the first PR would have the first build; but then the second PR would have no way of knowing that that first build was already captured...

Unless the PR's are made 'chained' off one another I suppose. Though I wonder if that would complicate things.

If the file is modified locally each time, then could it be agnostic to which PRs have been merged?

My main thought here was probably to store that file in the GitHub action cache; so in a sense that would sort of make it a bit like a 'local file' independent of the PR's; but I would need to look into the semantics of how it handles a cache hit + updating the cache with a new value still. I believe I read that it's possible to both read from and write to the cache now; but it's not something I've used yet.

That would also be possible; but GitHub actions are pretty nice/easy to work with all in all. For a scheduled action I believe it even can use basically the from scheduling syntax for setting the timing of it.

Ah, okay, then that should be easier to manage without being tied to a specific machine.

If it was updated along with the main source change, then I don't believe it would satisfy the need for being able to work with multiple PR's open at once; as the first PR would have the first build; but then the second PR would have no way of knowing that that first build was already captured... Unless the PR's are made 'chained' off one another I suppose. Though I wonder if that would complicate things.

I was thinking in the context of having a local machine where the file is stored directly and where the pull requests are generated from: If the script runs and finds URL 1, it would add it to the local "previous URLs" file and submit a PR 1, which includes an update to the upstream version of that file. If you run the script later again and find URL 2, even if PR 1 isn't merged yet upstream, the file would have recorded URL 1 locally, and PR 2 would not duplicate its work.

But yeah, it would probably add unnecessary complexity compared to something centralized and automatically updated through GitHub, because if user A submits PR 1, a user B would then have to merge PR 1 locally to include its URL in their local file, before submitting their PR 2.

My main thought here was probably to store that file in the GitHub action cache; so in a sense that would sort of make it a bit like a 'local file' independent of the PR's; but I would need to look into the semantics of how it handles a cache hit + updating the cache with a new value still. I believe I read that it's possible to both read from and write to the cache now; but it's not something I've used yet.

Ok, then that is something good to look into. From my brief look, action cache seems more designed for accessing and then discarding and recreating, rather than making incremental updates to. E.g. a GitHub staff member in https://github.com/orgs/community/discussions/54404#discussioncomment-5804631 says that cache entries are discarded if not accessed for 7 days, which could be annoying. AFAICT artifacts would have a better interface, and are stored for longer (90 days).

But yeah, it would probably add unnecessary complexity compared to something centralized and automatically updated through GitHub

Yeah, I want to set things up to run within the repo/cloud infra rather than be tied to any individual's machine.

E.g. a GitHub staff member says that cache entries are discarded if not accessed for 7 days, which could be annoying

Yeah, that might be an issue. I think originally I was thinking that wouldn't matter so much if it's running on like a daily schedule anyway.

AFAICT artifacts would have a better interface, and are stored for longer (90 days).

It's been a while since I looked at it, so can't say for sure; but I believe when I was reading about artefacts, while they were also one of my first thoughts, there may have been a reason why I thought they wouldn't work after looking deeper. Possibly because while you can upload to them, you may not be able to download from them again from a different job or similar?

Pretty sure one of the links I referenced in the original post may talk more about it if I remember correctly.

Another feature I thought might be usable for it is the new 'GitHub repo KV store' sort of thing; I forget the specific name of it.

It's been a while since I looked at it, so can't say for sure; but I believe when I was reading about artefacts, while they were also one of my first thoughts, there may have been a reason why I thought they wouldn't work after looking deeper. Possibly because while you can upload to them, you may not be able to download from them again from a different job or similar?

Pretty sure one of the links I referenced in the original post may talk more about it if I remember correctly.

This comment above + the links and snippets it has within it contains a bunch of the relevant docs on cache + artefacts and the differences between them from when I first looked into this:

https://github.com/0xdevalias/chatgpt-source-watch/issues/8#issuecomment-1970444508

From that comment, this part is what I was referring to RE: sounding like not being able to download artefacts again from a different job/run:

docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts

Storing workflow data as artifacts Artifacts allow you to share data between jobs in a workflow and store data once that workflow has completed.

GitHub provides two actions that you can use to upload and download build artifacts. For more information, see the upload-artifact and download-artifact actions. To share data between jobs:

Uploading files: Give the uploaded file a name and upload the data before the job ends.

Downloading files: You can only download artifacts that were uploaded during the same workflow run. When you download a file, you can reference it by name.

Another feature I thought might be usable for it is the new 'GitHub repo KV store' sort of thing; I forget the specific name of it.

This is the key-value feature I was referring to; 'Repository Custom Properties':

Though it sounds like it's only available to organisations and not individual repositories (and the use case I was proposing would have definitely been a hack and not it's intended usage)

Could also potentially hack this sort of functionality using GitHub Action environment variables, but again, not really it's intended use case:

https://docs.github.com/en/actions/learn-github-actions/variables

Another idea I just had was that we could potentially use a GitHub gist or similar as the 'memory store' that is read from/written to (if cache/other options explored above aren't ideal).

Here's one arbitrary google result talking about the concept, though I've seen other places use/talk about it in the past as well:

https://dev.to/rikurouvila/how-to-use-a-github-gist-as-a-free-database-20np

And some arbitrary GitHub actions for reading/writing to gists (though we could also just do it directly with the API, or with some minimal code using the SDK probably):

https://github.com/sergeysova/gist-read-action
- The action to read content of a gist to an output.
- https://github.com/sergeysova/gist-read-action/blob/master/read.js
https://github.com/sergeysova/gist-write-action
- The action to write content to a gist from an input.
- https://github.com/sergeysova/gist-write-action/blob/master/index.js
https://github.com/exuanbo/actions-deploy-gist
- Deploy file to Github Gist
https://github.com/andymckay/get-gist-action
- This Action gets a Gist off of GitHub and writes it out to a file. It requires that the gist only contains one and only one file.
etc

I haven't thought about it too deeply.. but off the top of my head at this stage, I'm slightly leaning towards maybe using this method. We don't even really need to make the gist secret, as it could act as a somewhat standalone public 'memory' of ChatGPT builds in and of itself.

To simplify/avoid potential read/write conflicts to the 'history' file, we could probably just ensure the GitHub action can only run once at a time:

https://github.blog/changelog/2021-04-19-github-actions-limit-workflow-run-or-job-concurrency/
- GitHub Actions now supports a concurrency key at both the workflow and job level that will ensure that only a single run or job is in progress.
https://docs.github.com/en/actions/using-jobs/using-concurrency
- GitHub Actions also allows you to control the concurrency of workflow runs, so that you can ensure that only one run, one job, or one step runs at a time in a specific context.
- A concurrency group can be any string or expression.
- When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any pending job or workflow in the concurrency group will be canceled. This means that there can be at most one running and one pending job in a concurrency group at any time.
- To also cancel any currently running job or workflow in the same concurrency group, specify cancel-in-progress: true. To conditionally cancel currently running jobs or workflows in the same concurrency group, you can specify cancel-in-progress as an expression with any of the allowed expression contexts.

We probably wouldn't need to run the cancel-in-progress part I wouldn't think.

I haven't thought about it too deeply.. but off the top of my head at this stage, I'm slightly leaning towards maybe using this method. We don't even really need to make the gist secret, as it could act as a somewhat standalone public 'memory' of ChatGPT builds in and of itself.

Based on the above, I think maybe using a gist as the 'history memory' could be a good way to approach this.

A good first basic prototype could just be to:

create a GitHub Action that runs on a schedule (eg. hourly)
read the previous 'memory'/'history' from a gist into the workflow run, probably saved as a local file/similar
create a script that will do a few steps similar to how my user script currently does it:
- https://github.com/0xdevalias/userscripts/tree/main/userscripts/chatgpt-web-app-script-update-notifier
- load/parse the ChatGPT index.html, and extract the build/script URLs
- merge/de-duplicate/sort these into the existing 'memory'/'history' the we loaded from the gist
- log newly identified builds/URLs (and do so in a way that we can trigger later jobs to run based on these new URLs)
- write the updated 'memory'/'history' file back to the gist

Off the top of my head, that would be enough of the 'bits and pieces' to prove the initial concept of automating things, and create some immediate value (if only basic), without having to automate the full process in it's entirety up front (and deal with all of the extra complexities that will bring with it)

0xdevalias / chatgpt-source-watch

Automate checking for new builds with GitHub action/similar #8

See Also