0xdevalias / chatgpt-source-watch

Analyzing the evolution of ChatGPT's codebase through time with curated archives and scripts
https://github.com/0xdevalias/chatgpt-source-watch/blob/main/CHANGELOG.md

Automate checking for new builds with GitHub action/similar #8

Open 0xdevalias opened 8 months ago

0xdevalias commented 8 months ago

Currently the process of checking for new builds is a somewhat 'manually assisted' one: browsing to the ChatGPT site, letting the chatgpt-web-app-script-update-notifier user script check whether any of the script files have changed, then potentially reacting to that notification with more manual steps.

You can see the full manual steps outlined on this issue:

But the core initial steps are summarised below:

At a very high level, off the top of my head, my current process is basically:

Originally posted by @0xdevalias in https://github.com/0xdevalias/chatgpt-source-watch/issues/7#issue-2152094993

Because the notifier currently only runs when the ChatGPT app is accessed, it is both easy to miss updates (eg. if updates happen but the ChatGPT app isn't accessed), and easy to get distracted from the task that ChatGPT was originally opened for by the fact that there is a new update (leading to a tantalising procrastination/avoidance 'treat' when the task at hand brings less dopamine).

The proposed solution would be to use GitHub actions or similar to schedule an 'update check' to happen at a regular interval (eg. once per hour). The following are some notes I made in an initial prompt to ChatGPT for exploring/implementing this:

Can you plan out and create a github action that will:

- run on a schedule (eg. every 1hr)
- check the HTML on a specified webpage and extract some .js script URLs related to a bundled webpack/next.js app
- check (against a cache or similar? not sure of the best way to implement this on github actions) if those URLs have been previously recorded
- if they are new URLs, notify the user and/or kick off further processing (this will probably involve executing one or more scripts that will then download/process the URLs)

That describes the most basic features this should be able to handle (off the top of my head), but the ideal plan is that the solution will be expandable to be able to handle and automate more of the process in future. Some ideas for future features would be:
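As a concrete starting point for the above, the scheduled trigger plus a rough 'extract the script URLs' step might look something like the untested sketch below; the page URL, grep pattern, and file names are placeholders rather than the real extraction logic:

```yaml
# Untested sketch: the page URL, grep pattern, and file names below are
# placeholders, not the real extraction logic.
name: Check for new ChatGPT builds

on:
  schedule:
    - cron: '0 * * * *'   # every hour
  workflow_dispatch:       # also allow manual runs

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Extract bundled script URLs from the page
        run: |
          curl -fsSL "https://chatgpt.com/" \
            | grep -oE 'https?://[^"]+/_next/static/[^"]+\.js' \
            | sort -u > current-urls.txt
          cat current-urls.txt
```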

See Also

0xdevalias commented 8 months ago

I haven't explored the specifics of implementing this idea deeply yet; but I don't imagine it should be too complex.

One area I'm not currently sure of is the best way to implement the store of the 'previous URLs', so that we know if they have changed. Off the top of my head, I was thinking that maybe we could use GitHub action cache or similar for this, but there might be a better way:

I did consider using a file within the repo for the 'previous URLs', but I don't want to clutter the commit history with this, and I also want the solution to be able to still work effectively even if it opened multiple Pull Requests (one for each new build), that weren't merged until some future point in time.
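For reference, the simplest form of that cache idea is probably just a step along these lines (untested sketch; the file name and cache key are placeholders):

```yaml
# Untested sketch: cache a 'previous URLs' file between scheduled runs.
- name: Restore previous URLs from the Actions cache
  uses: actions/cache@v4
  with:
    path: previous-urls.txt
    # A unique key per run means a fresh cache entry is saved at the end of
    # each run; restore-keys falls back to the most recent earlier entry.
    key: previous-urls-${{ github.run_id }}
    restore-keys: |
      previous-urls-
```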

michaelskyba commented 8 months ago

I'm not very familiar with GitHub actions, but if it turns out that they're annoying to work with, do you think it could also be viable to use something like cron locally that checks for new files and then uses the GitHub CLI to open the PRs etc.?

I did consider using a file within the repo for the 'previous URLs', but I don't want to clutter the commit history with this, and I also want the solution to be able to still work effectively even if it opened multiple Pull Requests (one for each new build), that weren't merged until some future point in time.

Would you still say that the commit history would be too cluttered when using a local file if it only gets updated as part of the same commit as the main source change? If the file is modified locally each time, then could it be agnostic to which PRs have been merged?

0xdevalias commented 8 months ago

I'm not very familiar with GitHub actions, but if it turns out that they're annoying to work with, do you think it could also be viable to use something like cron locally that checks for new files and then uses the GitHub CLI to open the PRs etc.?

That would also be possible; but GitHub actions are pretty nice/easy to work with all in all. For a scheduled action, I believe it can even use basically the cron scheduling syntax for setting the timing of it.

Would you still say that the commit history would be too cluttered when using a local file if it only gets updated as part of the same commit as the main source change?

If it was updated along with the main source change, then I don't believe it would satisfy the need for being able to work with multiple PR's open at once; as the first PR would have the first build; but then the second PR would have no way of knowing that that first build was already captured...

Unless the PR's are made 'chained' off one another I suppose. Though I wonder if that would complicate things.

If the file is modified locally each time, then could it be agnostic to which PRs have been merged?

My main thought here was probably to store that file in the GitHub action cache; so in a sense that would sort of make it a bit like a 'local file' independent of the PR's; but I would need to look into the semantics of how it handles a cache hit + updating the cache with a new value still. I believe I read that it's possible to both read from and write to the cache now; but it's not something I've used yet.
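From what I understand of the cache semantics (untested sketch below), cache entries are immutable once saved under a given key, so the usual pattern for 'read, then update' is to restore with a key prefix and save under a new unique key each run, via the explicit restore/save sub-actions:

```yaml
# Untested sketch: explicit restore/save split. Cache entries are immutable
# once saved under a given key, so each run saves under a new unique key and
# restores whatever the most recent previous run saved.
- name: Restore previous URLs
  uses: actions/cache/restore@v4
  with:
    path: previous-urls.txt
    key: previous-urls-${{ github.run_id }}
    restore-keys: |
      previous-urls-

# ... compare against current-urls.txt and update previous-urls.txt here ...

- name: Save updated URLs
  uses: actions/cache/save@v4
  if: always()
  with:
    path: previous-urls.txt
    key: previous-urls-${{ github.run_id }}
```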

michaelskyba commented 8 months ago

That would also be possible; but GitHub actions are pretty nice/easy to work with all in all. For a scheduled action, I believe it can even use basically the cron scheduling syntax for setting the timing of it.

Ah, okay, then that should be easier to manage without being tied to a specific machine.

If it was updated along with the main source change, then I don't believe it would satisfy the need for being able to work with multiple PR's open at once; as the first PR would have the first build; but then the second PR would have no way of knowing that that first build was already captured... Unless the PR's are made 'chained' off one another I suppose. Though I wonder if that would complicate things.

I was thinking in the context of having a local machine where the file is stored directly and where the pull requests are generated from: If the script runs and finds URL 1, it would add it to the local "previous URLs" file and submit a PR 1, which includes an update to the upstream version of that file. If you run the script later again and find URL 2, even if PR 1 isn't merged yet upstream, the file would have recorded URL 1 locally, and PR 2 would not duplicate its work.

But yeah, it would probably add unnecessary complexity compared to something centralized and automatically updated through GitHub, because if user A submits PR 1, a user B would then have to merge PR 1 locally to include its URL in their local file, before submitting their PR 2.

My main thought here was probably to store that file in the GitHub action cache; so in a sense that would sort of make it a bit like a 'local file' independent of the PR's; but I would need to look into the semantics of how it handles a cache hit + updating the cache with a new value still. I believe I read that it's possible to both read from and write to the cache now; but it's not something I've used yet.

Ok, then that is something good to look into. From my brief look, action cache seems more designed for accessing and then discarding and recreating, rather than making incremental updates to. E.g. a GitHub staff member in https://github.com/orgs/community/discussions/54404#discussioncomment-5804631 says that cache entries are discarded if not accessed for 7 days, which could be annoying. AFAICT artifacts would have a better interface, and are stored for longer (90 days).
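E.g. a rough (untested) sketch of the artifact approach might be to upload the file each run and pull it back from the most recent successful run via the gh CLI, since the stock download-artifact step only sees artifacts from the same run by default (the workflow file name and artifact name here are placeholders):

```yaml
# Untested sketch: persist the file as an artifact and pull it back from the
# most recent successful run via the gh CLI ('check-builds.yml' and
# 'previous-urls' are placeholder names).
- name: Download previous URLs from the last successful run
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    run_id=$(gh run list --workflow "check-builds.yml" --status success \
      --limit 1 --json databaseId --jq '.[0].databaseId')
    gh run download "$run_id" --name previous-urls || echo "No previous artifact yet"

- name: Upload updated URLs for the next run
  uses: actions/upload-artifact@v4
  with:
    name: previous-urls
    path: previous-urls.txt
```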

0xdevalias commented 8 months ago

But yeah, it would probably add unnecessary complexity compared to something centralized and automatically updated through GitHub

Yeah, I want to set things up to run within the repo/cloud infra rather than be tied to any individual's machine.

E.g. a GitHub staff member says that cache entries are discarded if not accessed for 7 days, which could be annoying

Yeah, that might be an issue. I think originally I was thinking that wouldn't matter so much if it's running on like a daily schedule anyway.

AFAICT artifacts would have a better interface, and are stored for longer (90 days).

It's been a while since I looked at it, so can't say for sure; but I believe when I was reading about artefacts, while they were also one of my first thoughts, there may have been a reason why I thought they wouldn't work after looking deeper. Possibly because while you can upload to them, you may not be able to download from them again from a different job or similar?

Pretty sure one of the links I referenced in the original post may talk more about it if I remember correctly.

Another feature I thought might be usable for it is the new 'GitHub repo KV store' sort of thing; I forget the specific name of it.

0xdevalias commented 8 months ago

It's been a while since I looked at it, so can't say for sure; but I believe when I was reading about artefacts, while they were also one of my first thoughts, there may have been a reason why I thought they wouldn't work after looking deeper. Possibly because while you can upload to them, you may not be able to download from them again from a different job or similar?

Pretty sure one of the links I referenced in the original post may talk more about it if I remember correctly.

This comment above, plus the links and snippets within it, contains a bunch of the relevant docs on cache + artefacts and the differences between them from when I first looked into this:

From that comment, this part is what I was referring to RE: sounding like not being able to download artefacts again from a different job/run:

docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts

  • Storing workflow data as artifacts: Artifacts allow you to share data between jobs in a workflow and store data once that workflow has completed.

  • GitHub provides two actions that you can use to upload and download build artifacts. For more information, see the upload-artifact and download-artifact actions. To share data between jobs:

    • Uploading files: Give the uploaded file a name and upload the data before the job ends.
    • Downloading files: You can only download artifacts that were uploaded during the same workflow run. When you download a file, you can reference it by name.

Another feature I thought might be usable for it is the new 'GitHub repo KV store' sort of thing; I forget the specific name of it.

This is the key-value feature I was referring to; 'Repository Custom Properties':

Though it sounds like it's only available to organisations and not individual repositories (and the use case I was proposing would have definitely been a hack and not its intended usage).

Could also potentially hack this sort of functionality using GitHub Action environment variables, but again, not really its intended use case:
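For illustration only, one way to approximate that would be with Actions repository variables (the `vars` context), which is a related but distinct feature from plain environment variables. A rough, untested sketch; the variable name, token secret, and NEW_BUILD_HASH are made up, and writing variables likely needs a PAT with suitable scopes rather than the default GITHUB_TOKEN:

```yaml
# Untested sketch: (ab)use an Actions *repository variable* (the `vars`
# context) as a tiny key/value store. LAST_SEEN_BUILD, VARS_TOKEN, and
# NEW_BUILD_HASH are made-up names; updating variables likely needs a PAT
# rather than the default GITHUB_TOKEN.
- name: Read last seen build
  run: echo "Last seen build was ${{ vars.LAST_SEEN_BUILD }}"

- name: Record new build
  env:
    GH_TOKEN: ${{ secrets.VARS_TOKEN }}
  run: gh variable set LAST_SEEN_BUILD --body "$NEW_BUILD_HASH"
```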


Another idea I just had was that we could potentially use a GitHub gist or similar as the 'memory store' that is read from/written to (if cache/other options explored above aren't ideal).

Here's one arbitrary google result talking about the concept, though I've seen other places use/talk about it in the past as well:

And some arbitrary GitHub actions for reading/writing to gists (though we could also just do it directly with the API, or with some minimal code using the SDK probably):

I haven't thought about it too deeply.. but off the top of my head at this stage, I'm slightly leaning towards maybe using this method. We don't even really need to make the gist secret, as it could act as a somewhat standalone public 'memory' of ChatGPT builds in and of itself.
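To make that concrete, the read/write steps could be done against the gist REST API via gh api, roughly like the untested sketch below (GIST_ID, the gist file name, and the GIST_TOKEN secret are all placeholders; reading a public gist works with the default token, but writing requires a PAT with the `gist` scope since GITHUB_TOKEN can't write to gists):

```yaml
# Untested sketch: GIST_ID, the gist file name, and the GIST_TOKEN secret are
# placeholders. Reading a public gist works with the default token, but
# writing requires a PAT with the `gist` scope.
- name: Read build history from the gist
  env:
    GH_TOKEN: ${{ github.token }}
    GIST_ID: ${{ vars.GIST_ID }}
  run: |
    gh api "gists/$GIST_ID" --jq '.files["build-history.txt"].content' > previous-urls.txt

- name: Write the updated history back to the gist
  env:
    GH_TOKEN: ${{ secrets.GIST_TOKEN }}
    GIST_ID: ${{ vars.GIST_ID }}
  run: |
    sort -u previous-urls.txt current-urls.txt > updated-urls.txt
    jq -n --rawfile content updated-urls.txt \
      '{files: {"build-history.txt": {content: $content}}}' \
      | gh api --method PATCH "gists/$GIST_ID" --input -
```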


To simplify/avoid potential read/write conflicts to the 'history' file, we could probably just ensure the GitHub action can only run once at a time:

We probably wouldn't need the cancel-in-progress part, I wouldn't think.
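i.e. something along these lines at the top level of the workflow:

```yaml
# Only one check runs at a time; queued runs wait rather than being cancelled.
concurrency:
  group: check-new-builds
  cancel-in-progress: false
```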

0xdevalias commented 8 months ago

I haven't thought about it too deeply.. but off the top of my head at this stage, I'm slightly leaning towards maybe using this method. We don't even really need to make the gist secret, as it could act as a somewhat standalone public 'memory' of ChatGPT builds in and of itself.

Based on the above, I think maybe using a gist as the 'history memory' could be a good way to approach this.

A good first basic prototype could just be to:

Off the top of my head, that would be enough of the 'bits and pieces' to prove the initial concept of automating things, and create some immediate value (if only basic), without having to automate the full process in its entirety up front (and deal with all of the extra complexities that will bring with it).
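For the 'notify' end of that prototype, the glue could be as small as the untested sketch below, assuming earlier steps left previous-urls.txt and current-urls.txt on disk (the issue title/body wording is just an example, and the token needs issues write permission):

```yaml
# Untested sketch of the final 'compare and notify' glue, assuming earlier
# steps left previous-urls.txt and current-urls.txt on disk. The default
# token needs `issues: write` permission for this step.
- name: Open an issue if there are new script URLs
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # comm -13 prints lines that appear only in the second (current) file.
    new_urls=$(comm -13 <(sort previous-urls.txt) <(sort current-urls.txt))
    if [ -n "$new_urls" ]; then
      gh issue create \
        --title "New ChatGPT build detected" \
        --body "New script URLs:"$'\n\n'"$new_urls"
    fi
```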