NWChemEx / ParallelZone

You're travelling through another dimension, a dimension not only of CPUs and threads but of GPUs; a journey into a wondrous land whose boundaries are bandwidth limited. That's the signpost up ahead - your next stop, the ParallelZone!
https://nwchemex.github.io/ParallelZone/
Apache License 2.0

Rebase CI onto Docker Image #107

Closed (ryanmrichard closed this issue 1 year ago)

ryanmrichard commented 1 year ago

This issue is more or less the acceptance test for NWChemEx-Project/.github#81.

Once we have the "base image" called for in NWChemEx-Project/.github#81, we want to modify the various CI workflows in this repo so they start from that image. In particular, this repo defines the following workflows: add_licenses.yaml, c-cpp.yaml, deploy_docs.yaml, format.yaml, tag_and_release.yaml, and test_docs.yaml.

AFAIU GitHub Actions, it's all ultimately container based. By default, each run of GitHub Actions starts from some base image and then runs the workflow defined in the YAML files on top of that default base image. What we're really after here is to change from the default base image to the one maintained in .github. So we want the aforementioned workflows to start from the base image, but otherwise remain the same. This means:

We note that during the PR process there is no need for us to explicitly create and publish images because most of these images will be garbage on account of: PRs failing, containing lots of extra artifacts, being non-optimized code (usually compiled in debug mode), and containing lots of different configurations (which we want to test here, but do not want to forward to the next rung of repos as doing so would result in a combinatorial explosion). During the PR process the base image is purely one mechanism for speeding up the CI.
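
Concretely, the change amounts to something like the following sketch; <base_image> is a placeholder for whatever the .github base image ends up being called, not an existing package:

# Sketch of a job (e.g. in c-cpp.yaml) started from the shared base image
jobs:
  build_and_test:
    runs-on: ubuntu-latest
    container:
      image: <base_image>   # placeholder for the image maintained in .github
    steps:
      - uses: actions/checkout@v3
      # ... the existing build/test steps run unchanged inside the container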

yzhang-23 commented 1 year ago

One question: since each repo has different dependencies, what should we put into this base image? If the image is too "basic", i.e. it contains only a few packages, we don't save much build time; but it is also not optimal to keep a gigantic image which contains almost every dependency, right?

ryanmrichard commented 1 year ago

For now just worry about the dependencies of ParallelZone. The idea is to piggy-back up the stack. So, for example, when PluginPlay pulls the ParallelZone image resulting from #109, it'll get the dependencies ParallelZone needs, plus ParallelZone itself. Similarly, when SimDE pulls release images of PluginPlay it'll get PluginPlay and PluginPlay's dependencies (including ParallelZone and its dependencies). And so on, up the stack.
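
In workflow terms, the downstream repo would just point its job's container at the previous rung's release image, e.g. (a sketch; the image name is hypothetical, not an existing package):

# In PluginPlay's CI (sketch): start from the ParallelZone release image
# instead of the plain base image.
container:
  image: ghcr.io/nwchemex-project/release_parallelzone:latest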

yzhang-23 commented 1 year ago

For now just worry about the dependencies of ParallelZone. The idea is to piggy-back up the stack. So, for example, when PluginPlay pulls the ParallelZone image resulting from #109, it'll get the dependencies ParallelZone needs, plus ParallelZone itself. Similarly, when SimDE pulls release images of PluginPlay it'll get PluginPlay and PluginPlay's dependencies (including ParallelZone and its dependencies). And so on, up the stack.

So, you mean that for every repo we pull some image, either the base image or the release image of another repo, add missing dependencies if necessary, then build/test (maybe also install?) the repo, and finally build a release image for this repo? Do I understand the workflow correctly? Thanks.

yzhang-23 commented 1 year ago

Another question I want to discuss: in the actions

  • add_licenses.yaml
  • c-cpp.yaml
  • deploy_docs.yaml
  • format.yaml
  • tag_and_release.yaml
  • test_docs.yaml

only c-cpp.yaml, deploy_docs.yaml and test_docs.yaml have dependencies. Other actions would not benefit from a pre-built docker image. In addition, building/testing/deploying docs only needs a few packages installed, so we can build a very small image for the corresponding actions. Therefore, the real task here is to build docker images that facilitate the building/testing of the repos, right? Please correct me if my understanding is wrong. Thanks.

ryanmrichard commented 1 year ago

So, you mean that for every repo we pull some image, either the base image or the release image of another repo, add missing dependencies if necessary, then build/test (maybe also install?) the repo, and finally build a release image for this repo? Do I understand the workflow correctly? Thanks.

Yeah that's right. FWIW, I think we only need to install the repo when we build the release image (unless we want to test that installation works, although to some extent building the release image will do that).
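
As a sketch of that last step, the release-image job could look something like the following; the action version, Dockerfile path, and image name are assumptions for illustration, not an agreed-upon recipe:

# Hypothetical release-image step: the Dockerfile installs ParallelZone on top
# of the base image so that downstream repos can pull the result.
- uses: docker/build-push-action@v4
  with:
    context: .
    file: Dockerfile.release                                     # hypothetical file
    push: true
    tags: ghcr.io/nwchemex-project/release_parallelzone:latest   # hypothetical name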

Another question I want to discuss: in the actions

  • add_licenses.yaml
  • c-cpp.yaml
  • deploy_docs.yaml
  • format.yaml
  • tag_and_release.yaml
  • test_docs.yaml

only c-cpp.yaml, deploy_docs.yaml and test_docs.yaml have dependencies. Other actions would not benefit from a pre-built docker image. In addition, building/testing/deploying docs only needs a few packages installed, so we can build a very small image for the corresponding actions. Therefore, the real task here is to build docker images that facilitate the building/testing of the repos, right? Please correct me if my understanding is wrong. Thanks.

You are correct; however, for a first pass it's probably easier to just use the same base image for all workflows in a particular repo. We can go back and optimize the base image for each workflow in a subsequent pass (and only if need be; if it's negligible overhead to use the full image, it's probably not worth our time to optimize this).

yzhang-23 commented 1 year ago

So, you mean that for every repo we pull some image, either the base image or the release image of another repo, add missing dependencies if necessary, then build/test (maybe also install?) the repo, and finally build a release image for this repo? Do I understand the workflow correctly? Thanks.

Yeah that's right. FWIW, I think we only need to install the repo when we build the release image (unless we want to test that installation works, although to some extent building the release image will do that).

Another question I want to discuss: in the actions

  • add_licenses.yaml
  • c-cpp.yaml
  • deploy_docs.yaml
  • format.yaml
  • tag_and_release.yaml
  • test_docs.yaml

only c-cpp.yaml, deploy_docs.yaml and test_docs.yaml have dependencies. Other actions would not benefit from a pre-built docker image. In addition, building/testing/deploying docs only needs a few packages installed, so we can build a very small image for the corresponding actions. Therefore, the real task here is to build docker images that facilitate the building/testing of the repos, right? Please correct me if my understanding is wrong. Thanks.

You are correct; however, for a first pass it's probably easier to just use the same base image for all workflows in a particular repo. We can go back and optimize the base image for each workflow in a subsequent pass (and only if need be; if it's negligible overhead to use the full image, it's probably not worth our time to optimize this).

My problem is: usually everything we do in a container stays in the container. I have to figure out how to make changes outside of the container we are running in, such as deploying and publishing. I know this is possible, but I have searched the web and found very few people talking about it.

ryanmrichard commented 1 year ago

Can't you just add something like:

runs-on: ubuntu-latest
container:
  image: <base_image>

to the existing workflows like c-cpp.yaml (possibly temporarily bypassing the re-usable workflows in the .github repo)?

yzhang-23 commented 1 year ago

Can't you just add something like:

runs-on: ubuntu-latest
container:
  image: <base_image>

to the existing workflows like c-cpp.yaml (possibly temporarily bypassing the re-usable workflows in the .github repo)?

For the building/testing task, doing it in a container is OK. I already have an action ready for this (the tests pass in my forked repo, but I'm still fighting with the TOKEN issue for the NWChemEx repos); for the other tasks, such as doc publishing, I'm still looking for solutions that work inside a container.

ryanmrichard commented 1 year ago

What happens if you just put:

runs-on: ubuntu-latest
container:
  image: <base_image>

at the top of deploy_docs.yaml?

yzhang-23 commented 1 year ago

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@v3
  with:
    GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}
    publish_dir: ./docs/build/html

This step won't run because the container has no access to other existing actions on GitHub. I think it would be too complicated if I need to git clone the action repo and run it inside the container. There could be better solutions, but call me dumb, I haven't found them yet.

ryanmrichard commented 1 year ago

This seems to suggest it is possible, although I don't see how to do it there. Still looking.

ryanmrichard commented 1 year ago

What's the actual error?

ryanmrichard commented 1 year ago

Without the error I'm not sure if this is relevant, but maybe it's an issue with how the container is made.

yzhang-23 commented 1 year ago

This seems to suggest it is possible, although I don't see how to do it there. Still looking.

I think the example on this page just launches a container and runs some simple steps (no other GitHub actions are involved).

ryanmrichard commented 1 year ago

This seems to suggest it is possible, although I don't see how to do it there. Still looking.

I think the example on this page just launches a container and runs some simple steps (no other GitHub actions are involved).

https://stackoverflow.com/questions/63472909/github-actions-run-steps-in-container does run a GitHub action in a container though.

yzhang-23 commented 1 year ago

Without the error I'm not sure if this is relevant, but maybe it's an issue with how the container is made.

I could not test deploy_docs.yaml in the NWChemEx project because of the TOKEN issue, but other actions in my forked branch that need resources from GitHub stopped due to a lack of access. Let me run a direct test of deploy_docs.yaml in my forked branch and I will post the error message here.

yzhang-23 commented 1 year ago

This seems to suggest it is possible, although I don't see how to do it there. Still looking.

I think the example on this page just launches a container and runs some simple steps (no other GitHub actions are involved).

https://stackoverflow.com/questions/63472909/github-actions-run-steps-in-container does run a GitHub action in a container though.

OK, I also tested that actions/checkout@v3 works in the container created from my docker image for doc building. This is good news! I will continue to test other actions. Thanks for pointing this out to me!

yzhang-23 commented 1 year ago

Sorry, my previous understanding of running actions inside a container was WRONG: one CAN access GitHub actions inside a container. My previous failures at running actions inside a container were due to some path issues (I'm working on fixing those).
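
For the record, a minimal sketch of what a containerized deploy_docs job could then look like (the image name is a placeholder; the deploy step is the one quoted above, written with the action's github_token input):

jobs:
  deploy_docs:
    runs-on: ubuntu-latest
    container:
      image: <base_image>   # placeholder for the shared base image
    steps:
      - uses: actions/checkout@v3
      # ... build the docs into ./docs/build/html ...
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./docs/build/html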

ryanmrichard commented 1 year ago

Glad you figured it out!

yzhang-23 commented 1 year ago

OK, it looks like I'm not the only one who has to fight with such path issues... please see this link. Finally someone suggested replacing ${{github.action_path}} with ${GITHUB_ACTION_PATH}, and it worked. In my eyes, this is BLACK MAGIC!!!
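
To illustrate the workaround, here is a sketch of a hypothetical composite-action step (not one of our actual actions; the script path is made up):

runs:
  using: composite
  steps:
    - name: Run a bundled script
      shell: bash
      # In our container jobs the github.action_path expression expanded to a
      # path that did not work, while the GITHUB_ACTION_PATH environment
      # variable resolved correctly, hence:
      run: bash ${GITHUB_ACTION_PATH}/scripts/setup.sh   # hypothetical script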

yzhang-23 commented 1 year ago

@ryanmrichard Is it possible to create a special GitHub account and store the docker images in the registry associated with that account? I'm tackling the TOKEN issue.

ryanmrichard commented 1 year ago

What are you looking for, like a bot account?

ryanmrichard commented 1 year ago

If so, we pay for our organization's seats, so I don't think we want to create a bot account as part of the organization.

yzhang-23 commented 1 year ago

What are you looking for, like a bot account?

I'm currently facing the following TOKEN issue: with the provided token CONTAINER_REPO_TOKEN, I cannot push the docker image to the registry ghcr.io/${{github.actor}}. I think maybe we can store the images in a registry associated with a special account and generate a TOKEN that lets all developers access that registry.

ryanmrichard commented 1 year ago

TL;DR I think we want the images stored with the NWChemEx-Project organization and CONTAINER_REPO_TOKEN should definitely now have sufficient permissions to do it.

Longer answer. Are you advocating for creating a bot user and storing the images in the bot user's personal registry? If so, that's probably even harder than using the NWChemEx-Project organization. First, we have to get the bot access to the private parts of NWChemEx-Project. Second, you'll have to make sure NWChemEx-Project can access the private parts of the bot user (we don't want images of private repos to be public). My guess is you'd also have to create two sets of CI (some for the bot and some for NWChemEx-Project). There's also potentially a financial issue if the private images become too large. I also think that having the images live with the NWChemEx-Project organization will instill more confidence in potential users.

Being blunt, I feel like you're probably doing something wrong. I gave CONTAINER_REPO_TOKEN read/write permissions to the NWChemEx-Project's container registry. Just in case those aren't the correct permissions, I have temporarily upgraded CONTAINER_REPO_TOKEN to full admin privileges (it can do everything now). You can try again and see if that fixes it (obviously be very careful, since you're more or less in sudo mode right now). If it doesn't, I don't think your problem is a permissions issue.

yzhang-23 commented 1 year ago

TL;DR I think we want the images stored with the NWChemEx-Project organization and CONTAINER_REPO_TOKEN should definitely now have sufficient permissions to do it.

Longer answer. Are you advocating for creating a bot user and storing the images in the bot user's personal registry? If so, that's probably even harder than using the NWChemEx-Project organization. First, we have to get the bot access to the private parts of NWChemEx-Project. Second, you'll have to make sure NWChemEx-Project can access the private parts of the bot user (we don't want images of private repos to be public). My guess is you'd also have to create two sets of CI (some for the bot and some for NWChemEx-Project). There's also potentially a financial issue if the private images become too large. I also think that having the images live with the NWChemEx-Project organization will instill more confidence in potential users.

Being blunt, I feel like you're probably doing something wrong. I gave CONTAINER_REPO_TOKEN read/write permissions to the NWChemEx-Project's container registry. Just in case those aren't the correct permissions, I have temporarily upgraded CONTAINER_REPO_TOKEN to full admin privileges (it can do everything now). You can try again and see if that fixes it (obviously be very careful, since you're more or less in sudo mode right now). If it doesn't, I don't think your problem is a permissions issue.

I'm not in favor of a bot account just to store the images. My problem is finding the right registry accessible to all developers in the organization; the GitHub documentation only talks about registries associated with individual accounts (I would greatly appreciate it if someone could point out anything I'm missing). I think it is incorrect to store these images in any personal registry. So please downgrade the privileges of CONTAINER_REPO_TOKEN to a safer level (including read/write to packages). Let me continue to search for other solutions. Thanks.

ryanmrichard commented 1 year ago

FWIW here's the organization's registry: https://github.com/orgs/NWChemEx-Project/packages. The packages in there are associated with the org, not a user. As long as CONTAINER_REPO_TOKEN is being used by the CI of a repo owned by NWChemEx-Project, it should have sufficient privileges to read/write packages (I downgraded it back to just read/write access to packages in the registry).
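
For example, the push could look something like the following sketch (image names are placeholders, and the token is assumed to be exposed as a secret with that name); the key point is to tag under the organization namespace rather than ghcr.io/${{github.actor}}:

- uses: docker/login-action@v2
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.CONTAINER_REPO_TOKEN }}   # assumes the token is stored under this secret name
- name: Push image to the organization registry
  run: |
    # tag under the org namespace, not the personal one
    docker tag <local_image> ghcr.io/nwchemex-project/<image_name>:latest
    docker push ghcr.io/nwchemex-project/<image_name>:latest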

yzhang-23 commented 1 year ago

FWIW here's the organization's registry: https://github.com/orgs/NWChemEx-Project/packages. The packages in there are associated with the org, not a user. As long as CONTAINER_REPO_TOKEN is being used by the CI of a repo owned by NWChemEx-Project, it should have sufficient privileges to read/write packages (I downgraded it back to just read/write access to packages in the registry).

Great! Is this a public or a private registry (parallelzone_gcc11 is labelled as private now)? For private registries GitHub has volume restrictions, right?

ryanmrichard commented 1 year ago

I would have assumed that public/private is done on a per-image basis, but I don't actually know.

yzhang-23 commented 1 year ago

I would have assumed that public/private is done on a per-image basis, but I don't actually know.

This is difficult... one image could be > 2 GB, which exceeds the package size restriction for private repos. I don't want to accidentally create a financial issue during my CI testing. Also, I haven't found a way to make the package public. Let me do more research. Thanks.

ryanmrichard commented 1 year ago

There shouldn't be any financial issues. We have disabled spending for anything other than action time. So you should get rejected if you try to exceed our storage quota.

yzhang-23 commented 1 year ago

There shouldn't be any financial issues. We have disabled spending for anything other than action time. So you should get rejected if you try to exceed our storage quota.

Ok, this makes me feel better. Let me run some tests and I will update. Thanks.

ryanmrichard commented 1 year ago

Disregard my previous statement. It looks like the spending quota for actions is also used for storage. That said, it doesn't look like we're using any of our quota, even though there are three packages...

yzhang-23 commented 1 year ago

Honestly speaking, I don't quite understand the GitHub policies about these quotas. Anyway, I will test with caution (and try not to create too many large images) to minimize the chance of getting charged. I will try my best!

keceli commented 1 year ago

As far as I know GitHub hasn't started enforcing quota limitations for private repos since they want to promote ghcr. Here is the note from this link:

Billing update for container image storage: The period of free use for container image storage and bandwidth for the Container registry has been extended. If you are using Container registry you'll be informed at least one month in advance of billing commencing and you'll be given an estimate of how much you should expect to pay.

yzhang-23 commented 1 year ago

As far as I know GitHub hasn't started enforcing quota limitations for private repos since they want to promote ghcr. Here is the note from this link:

Billing update for container image storage: The period of free use for container image storage and bandwidth for the Container registry has been extended. If you are using Container registry you'll be informed at least one month in advance of billing commencing and you'll be given an estimate of how much you should expect to pay.

This will make me less nervous during my testing. Thanks for sharing!

yzhang-23 commented 1 year ago

@ryanmrichard When we build the repo with the pre-built docker image, which build mode should we use, debug or release? I think we should use release, but then we may run into slow builds, and that may consume too much of our action time, right?

ryanmrichard commented 1 year ago

Release mode shouldn't be a problem until you need to build Mokup. I would build the images in release mode.
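
For reference, "release mode" here just means configuring the CMake build with CMAKE_BUILD_TYPE=Release; a hypothetical workflow step might look like:

- name: Configure and build (release mode)
  run: |
    # any project-specific CMake options are omitted from this sketch
    cmake -Bbuild -DCMAKE_BUILD_TYPE=Release
    cmake --build build --parallel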

yzhang-23 commented 1 year ago

@ryanmrichard Just want to confirm: the action to build the repo image (e.g., ParallelZone) should reside in the corresponding repo, not in .github, right?

ryanmrichard commented 1 year ago

@yzhang-23 correct, but I wouldn't worry about building an image for ParallelZone until all of the existing CI has been switched over to use the image stored in .github.

yzhang-23 commented 1 year ago

@yzhang-23 correct, but I wouldn't worry about building an image for ParallelZone until all of the existing CI has been switched over to use the image stored in .github.

So, you mean I should be working on building the dependency images for building all the repos now, instead of actually building the release images of repos?

ryanmrichard commented 1 year ago

No. I meant to just focus on switching all of ParallelZone's existing CI workflows over to the base image in the .github repo. Once you have done that, then it makes sense to build a release image for ParallelZone. After that we can move on to the next repo.

yzhang-23 commented 1 year ago

No. I meant to just focus on switching all of ParallelZone's existing CI workflows over to the base image in the .github repo. Once you have done that, then it makes sense to build a release image for ParallelZone. After that we can move on to the next repo.

Got it. I will do that. Thanks!

yzhang-23 commented 1 year ago

@ryanmrichard Currently I still see the build script fetching cppyy from scratch during its run. Should we put cppyy into the docker image? I think most repos need cppyy for the Python tests, right?

ryanmrichard commented 1 year ago

Yeah it's probably worth installing cppyy in the base image.

yzhang-23 commented 1 year ago

Yeah it's probably worth installing cppyy in the base image.

Ok, I will update the base image.

yzhang-23 commented 1 year ago

Yeah it's probably worth installing cppyy in the base image.

Another question: currently the base image is based on Ubuntu 20.04 and Python 3.8(?), so if we run "pip install cppyy", the installed version may not be 3.0 (the newest). Is this OK?

ryanmrichard commented 1 year ago

I think we must have 3.0 of cppyy. FWIW, I think the version pip pulls down is by default the newest version, unless you specify otherwise.

yzhang-23 commented 1 year ago

@ryanmrichard Just want to confirm: for the actions

  • add_licenses.yaml
  • format.yaml
  • tag_and_release.yaml

although they have no dependencies, we currently still run them with the base docker image, right?

keceli commented 1 year ago

I think we must have 3.0 of cppyy. FWIW, I think the version pip pulls down is by default the newest version, unless you specify otherwise.

I think specifying versions would be safer, to avoid surprises:

pip install cppyy==3.0.0
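
A minimal sketch of where such a pin could live, assuming the base image is built by a workflow step like the one below (the step name is hypothetical, not the actual .github recipe; the same line could equally be a RUN command in the image's Dockerfile):

- name: Install pinned Python dependencies
  run: python3 -m pip install cppyy==3.0.0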