CI pipelines per facility (LLNL, OLCF/ORNL, ALCF/ANL) #195

Closed: mtitov closed this issue 9 months ago

mtitov commented 1 year ago

The LLNL CI pipelines were set up per package manager, and jobs are independent for each facility machine (job failures on one machine don't affect jobs for other machines). I would therefore propose following these examples (.gitlab/llnl-*.yml):

```
<facility>-ci.yml
|- <facility>-ci-conda.yml
|- <facility>-ci-pip.yml
|- <facility>-ci-spack.yml
```
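
For reference, the parent `<facility>-ci.yml` could pull in the per-package-manager configs with GitLab's `include:` keyword (a minimal sketch; the exact paths are assumptions based on the LLNL layout):

```yaml
# <facility>-ci.yml -- parent config (paths are illustrative)
include:
  - local: .gitlab/<facility>-ci-conda.yml
  - local: .gitlab/<facility>-ci-pip.yml
  - local: .gitlab/<facility>-ci-spack.yml
```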

Create the corresponding PRs only after you have tested the setup process manually.

Using these example config files, follow the same structure to configure your assigned machines.

mtitov commented 1 year ago

Please list the target machines here.

mtitov commented 1 year ago

@wilke @ketancmaheshwari @okilic1 can you please list here the target machines you are working on and which pipelines (pip, conda, spack) you are targeting?

p.s. please also use the env variable SITE instead of HOST in all yml-configs (at least the conda config uses the env variable HOST, and our reporting script was sending a wrong site_id to the dashboard). I've fixed it for the LLNL-related configs (https://github.com/ExaWorks/SDK/pull/198, not yet merged).
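
In each yml-config this amounts to something like the following (a sketch; the value shown is illustrative):

```yaml
variables:
  SITE: "llnl"   # was HOST; the dashboard's site_id is derived from SITE
```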

wilke commented 1 year ago

Targeting:

- ANL: Polaris (depends on #203 and #202)

mtitov commented 1 year ago

@RamonAra209 will help @ketancmaheshwari and @okilic1 with the OLCF configs: the pip and conda pipelines should be tested first, and the spack pipeline will come last (@MishaZakharchanka is working on fixing the LLNL spack pipeline, and the results of that will be extended to the other facilities).

okilic1 commented 1 year ago

I tested the pip pipeline on Summit and it is building. I will work with @RamonAra209 to create a PR.

```
# Setup stage
export PIP_WORK_DIR=<need to be fixed>
export VENV_ENV_NAME=Exaworks_pip
export EXAWORKS_SDK_DIR=${PIP_WORK_DIR}/SDK
export PIP_REQUIREMENTS=${EXAWORKS_SDK_DIR}/.gitlab/pip-requirements.txt

mkdir -p ${PIP_WORK_DIR}
# skip setup if the virtual environment already exists
test -d ${PIP_WORK_DIR}/${VENV_ENV_NAME} && exit 0
python3 -m venv ${PIP_WORK_DIR}/${VENV_ENV_NAME}
source ${PIP_WORK_DIR}/${VENV_ENV_NAME}/bin/activate

# if a python module is already loaded, it needs to be removed first
module rm python*
module add python/3.8.10
pip install -U pip setuptools wheel
pip cache purge
# finalize

# Build stage
# if second time:
#   source ${PIP_WORK_DIR}/${VENV_ENV_NAME}/bin/activate
# if first time (this was only needed to pull .gitlab/pip-requirements.txt correctly):
cd ${PIP_WORK_DIR}
git clone git@github.com:ExaWorks/SDK.git  # this may not be needed
# always:
pip install --no-cache-dir -r ${PIP_REQUIREMENTS}
```

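Once the manual steps work, they could be wrapped into GitLab jobs along the lines of the LLNL configs (a rough sketch; the stage names, runner tag, and helper script path are assumptions, not the final config):

```yaml
# <facility>-ci-pip.yml -- hypothetical sketch of the steps above as CI jobs
stages:
  - setup
  - build

pip-setup:
  stage: setup
  tags: [summit]                    # assumed runner tag for the target machine
  script:
    - bash .gitlab/pip-setup.sh     # hypothetical script holding the setup-stage commands

pip-build:
  stage: build
  tags: [summit]
  script:
    - source ${PIP_WORK_DIR}/${VENV_ENV_NAME}/bin/activate
    - pip install --no-cache-dir -r ${PIP_REQUIREMENTS}
```
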
mtitov commented 1 year ago

Can we have a status update regarding the ongoing work?

p.s. with spack we have a common issue across the different facilities, but Ramon and Misha are looking into the LLNL machines first

cowanml commented 1 year ago

Opened #217 (draft) for NERSC pip CI; the setup and build steps work fine manually.

The NERSC GitLab CI config is WIP.

cowanml commented 12 months ago

Opened #219 (draft) with fixups for the pip and conda pipelines on Lassen.

cowanml commented 11 months ago

Opened #221 for NERSC conda CI; the setup and build steps work fine manually.

NERSC GitLab CI just needs runners set up for Perlmutter.

mtitov commented 11 months ago

For everyone involved: we need to finalize these tasks within the coming two weeks (by Dec 8th).

okilic1 commented 11 months ago

@mtitov I tested the Spack build. I am not getting any errors, but the build was killed while compiling Rust:

```
==> Installing rust-1.73.0-a7vw4opbhqcgheyzexta4mpro4wa6spo [37/37]
==> No binary for rust-1.73.0-a7vw4opbhqcgheyzexta4mpro4wa6spo found: installing from source
==> Fetching https://static.rust-lang.org/dist/rustc-1.73.0-src.tar.gz
==> No patches needed for rust
==> rust: Executing phase: 'configure'
==> rust: Executing phase: 'build'
Killed
```

mtitov commented 9 months ago

Summary:

ExaWorks SDK CI was deployed and configured to run on HPC platforms from four computing facilities: ALCF/ANL (Polaris), LLNL (Lassen, Quartz, Ruby), NERSC (Perlmutter) and OLCF/ORNL (Ascent/Summit). Its goal is to test ExaWorks SDK components on different architectures and within different execution environments.

Testing covers the deployment of each component, including resolving dependency incompatibilities with other components, and running a base test example per component. We used the GitLab infrastructure to conduct these operations, and thus describe the whole testing procedure as a GitLab pipeline. We provided three pipelines per facility, where each pipeline uses a particular package manager: pip, conda and spack. While we tried to enable all three pipelines for each facility, only two of them (pip and conda) proved resilient to environment changes; the spack pipeline was tuned and enabled for LLNL platforms only.

The GitLab infrastructure allowed us to automate pipeline runs and schedule them to run daily. Most facilities also provide default GitLab runners (a.k.a. agents) that run pipelines on the corresponding platforms. The NERSC facility doesn't provide such runners, so we set up daily runs ourselves with the gitlab-runner package and the cron tool using crontab.
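
For jobs that should run only in those scheduled pipelines, GitLab can gate on the pipeline source (a small illustrative sketch, not taken from the actual configs; the job name is hypothetical):

```yaml
nightly-tests:                                   # hypothetical job name
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'    # run only in scheduled (daily) pipelines
  script:
    - echo "runs only on the daily schedule"
```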

Results about the status of the runs were collected in the Dashboard (originally developed for the PSI/J testing infrastructure and extended to include SDK testing).

cowanml commented 9 months ago

Minor correction: the NERSC implementation used Slurm's scrontab.

While working on the scrontab implementation, it occurred to me:

NERSC also suggested setting up a "collaboration account" to run recurring production tasks, enabling multiple people to monitor and maintain things. That seems like a good practice to recommend for testing deployments at sites with such a capability: https://docs.nersc.gov/accounts/collaboration_accounts/