Please list target machines here
@wilke @ketancmaheshwari @okilic1 can you please list here the target machines you are working on and with which pipelines (pip, conda, spack)
p.s. also please use the env variable SITE instead of HOST in all yml-configs (at least conda uses the env variable HOST, and our reporting script sent a wrong site_id to the dashboard). I've fixed it for the LLNL-related configs (https://github.com/ExaWorks/SDK/pull/198, not yet merged)
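A minimal sketch of the intended convention (the reporting command and DASHBOARD_URL below are illustrative placeholders, not the actual SDK reporting script):
```
# sketch only: SITE is meant to be set per facility in the *-ci-*.yml config,
# and the reporting step should use it directly instead of falling back to HOST
export SITE="llnl"                              # e.g. llnl, anl, ornl, nersc
site_id="${SITE:?SITE must be set in the yml-config}"
echo "reporting site_id=${site_id} to ${DASHBOARD_URL:-<dashboard-url>}"
```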
Targeting:
ANL: Polaris
depends on #203 and #202
@RamonAra209 will help @ketancmaheshwari and @okilic1 with OLCF configs: pip- and conda-pipelines should be tested first, and spack-pipeline will be at the end (@MishaZakharchanka works on fixing LLNL spack-pipeline, results of that will be expanded to other facilities)
I tested the pip pipeline on Summit and it is building. I will work with @RamonAra209 to create a PR
#Setup Stage
export PIP_WORK_DIR=<need to be fixed>
export VENV_ENV_NAME=Exaworks_pip
export EXAWORKS_SDK_DIR=${PIP_WORK_DIR}/SDK
export PIP_REQUIREMENTS=${EXAWORKS_SDK_DIR}/.gitlab/pip-requirements.txt
mkdir -p ${PIP_WORK_DIR}
test -d ${PIP_WORK_DIR}/${VENV_ENV_NAME} && exit 0  # skip setup if the venv already exists
python3 -m venv ${PIP_WORK_DIR}/${VENV_ENV_NAME}
source ${PIP_WORK_DIR}/${VENV_ENV_NAME}/bin/activate
# if a python module is already loaded, it first needs to be removed before loading the required version
module rm python*
module add python/3.8.10
pip install -U pip setuptools wheel
pip cache purge
# finalize
#Build Stage
#if second time:
#source ${PIP_WORK_DIR}/${VENV_ENV_NAME}/bin/activate
#if first time (this was only needed to pull the .gitlab/pip-requirements.txt correctly):
cd ${PIP_WORK_DIR}
git clone git@github.com:ExaWorks/SDK.git #This may not be needed.
#always
pip install --no-cache-dir -r ${PIP_REQUIREMENTS}
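After the build stage, a quick sanity check can confirm that the environment is consistent (a sketch; the grep pattern simply assumes that SDK components such as parsl, radical.* and psij packages appear in pip-requirements.txt):
```
# verify that installed packages have mutually compatible dependencies
pip check
# list the SDK-related packages that ended up in the venv (pattern is illustrative)
pip list | grep -Ei 'parsl|radical|psij' || true
```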
Can we have a status update regarding the ongoing work on:
- anl-ci-conda
- *-ci-spack

p.s. with spack we have a common issue among the different facilities, but Ramon and Misha will look into the LLNL machines first
opened #217 (draft) for NERSC pip CI; setup and build steps work fine manually.
The NERSC GitLab CI config is WIP.
opened #219 (draft) for fixups to pip and conda on Lassen.
opened #221 for NERSC conda CI; setup and build steps work fine manually.
The NERSC GitLab CI just needs runners set up for Perlmutter.
for everyone involved - we need to finalize these tasks during the coming two weeks (by Dec 8th)
@wilke please add anl-ci-conda and update the corresponding -pip config the same way it is done for LLNL (e.g., llnl-ci-pip, llnl-ci-conda)
@j-woz can you please work together with @RamonAra209 on the issue related to Swift-T on LLNL machines (#174)
@ketancmaheshwari @okilic1 can you please try manual deployment of spack-pipeline (#220) on Ascent and debug issues happening there
@cowanml will merge the conda-pipeline and will try to set up the corresponding runners (this can be reused for ORNL pipelines to set up runners for Summit and/or Crusher and/or Frontier, but that is lower priority)
@mtitov I tested the Spack build. I am not getting any errors, but the build gets killed during the Rust install:
```
==> Installing rust-1.73.0-a7vw4opbhqcgheyzexta4mpro4wa6spo [37/37]
==> No binary for rust-1.73.0-a7vw4opbhqcgheyzexta4mpro4wa6spo found: installing from source
==> Fetching https://static.rust-lang.org/dist/rustc-1.73.0-src.tar.gz
==> No patches needed for rust
==> rust: Executing phase: 'configure'
==> rust: Executing phase: 'build'
Killed
```
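Not a confirmed diagnosis, but a bare "Killed" during the rust build phase usually points to the OOM killer on a shared login node; a possible workaround sketch (the project placeholder and job parameters are hypothetical):
```
# cap the number of parallel build jobs to reduce peak memory during the rust build
spack install -j 4 rust
# or run the spack build inside an interactive LSF allocation instead of on a login node
# bsub -P <project> -W 2:00 -nnodes 1 -Is /bin/bash
```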
Summary:
ExaWorks SDK CI was deployed and configured to run on HPC platforms from four computing facilities: ALCF/ANL (Polaris), LLNL (Lassen, Quartz, Ruby), NERSC (Perlmutter) and OLCF/ORNL (Ascent/Summit). Its goal is to test ExaWorks SDK components on different architectures and within different execution environments.
Testing comprises the deployment of each component, including resolving dependency incompatibilities with other components, and running a base test example per component. We used the GitLab infrastructure to conduct these operations, thus we describe the whole testing procedure as a GitLab Pipeline. We provided three pipelines per facility, where each pipeline uses a particular package manager: pip, conda and spack. While we tried to enable all three pipelines for each facility, only two of them (pip and conda) were resilient to environment changes; the spack-pipeline was tuned and enabled for LLNL platforms only.
The GitLab infrastructure allowed us to automate pipeline runs and schedule them to run daily. For most facilities, it also provides default GitLab runners (a.k.a. agents) used to run pipelines on the corresponding platforms. The NERSC facility doesn't provide such runners, so we set up daily runs with the gitlab-runner package and the cron tool using crontab.
Results about the status of the runs were collected in the Dashboard (originally developed for the PSI/J testing infrastructure and extended to include SDK testing).
minor correction: the NERSC implementation used Slurm's scrontab.
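For reference, a daily scrontab entry might look roughly like the following (a sketch; the account, QOS, walltime and script path are placeholders, not the actual NERSC setup):
```
# scrontab accepts sbatch-style options on #SCRON lines, followed by a
# standard cron schedule and the command to run
#SCRON --account=<project>
#SCRON --qos=cron
#SCRON --time=02:00:00
0 6 * * * $HOME/exaworks-sdk-ci/run_pipelines.sh
```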
While working on the scrontab implementation, it occurred to me:
While allocating an entire node for tests is easy and sufficient, it may be prohibitively expensive for ongoing daily testing. Even if sufficient node hours are available, it's likely a bit wasteful on most systems.
If trying to minimize the daily resources consumed, the initial run, which sets up the envs, may need more resources than subsequent runs. It would be nice to quantify the required resources for the two cases to give people contributing tests a starting point towards minimizing cost.
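As a possible starting point for that quantification (assuming a Slurm-based site; the job ids below are placeholders), Slurm accounting can compare the two cases directly:
```
# compare elapsed time and peak memory of an initial env-setup run
# vs a subsequent (cached) run
sacct -j <initial_jobid>,<subsequent_jobid> \
      --format=JobID,JobName,Elapsed,MaxRSS,AllocNodes,NodeList
```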
NERSC also suggested getting a "collaboration account" set up to run recurring production tasks, enabling multiple people to monitor and maintain things. This seems like a good best practice to recommend for testing deployments at sites with such a capability. https://docs.nersc.gov/accounts/collaboration_accounts/
LLNL CI pipelines were set up per package manager and the jobs are independent for each facility machine (job failures on one machine don't affect jobs on other machines), thus I would propose to follow these examples (.gitlab/llnl-*.yml), e.g., llnl-ci-pip with the corresponding versions of the {conda,pip}-requirements.{yml,txt} files. With these example config files, use the same structure to configure your assigned machines. Create the corresponding PRs only after you have tested the setup process manually.