CrayLabs / SmartSim

SmartSim Infrastructure Library.
BSD 2-Clause "Simplified" License
228 stars 36 forks source link

Prevent launchers to launch incompatible `Step` objects #594

Open al-rigazzi opened 4 months ago

al-rigazzi commented 4 months ago

Description

Launchers should raise an error if users attempt to run an incompatible Step.

Justification

Launchers determine the correct way of launching a Step through its type. For example, the main branching of the SlurmLauncher.run() method looks like this:

        if isinstance(step, SbatchStep):
            # wait for batch step to submit successfully
            return_code, out, err = self.task_manager.start_and_wait(cmd_list, step.cwd)
            if return_code != 0:
                raise LauncherError(f"Sbatch submission failed\n {out}\n {err}")
            if out:
                step_id = out.strip()
                logger.debug(f"Gleaned batch job id: {step_id} for {step.name}")

        # Launch a in-allocation or on-allocation (if srun) command
        elif isinstance(step, SrunStep):
            task_id = self.task_manager.start_task(cmd_list, step.cwd)
        else:
            # MPI/local steps don't direct output like slurm steps
            out, err = step.get_output_files()

            # pylint: disable-next=consider-using-with
            output = open(out, "w+", encoding="utf-8")
            # pylint: disable-next=consider-using-with
            error = open(err, "w+", encoding="utf-8")
            task_id = self.task_manager.start_task(
                cmd_list, step.cwd, step.env, out=output.fileno(), err=error.fileno()
            )

The last catch-all else branch will try to run any Step which is not a SrunStep or an SbatchStep as some type of mpirun step. A stronger type checking should be enforced to avoid user-errors stemming from wrong script portings (e.g. running a JsrunStep through the SlurmLauncher.

Implementation Strategy

All if/elif branches should ensure a compatible step is being launched, all compatible steps should be covered by the branches (mpirun steps should be identified in an extensible way). The else branch should raise an error.