University-of-Delaware-IT-RCI / auto_tmpdir

Slurm SPANK plugin for automated handling of temporary directories for jobs.
BSD 2-Clause "Simplified" License

shared tempdirs are not created #16

Open dirkpetersen opened 1 week ago

dirkpetersen commented 1 week ago

This happens on Rocky Linux 9: shared temp dirs are not created with

srun --use-shared-tmpdir --pty bash

The plugin was built with:

cmake -DCMAKE_BUILD_TYPE=Release \
   -DAUTO_TMPDIR_ENABLE_SHARED_TMPDIR=On \
   -DAUTO_TMPDIR_DEFAULT_SHARED_PREFIX=/arc/scratch1/jobs \
   -DSLURM_MODULES_DIR=/usr/lib64/slurm \
   ..

and with this in /etc/slurm/plugstack.conf:

optional    auto_tmpdir.so  mount=/tmp mount=/var/tmp local_prefix=/mnt/scratch/tmpdir- shared_prefix=/arc/scratch1/slurm/job- no_rm_shared_only state_dir=/var/tmp/auto_tmpdir_cache
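(For reference: SPANK hands those plugstack.conf arguments to the plugin as argv-style strings in every callback, so the shared_prefix value above should reach the plugin in each context. A minimal sketch of how such an argument is typically picked up, assuming standard SPANK behavior rather than this plugin's actual parser:)

#include <string.h>
#include <slurm/spank.h>

static const char *shared_prefix_arg = NULL;

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    /* av[] holds the plugstack.conf arguments, e.g.
       "shared_prefix=/arc/scratch1/slurm/job-". */
    for (int i = 0; i < ac; i++) {
        if (strncmp(av[i], "shared_prefix=", 14) == 0)
            shared_prefix_arg = av[i] + 14;
    }
    return ESPANK_SUCCESS;
}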
jtfrey commented 1 week ago

Have you enabled debug logging on slurmd — are there any messages logged that indicate what is failing?

dirkpetersen commented 1 week ago

It seems it never executes the code that is supposed to create the shared directory:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:2 Boards:1 Sockets:2 CoresPerSocket:1 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 1:0:0
slurmd: debug:  CPUs has been set to match cores per node instead of threads CPUs=1:2(hw)
slurmd: error: Node configuration differs from hardware: CPUs=1:2(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:1(hw) ThreadsPerCore=1:1(hw)
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/cgroup_v2.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cgroup v2 plugin type:cgroup/v2 version:0x170b0a
slurmd: debug:  cgroup/v2: init: Cgroup v2 plugin loaded
slurmd: debug3: Success.
slurmd: debug3: _set_slurmd_spooldir: initializing slurmd spool directory `/var/spool/slurmd`
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:2 Boards:1 Sockets:2 CoresPerSocket:1 ThreadsPerCore:1
slurmd: debug4: CPU map[0]=>0 S:C:T 0:0:0
slurmd: debug4: CPU map[1]=>1 S:C:T 1:0:0
slurmd: debug3: _pack_context_buf: No GRES context count sent to stepd
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/topology_default.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:topology Default plugin type:topology/default version:0x170b0a
slurmd: topology/default: init: topology Default plugin loaded
slurmd: debug3: Success.
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug3: NodeName    = localhost
slurmd: debug3: TopoAddr    = localhost
slurmd: debug3: TopoPattern = node
slurmd: debug3: ClusterName = cluster
slurmd: debug3: Confile     = `/etc/slurm/slurm.conf'
slurmd: debug3: Debug       = 3
slurmd: debug3: CPUs        = 1  (CF:  1, HW:  2)
slurmd: debug3: Boards      = 1  (CF:  1, HW:  1)
slurmd: debug3: Sockets     = 1  (CF:  1, HW:  2)
slurmd: debug3: Cores       = 1  (CF:  1, HW:  1)
slurmd: debug3: Threads     = 1  (CF:  1, HW:  1)
slurmd: debug3: UpTime      = 52699 = 14:38:19
slurmd: debug3: Block Map   = 0,1
slurmd: debug3: Inverse Map = 0,1
slurmd: debug3: ConfMemory  = 1
slurmd: debug3: PhysicalMem = 7495
slurmd: debug3: TmpDisk     = 203683
slurmd: debug3: Epilog      = `(null)'
slurmd: debug3: Logfile     = `/var/log/slurmd.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName    = localhost
slurmd: debug3: Port        = 6818
slurmd: debug3: Prolog      = `(null)'
slurmd: debug3: TmpFS       = `/tmp'
slurmd: debug3: Slurmstepd  = `/usr/sbin/slurmstepd'
slurmd: debug3: Spool Dir   = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug  = 10
slurmd: debug3: Pid File    = `/var/run/slurmd.pid'
slurmd: debug3: Slurm UID   = 5829
slurmd: debug3: TaskProlog  = `(null)'
slurmd: debug3: TaskEpilog  = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: UsePAM      = 0
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/proctrack_cgroup.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Process tracking via linux cgroup freezer subsystem type:proctrack/cgroup version:0x170b0a
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/task_affinity.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:task affinity plugin type:task/affinity version:0x170b0a
slurmd: debug3: task/affinity: slurm_getaffinity: sched_getaffinity(0) = 0x3
slurmd: task/affinity: init: task affinity plugin loaded with CPU mask 0x3
slurmd: debug3: Success.
slurmd: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
slurmd: debug3: plugin_peek->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
slurmd: debug3: Couldn't find sym 'slurm_spank_local_user_init' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_user_init' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_task_exit' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
slurmd: debug3: Couldn't find sym 'slurm_spank_exit' in the plugin
slurmd: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
slurmd: debug:  SPANK: appending plugin option "no-rm-tmpdir"
slurmd: debug:  SPANK: appending plugin option "use-shared-tmpdir"
slurmd: debug2: spank: auto_tmpdir.so: init = 0
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x170b0a
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: debug3: Success.
slurmd: debug4: xsignal: Swap signal TERM[15] to 0x40cee3 from 0x0
slurmd: debug4: xsignal: Swap signal INT[2] to 0x40cee3 from 0x0
slurmd: debug4: xsignal: Swap signal HUP[1] to 0x40bc56 from 0x0
slurmd: debug4: xsignal: Swap signal USR2[12] to 0x40bc6c from 0x0
slurmd: debug3: slurmd initialization successful
slurmd: slurmd version 23.11.10 started
slurmd: debug3: finished daemonize
slurmd: debug3: create_mmap_buf: loaded file `/var/spool/slurmd/cred_state` as buf_t
slurmd: debug3: cred_unpack: job 103 ctime:1725380631 revoked:1725380631 expires:1725380751
slurmd: debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/prep_script.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Script PrEp plugin type:prep/script version:0x170b0a
slurmd: debug3: Success.
slurmd: debug:  MPI: Loading all types
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/mpi_cray_shasta.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi Cray Shasta plugin type:mpi/cray_shasta version:0x170b0a
slurmd: debug3: Success.
slurmd: debug3: Trying to load plugin /usr/lib64/slurm/mpi_pmi2.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:mpi PMI2 plugin type:mpi/pmi2 version:0x170b0a
slurmd: debug3: Success.
slurmd: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
slurmd: debug3: Successfully opened slurm listen port 6818
slurmd: slurmd started on Tue, 03 Sep 2024 09:26:51 -0700
slurmd: CPUs=1 Boards=1 Sockets=1 Cores=1 Threads=1 Memory=7495 TmpDisk=203683 Uptime=52699 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
slurmd: debug3: _registration_engine complete
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 104
slurmd: debug3: task/affinity: _get_avail_map: slurmctld s 1 c 1; hw s 2 c 1 t 1
slurmd: debug3: task/affinity: _get_avail_map: StepId=104.batch core mask from slurmctld: 0x1
slurmd: debug3: task/affinity: _get_avail_map: StepId=104.batch CPU final mask for local node: 0x1
slurmd: task/affinity: batch_bind: job 104 CPU input mask for node: 0x1
slurmd: debug3: task/affinity: _lllp_map_abstract_masks: _lllp_map_abstract_masks
slurmd: task/affinity: batch_bind: job 104 CPU final HW mask for node: 0x1
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank prolog
spank-prolog: debug2: debug level read from slurmd is 'unknown'.
spank-prolog: debug2: _read_slurmd_conf_lite: slurmd sent 8 TRES.
spank-prolog: debug:  Running spank/prolog for jobid [104] uid [1000] gid [1000]
spank-prolog: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
spank-prolog: debug3: plugin_peek->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-prolog: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-prolog: debug3: Couldn't find sym 'slurm_spank_local_user_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_user_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_exit' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_exit' in the plugin
spank-prolog: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
spank-prolog: debug:  SPANK: appending plugin option "no-rm-tmpdir"
spank-prolog: debug:  SPANK: appending plugin option "use-shared-tmpdir"
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: 104 for owner 1000:1000
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: no_rm_shared_only set, ensuring no should_not_delete bit in options
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: local_prefix=/mnt/scratch/tmpdir-
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: shared_prefix=/arc/scratch1/jobs/job-
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-104/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-104/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-104/tmp` -> `/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-104/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-104/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-104/var_tmp` -> `/var/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/dev/shm/slurm-104`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/dev/shm/slurm-104`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/dev/shm/slurm-104` -> `/dev/shm`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: 104
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_serialize_to_file: serialized to `/var/tmp/auto_tmpdir_cache/auto_tmpdir_fs-104.cache`
spank-prolog: debug2: spank: auto_tmpdir.so: job_prolog = 0
slurmd: debug:  unsetenv (SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir)
slurmd: Launching batch job 104 for UID 1000
slurmd: debug3: _rpc_batch_job: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank -1 (localhost), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_batch_job: return from _forkexec_slurmstepd: 0
slurmd: debug2: Finish processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job: uid = 5829 JobId=104
slurmd: debug:  credential for job 104 revoked
slurmd: debug2: No steps in jobid 104 to send signal 18
slurmd: debug2: No steps in jobid 104 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug2: set revoke expiration for jobid 104 to 1725380980 UTS
slurmd: debug:  Waiting for job 104's prolog to complete
slurmd: debug:  Finished wait for job 104's prolog to complete
slurmd: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank epilog
spank-epilog: debug2: debug level read from slurmd is 'unknown'.
spank-epilog: debug2: _read_slurmd_conf_lite: slurmd sent 8 TRES.
spank-epilog: debug:  Running spank/epilog for jobid [104] uid [1000] gid [1000]
spank-epilog: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
spank-epilog: debug3: plugin_peek->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-epilog: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-epilog: debug3: Couldn't find sym 'slurm_spank_local_user_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_user_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_exit' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_exit' in the plugin
spank-epilog: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
spank-epilog: debug:  SPANK: appending plugin option "no-rm-tmpdir"
spank-epilog: debug:  SPANK: appending plugin option "use-shared-tmpdir"
spank-epilog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: 104
spank-epilog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/dev/shm/slurm-104` -> `/dev/shm` (0|1) 0x1e90430
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/dev/shm/slurm-104`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory 0x1e90430
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/mnt/scratch/tmpdir-104/tmp` -> `/tmp` (0|0) 0x1e90400
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-104/tmp`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory 0x1e90400
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/mnt/scratch/tmpdir-104/var_tmp` -> `/var/tmp` (0|0) (nil)
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-104/var_tmp`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory (nil)
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-104`
spank-epilog: debug2: spank: auto_tmpdir.so: job_epilog = 0
slurmd: debug:  unsetenv (SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir)
slurmd: debug:  completed epilog for jobid 104
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug:  JobId=104: sent epilog complete msg: rc = 0
slurmd: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
jtfrey commented 1 week ago

It definitely does execute — when the job is started there are prolog lines citing "auto_tmpdir":

spank-prolog: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
spank-prolog: debug:  SPANK: appending plugin option "no-rm-tmpdir"
spank-prolog: debug:  SPANK: appending plugin option "use-shared-tmpdir"
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: 104 for owner 1000:1000
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: no_rm_shared_only set, ensuring no should_not_delete bit in options
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: local_prefix=/mnt/scratch/tmpdir-
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: shared_prefix=/arc/scratch1/jobs/job-
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-104/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-104/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-104/tmp` -> `/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-104/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-104/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-104/var_tmp` -> `/var/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/dev/shm/slurm-104`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/dev/shm/slurm-104`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/dev/shm/slurm-104` -> `/dev/shm`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: 104
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_serialize_to_file: serialized to `/var/tmp/auto_tmpdir_cache/auto_tmpdir_fs-104.cache`
spank-prolog: debug2: spank: auto_tmpdir.so: job_prolog = 0

There are similar epilog lines in your cited log, too. How are you determining that the plugin is not creating the necessary directories?

dirkpetersen commented 1 week ago

I have /arc/scratch1/jobs/ configured as the shared scratch, and the plugin never attempts to create anything there:

cat /etc/slurm/plugstack.conf
required    auto_tmpdir.so  mount=/tmp mount=/var/tmp local_prefix=/mnt/scratch/tmpdir- shared_prefix=/arc/scratch1/jobs/job- no_rm_shared_only state_dir=/var/tmp/auto_tmpdir_cache
jtfrey commented 1 week ago

But the auto_tmpdir plugin definitely is executing when the job starts and ends; your log shows that. Note that in your log the local_prefix is being used, not the shared_prefix, so clearly no directories will be created in /arc/scratch1/jobs. So why is the shared prefix not being used for your job? If you add -v to your sbatch/salloc, does the plugin display any information?

dirkpetersen commented 1 week ago

Yes, it works well for all local temp; I just get this with -vvvvvv. The shared /arc/scratch1/jobs is currently a folder on the local machine owned by the slurm user. Or are you saying that if I have configured both local_prefix=/mnt/scratch/tmpdir- and shared_prefix=/arc/scratch1/slurm/job- in plugstack.conf, it will never work?

[rocky@dirk1 ~]$ sbatch -vvvvvvvvvvvvvv --use-shared-tmpdir --wrap="hostname"
sbatch: auto_tmpdir:  will use shared tempororary directory under `/arc/scratch1/jobs`
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: verbose             : 14
sbatch: wrap                : hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: auto_tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
sbatch: debug:  propagating RLIMIT_STACK=8388608
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
sbatch: debug:  propagating RLIMIT_NPROC=29700
sbatch: debug:  propagating RLIMIT_NOFILE=1024
sbatch: debug:  propagating RLIMIT_MEMLOCK=8388608
sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0022
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
Submitted batch job 107
dirkpetersen commented 1 week ago

There is a little more output from an interactive session:

[rocky@dirk1 ~]$ srun -vvvvvvvvvv --use-shared-tmpdir --pty bash
srun: defined options
srun: -------------------- --------------------
srun: pty                 :
srun: verbose             : 10
srun: -------------------- --------------------
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=0
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=29700
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_MEMLOCK=8388608
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug:  Entering slurm_allocation_msg_thr_create()
srun: debug:  port from net_stream_listen is 42175
srun: debug:  Entering _msg_thr_internal
srun: debug4: eio: handling events for 1 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 3
srun: debug4: xsignal: Swap signal INT[2] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal QUIT[3] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal CONT[18] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal TERM[15] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal HUP[1] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal ALRM[14] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal USR1[10] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal USR2[12] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x4097ff
srun: debug4: xsignal: Swap signal ALRM[14] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x4097ff
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x1
srun: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x4097ff
srun: debug4: xsignal: Swap signal ALRM[14] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x4097ff
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x1
srun: Nodes localhost are ready for job
srun: jobid 110: nodes(1):`localhost', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug2: cpu:1 is not a gres
srun: debug:  requesting job 110, user 99, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name bash, relative 65534
srun: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x4097ff
srun: debug4: xsignal: Swap signal ALRM[14] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x4097ff
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x1
srun: debug2: winsize 45:240
srun: debug2: initialized job control port 40911
srun: debug4: xsignal: Swap signal 28[28] to 0x4174ab from 0x0
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug:  initialized stdio listening socket, port 36201
srun: debug4: xsignal: Swap signal TTIN[21] to 0x1 from 0x0
srun: debug:  Started IO server thread
srun: debug3: IO thread pid = 23851
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3:   false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug:  Entering _launch_tasks
srun: launching StepId=110.0 on host localhost, 1 tasks: 0
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug3: Trying to load plugin /usr/lib64/slurm/topology_default.so
srun: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:topology Default plugin type:topology/default version:0x170b0a
srun: topology/default: init: topology Default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to localhost
srun: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x4097ff
srun: debug4: xsignal: Swap signal ALRM[14] to 0x4097ff from 0x0
srun: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x4097ff
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x1
srun: debug2: Tree head got back 1
srun: debug:  launch returned msg_rc=0 err=0 type=8001
srun: debug2: waiting for SIGWINCH
srun: debug3: Called _listening_socket_read
srun: debug2: Activity on IO listening socket 15
srun: debug3: Accepted IO connection: ip=127.0.0.1:34200 sd=16
srun: debug2: Entering io_init_msg_read_from_fd
srun: debug2: Leaving io_init_msg_read_from_fd
srun: debug2: Entering io_init_msg_validate
srun: debug3:   msg->version = 2800
srun: debug3:   msg->nodeid = 0
srun: debug2: Leaving io_init_msg_validate
srun: debug2: Validated IO connection from 127.0.0.1:34200, node rank 0, sd=16
srun: debug3: msg.stdout_objs = 1
srun: debug3: msg.stderr_objs = 0
srun: debug3: eio_message_socket_accept: start
srun: debug2: eio_message_socket_accept: got message connection from 127.0.0.1:60128 17
srun: debug2: received task launch
srun: Node localhost, 1 tasks started
srun: debug3: task_state_update: StepId=110.0 task_id=0, TS_START_SUCCESS
srun: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x4097ff
srun: debug4: xsignal: Swap signal PIPE[13] to 0x4097ff from 0x1
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 12
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug4: Entering _server_read
srun: debug3: Entering io_hdr_read_fd
srun: debug3: Leaving io_hdr_read_fd
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug2: Entering _file_write
srun: debug3:   wrote 18 bytes
srun: debug2: Leaving  _file_write
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug4: Entering _server_read
srun: debug3: Entering io_hdr_read_fd
srun: debug3: Leaving io_hdr_read_fd
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug2: Entering _file_write
srun: debug3:   wrote 8 bytes
srun: debug2: Leaving  _file_write
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug4: Entering _server_read
srun: debug3: Entering io_hdr_read_fd
srun: debug3: Leaving io_hdr_read_fd
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
srun: debug4:   false
srun: debug4: Called _server_readable
srun: debug4: remote_stdout_objs = 1
srun: debug4: remote_stderr_objs = 0
srun: debug2: Entering _file_write
[rocky@dirk1 ~]$ srun: debug3:   wrote 17 bytes
srun: debug2: Leaving  _file_write
srun: debug4: eio: handling events for 5 objects
srun: debug2: Called _file_readable
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug2: Called _file_writable
srun: debug3:   false
srun: debug3:   eof is false
srun: debug3: Called _listening_socket_readable
srun: debug4: Called _server_writable
jtfrey commented 1 week ago

I can confirm that it does not work on our system either (Slurm 20.11.5). The problem is with getting the SPANK prolog/epilog steps to "see" the CLI options that were presented to sbatch/salloc.

I made a new branch with modifications meant to fix up the CLI option handling in the prolog/epilog contexts. Please give the new branch a try.
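For context: SPANK never invokes a plugin's option callback in the job-script (prolog/epilog) context, so the plugin has to pull the forwarded value explicitly with spank_option_getopt(). A minimal sketch of that pattern, assuming standard SPANK behavior (illustrative only, not the actual patch):

#include <slurm/spank.h>

SPANK_PLUGIN(auto_tmpdir_sketch, 1);

static int shared_requested = 0;

/* Fired in the sbatch/salloc/srun process when the user passes
   --use-shared-tmpdir on the command line. */
static int _opt_use_shared(int val, const char *optarg, int remote)
{
    shared_requested = 1;
    return ESPANK_SUCCESS;
}

static struct spank_option use_shared_opt = {
    "use-shared-tmpdir", NULL,
    "Create the per-job temporary directory on shared storage.",
    0, 0, (spank_opt_cb_f) _opt_use_shared
};

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    /* Register the option in every context, including S_CTX_JOB_SCRIPT,
       so it can be looked up again in the prolog/epilog. */
    return spank_option_register(sp, &use_shared_opt);
}

int slurm_spank_job_prolog(spank_t sp, int ac, char **av)
{
    char *optarg = NULL;

    /* The callback above never fires here; fetch the cached value. */
    if (spank_option_getopt(sp, &use_shared_opt, &optarg) == ESPANK_SUCCESS)
        shared_requested = 1;

    slurm_debug("sketch: use-shared-tmpdir = %d", shared_requested);
    return ESPANK_SUCCESS;
}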

dirkpetersen commented 1 week ago

I tried the new code but it does not seem to make a difference:

cmake -DCMAKE_BUILD_TYPE=Release \
  -DSLURM_MODULES_DIR=/usr/lib64/slurm \
  -DAUTO_TMPDIR_ENABLE_SHARED_TMPDIR=On \
  -DAUTO_TMPDIR_DEFAULT_SHARED_PREFIX=/arc/scratch1/jobs \
  ..

[rocky@dirk1 ~]$ sbatch -vvvvvvvvvvvvvv --use-shared-tmpdir --wrap="hostname"
sbatch: auto_tmpdir:  will use shared tempororary directory under `/arc/scratch1/jobs`
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: verbose             : 14
sbatch: wrap                : hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: auto_tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
sbatch: debug:  propagating RLIMIT_STACK=8388608
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
sbatch: debug:  propagating RLIMIT_NPROC=29700
sbatch: debug:  propagating RLIMIT_NOFILE=1024
sbatch: debug:  propagating RLIMIT_MEMLOCK=8388608
sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0022
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
Submitted batch job 112
slurmd -D -vvvvvvvv
.
.
.
slurmd: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank prolog
spank-prolog: debug2: debug level read from slurmd is 'unknown'.
spank-prolog: debug2: _read_slurmd_conf_lite: slurmd sent 8 TRES.
spank-prolog: debug:  Running spank/prolog for jobid [112] uid [1000] gid [1000]
spank-prolog: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
spank-prolog: debug3: plugin_peek: dlopen(/usr/lib64/slurm/auto_tmpdir.so): /usr/lib64/slurm/auto_tmpdir.so: cannot open shared object file: No such file or directory
spank-prolog: debug3: plugin_peek->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-prolog: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-prolog: debug3: Couldn't find sym 'slurm_spank_local_user_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_user_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_task_exit' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
spank-prolog: debug3: Couldn't find sym 'slurm_spank_exit' in the plugin
spank-prolog: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
spank-prolog: debug:  SPANK: appending plugin option "no-rm-tmpdir"
spank-prolog: debug:  SPANK: appending plugin option "use-shared-tmpdir"
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: 112 for owner 1000:1000
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: no_rm_shared_only set, ensuring no should_not_delete bit in options
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: local_prefix=/mnt/scratch/tmpdir-
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_init: shared_prefix=/arc/scratch1/jobs/job-
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-112/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-112/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-112/tmp` -> `/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/mnt/scratch/tmpdir-112/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/mnt/scratch/tmpdir-112/var_tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/mnt/scratch/tmpdir-112/var_tmp` -> `/var/tmp`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: created directory `/dev/shm/slurm-112`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: set ownership 1000:1000 on directory `/dev/shm/slurm-112`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_create_bindpoint: added bindpoint `/dev/shm/slurm-112` -> `/dev/shm`
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: 112
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_serialize_to_file: serialized to `/var/tmp/auto_tmpdir_cache/auto_tmpdir_fs-112.cache`
spank-prolog: debug2: spank: auto_tmpdir.so: job_prolog = 0
slurmd: debug:  unsetenv (SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir)
slurmd: Launching batch job 112 for UID 1000
slurmd: debug3: _rpc_batch_job: call to _forkexec_slurmstepd
slurmd: debug3: slurmstepd rank -1 (localhost), parent rank -1 (NONE), children 0, depth 0, max_depth 0
slurmd: debug3: _rpc_batch_job: return from _forkexec_slurmstepd: 0
slurmd: debug2: Finish processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug3: in the service_connection
slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job: uid = 5829 JobId=112
slurmd: debug:  credential for job 112 revoked
slurmd: debug2: No steps in jobid 112 to send signal 18
slurmd: debug2: No steps in jobid 112 to send signal 15
slurmd: debug4: sent SUCCESS
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug2: set revoke expiration for jobid 112 to 1725480624 UTS
slurmd: debug:  Waiting for job 112's prolog to complete
slurmd: debug:  Finished wait for job 112's prolog to complete
slurmd: debug:  prep/script: _run_spank_job_script: _run_spank_job_script: calling /usr/sbin/slurmstepd spank epilog
spank-epilog: debug2: debug level read from slurmd is 'unknown'.
spank-epilog: debug2: _read_slurmd_conf_lite: slurmd sent 8 TRES.
spank-epilog: debug:  Running spank/epilog for jobid [112] uid [1000] gid [1000]
spank-epilog: debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
spank-epilog: debug3: plugin_peek: dlopen(/usr/lib64/slurm/auto_tmpdir.so): /usr/lib64/slurm/auto_tmpdir.so: cannot open shared object file: No such file or directory
spank-epilog: debug3: plugin_peek->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-epilog: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:auto_tmpdir type:spank version:0x170b0a
spank-epilog: debug3: Couldn't find sym 'slurm_spank_local_user_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_user_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_init_privileged' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_init' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_post_fork' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_task_exit' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_slurmd_exit' in the plugin
spank-epilog: debug3: Couldn't find sym 'slurm_spank_exit' in the plugin
spank-epilog: debug:  spank: /etc/slurm/plugstack.conf:1: Loaded plugin auto_tmpdir.so
spank-epilog: debug:  SPANK: appending plugin option "no-rm-tmpdir"
spank-epilog: debug:  SPANK: appending plugin option "use-shared-tmpdir"
spank-epilog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: 112
spank-epilog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/dev/shm/slurm-112` -> `/dev/shm` (0|1) 0x1d8e430
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/dev/shm/slurm-112`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory 0x1d8e430
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/mnt/scratch/tmpdir-112/tmp` -> `/tmp` (0|0) 0x1d8e400
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-112/tmp`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory 0x1d8e400
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: `/mnt/scratch/tmpdir-112/var_tmp` -> `/var/tmp` (0|0) (nil)
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-112/var_tmp`
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: moving to next directory (nil)
spank-epilog: debug:  auto_tmpdir::auto_tmpdir_fs_bindpoint_dealloc: removing directory `/mnt/scratch/tmpdir-112`
spank-epilog: debug2: spank: auto_tmpdir.so: job_epilog = 0
slurmd: debug:  unsetenv (SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir)
slurmd: debug:  completed epilog for jobid 112
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
slurmd: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
slurmd: debug:  JobId=112: sent epilog complete msg: rc = 0
slurmd: debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
jtfrey commented 1 week ago
   :
spank-prolog: debug:  auto_tmpdir::__auto_tmpdir_fs_default_state_file: state_dir=/var/tmp/auto_tmpdir_cache
spank-prolog: debug:  auto_tmpdir::auto_tmpdir_fs_serialize_to_file: serialized to `/var/tmp/auto_tmpdir_cache/auto_tmpdir_fs-112.cache`
spank-prolog: debug2: spank: auto_tmpdir.so: job_prolog = 0
slurmd: debug:  unsetenv (SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir)
   :

The fact that the job environment contains SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir (hence its being removed after the prolog step) should mean that spank_option_getopt() will return that CLI option value. You're 100% certain the new copy of the plugin you compiled (from the test branch) is installed and being loaded by slurmd?
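If it helps to rule things out, a crude probe along these lines (e.g. dropped temporarily into the plugin's prolog hook) would show whether the forwarded option is actually visible in the job environment. Purely a debugging sketch; the variable name is copied verbatim from your slurmd log, and whether spank_getenv() is usable in the job-script context may depend on the Slurm version:

char buf[16];

if (spank_getenv(sp, "SPANK__SLURM_SPANK_OPTION_auto_tmpdir_use_shared_tmpdir",
                 buf, sizeof(buf)) == ESPANK_SUCCESS)
    slurm_info("forwarded use-shared-tmpdir option IS in the job environment");
else
    slurm_info("forwarded use-shared-tmpdir option is NOT in the job environment");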

dirkpetersen commented 1 week ago

Yes, I am sure. I just added a few fprintf statements to confirm; see below:

[rocky@dirk1 ~]$ sbatch -vvvvvvvvvvvvvv --use-shared-tmpdir --wrap="hostname"
***** slurm_spank_init, loading version from 2024-09-04 1:39PM PT
***** _opt_use_shared_tmpdir, BEGIN
sbatch: auto_tmpdir:  will use shared tempororary directory under `/arc/scratch1/jobs`
***** _opt_use_shared_tmpdir, END
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: verbose             : 14
sbatch: wrap                : hostname
sbatch: -------------------- --------------------
sbatch: end of defined options
sbatch: debug2: spank: auto_tmpdir.so: init_post_opt = 0
sbatch: debug:  propagating RLIMIT_CPU=18446744073709551615
sbatch: debug:  propagating RLIMIT_FSIZE=18446744073709551615
sbatch: debug:  propagating RLIMIT_DATA=18446744073709551615
sbatch: debug:  propagating RLIMIT_STACK=8388608
sbatch: debug:  propagating RLIMIT_CORE=0
sbatch: debug:  propagating RLIMIT_RSS=18446744073709551615
sbatch: debug:  propagating RLIMIT_NPROC=29700
sbatch: debug:  propagating RLIMIT_NOFILE=1024
sbatch: debug:  propagating RLIMIT_MEMLOCK=8388608
sbatch: debug:  propagating RLIMIT_AS=18446744073709551615
sbatch: debug:  propagating SLURM_PRIO_PROCESS=0
sbatch: debug:  propagating UMASK=0022
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal ALRM[14] to 0x0 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x1 from 0x0
sbatch: debug4: xsignal: Swap signal PIPE[13] to 0x0 from 0x1
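The markers come from statements of roughly this shape (a reconstruction for the thread, not the exact lines I added):

#include <stdio.h>
#include <slurm/spank.h>

int slurm_spank_init(spank_t sp, int ac, char **av)
{
    fprintf(stderr, "***** slurm_spank_init, loading version from 2024-09-04 1:39PM PT\n");
    /* ... existing init code unchanged ... */
    return ESPANK_SUCCESS;
}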
dirkpetersen commented 1 week ago

By the way, since the Slurm dev install instructions are not good, I documented installing Slurm on localhost only, for dev purposes, on RHEL9/Rocky9, in case you find that useful:

https://github.com/dirkpetersen/dptests/blob/main/slurm/slurm-install.md