Hi Mat, thanks for the issue!
What I don't get is that it seems, from the very first line of your first example log, that you are logged in on miriel056 (the central Redis instance is launched locally on the node the user is logged in on). From this node you must be running sbatch (I guess), but then everything is scheduled on this same node miriel056... that's a weird sbatch behaviour... but I have probably missed something.
No, no. The login node is "devel02" and mirielXXX are compute nodes. This run has been allocated miriel056 and miriel057. So the central server is launched on miriel056, as well as the simulation and the post-processing. And that's one of the questions...
Ok thanks, I think I'm starting to understand a bit better.
The central Redis instance is not the instance where data are stored. It is a sort of manager instance that is used in the process of spawning the cluster of Redis instances that will be used for staging data. This central Redis instance is launched directly by executing the Redis binary, not through srun, while the other Redis instances are launched through srun.
The consequence of this is that the central Redis instance runs on the login node if one uses salloc to run the job (what I usually do for debugging), or on one of the allocated nodes in the case of sbatch (what you are doing). And btw, I just realized salloc and sbatch have a different behaviour in this respect, which is why I got confused initially...
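To make that concrete, here is a rough sketch of the two launch paths (the port numbers and options are only illustrative, the actual pdwfs-slurm internals may differ slightly):

# central "manager" Redis instance: plain binary execution, so it runs
# on whatever node executes the script (login node with salloc,
# first allocated node with sbatch)
redis-server --port 34000 --daemonize yes

# staging Redis instance(s): launched as a Slurm job step, so Slurm
# decides the placement within the allocation
srun -N 1 -n 1 redis-server --port 6379 &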
So the question now is: on which node has the Redis instance used for staging data been run? Is it on miriel056 or miriel057?
You should be able to check by running sacct -j your_job_id -o JobName,NodeList
and looking at which node the job step "redis.srun" ran on.
As for your second issue, I will look into it.
mhaefele@devel03:C $ sacct -j 3781 -o JobName,NodeList
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to localhost:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
... I will contact my admins and come back to you when I have the required inputs.
sacct -j 3781 -o JobName,NodeList

   JobName        NodeList
pdwfs_hel+ miriel[056-057]
     batch       miriel056
    extern miriel[056-057]
redis.srun       miriel056
     pdwfs       miriel056
     pdwfs       miriel056
I am not sure I understand: I do not see my simu, nor my post-processing... But they print that they are running on miriel056. So everything seems to run on miriel056...
Ok thanks, there must be some slurm configuration magic I am not aware of...
Could you try launching your applications using the -r option of srun? This makes it explicit on which node(s) you want to run your app, using a relative numbering scheme starting at 0:
srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
...
srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process
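As a quick sanity check (just a suggestion, nothing pdwfs-specific), you can print which node each relative index maps to in your allocation:

srun -r0 -N 1 -n 1 hostname    # first allocated node, e.g. miriel056
srun -r1 -N 1 -n 1 hostname    # second allocated node, e.g. miriel057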
And regarding your simu and post-processing in sacct: since they are wrapped by the pdwfs command line script, that's what Slurm is recording. Not very handy, I admit...
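A possible workaround (my suggestion, not something pdwfs provides): if I remember correctly, srun's --job-name (-J) option also names the job step, so the steps show up with recognizable names in sacct:

srun -J simu -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
srun -J post-process -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process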
I made some tests with the -r option, and indeed, the processes are executed on different nodes.
But I get non-reproducible behaviour. The same script executed on the same nodes sometimes gives the correct result, and sometimes breaks with an error very similar to the one mentioned above:
[PDWFS][init] Start central Redis instance on miriel018.plafrim.cluster:34000
Could not connect to Redis at miriel018.plafrim.cluster:34000: Connection refused
[PDWFS][init] Error: the central Redis instance is not responding
panic: dial tcp :6379: connect: connection refused
goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
....
And I tried several times this afternoon, and with the -r option it was always broken... I am stumbling around in the dark...
After several trials and errors, I managed to make it work: the several Redis instances and the post-processing run on one node and the simulation on another!
The easiest setup is to work with an interactive job. Some still-not-understood combinations of sbatch + creating and changing into working directories + having some bash commands fail + running on the same node as a previous failed or successful run still produce the error from the issue text.
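For the record, a rough sketch of the setup that worked for me (reconstructed from memory; the pdwfs-slurm initialization itself is the same as in the original job script):

salloc -N 2                                                  # interactive job on two nodes
# ... pdwfs-slurm setup as in the job script: the Redis instances
#     end up on the first allocated node (relative index 0) ...
srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu             # simulation on the second node
srun -r0 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process     # post-processing next to the Redis instances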
I am closing this issue as it is not an issue any more. Hopefully I will come back to you with a more precise issue on this next time.
Describe the bug
The title is the first issue. With up to two Redis servers it works, the data are correct in the result file, and I get the following std output:
However, the Redis servers, the simulation and the post-processing are all running on the node miriel056. I played around with several options but did not manage to get anything else.
To Reproduce
The job script that uses my C hello worlds from #2:
I tried to fill the first 16 cores of the first node with Redis instances: it works with 2 Redis instances but not more. I get the following error message with 4:
Expected behavior
I would like to have a way of telling pdwfs to run on a different node than the simulation. There seem to be ways to do that with Slurm but, as everything is embedded in pdwfs-slurm, I do not know to which extent this has to be put back in the job script. Thanks for your help. Mat
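For reference, these are the generic Slurm placement mechanisms I had in mind (whether they compose with what pdwfs-slurm does internally is exactly my question):

srun --nodelist=miriel057 -N 1 -n 1 ./simu     # pin a job step to an explicit node
srun --exclude=miriel056 -N 1 -n 1 ./simu      # keep a job step off a given node
srun -r1 -N 1 -n 1 ./simu                      # relative node index within the allocation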