cea-hpc / pdwfs

A simple Redis-backed distributed virtual filesystem for co-execution of HPC and data analytics workloads
Apache License 2.0

Cannot run the redis servers and the simulation on different resources #6

Closed mathaefele closed 4 years ago

mathaefele commented 4 years ago

Describe the bug

The title describes the first issue. With up to two Redis servers it works: the data in the result file are correct and I get the following standard output:

[PDWFS][init] Start central Redis instance on miriel056.plafrim.cluster:34000
waitkey 1
[PDWFS][59705][TRACE][C] intercepting fopen(path=staged/Cpok, mode=w)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fclose(stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting close(fd=5)
[PDWFS][59705][TRACE][C] intercepting close(fd=5)
[PDWFS][59705][TRACE][C] calling libc close
simu: running on host miriel056
redis-cli -h miriel056.plafrim.cluster -p 34000 --scan
addr
PONG
[PDWFS][59742][TRACE][C] intercepting fopen(path=staged/Cpok, mode=r)
[PDWFS][59742][TRACE][C] intercepting fread(ptr=0x7fff707ac8c0, size=1, nmemb=2560, stream=0xbbef00)
[PDWFS][59742][TRACE][C] intercepting fclose(stream=0xbbef00)
[PDWFS][59742][TRACE][C] intercepting close(fd=5)
[PDWFS][59742][TRACE][C] intercepting close(fd=5)
[PDWFS][59742][TRACE][C] calling libc close
[PDWFS][59742][TRACE][C] intercepting fopen(path=resC, mode=w)
[PDWFS][59742][TRACE][C] calling libc fopen
[PDWFS][59742][TRACE][C] intercepting fprintf(stream=0xbbf140, ...)
[PDWFS][59742][TRACE][C] intercepting fputs(s=Hello444
, stream=0xbbf140)
[PDWFS][59742][TRACE][C] calling libc fputs
[PDWFS][59742][TRACE][C] intercepting fclose(stream=0xbbf140)
[PDWFS][59742][TRACE][C] calling libc fclose
post-process: running on host miriel056
post-process: Hello444
waitkey 1

However, the Redis servers, the simulation and the post-processing are all running on node miriel056. I tried several options but did not manage to get anything else.

To Reproduce

The job script, which uses my C hello-world programs from #2:

#!/bin/bash
#SBATCH --job-name=pdwfs_hello
#SBATCH --time=0:02:00
#SBATCH --nodes=2

work_directory="${SLURM_JOB_NAME}_${SLURM_JOB_ID}"
mkdir -p "${work_directory}/staged"
cd "${work_directory}"
ln ../simu .
ln ../post-process .

echo $SLURM_JOB_NODELIST > node_list

# Initialize the Redis instances:
pdwfs-slurm init -N 1 -n 1 -i ib0

# pdwfs-slurm produces a session file with some environment variables to source
source pdwfs.session

# the pdwfs command will forward all I/O under the staged/ directory to the Redis instances
WITH_PDWFS="pdwfs -t -p staged"

# Execute the simulation on 1 task
srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu 
host=`echo $PDWFS_CENTRAL_REDIS |cut -d':' -f 1`
port=`echo $PDWFS_CENTRAL_REDIS |cut -d':' -f 2`
echo "redis-cli -h $host -p $port --scan"
redis-cli -h $host -p $port --scan
redis-cli -h $host -p $port ping

srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process 

# gracefully shuts down Redis instances
pdwfs-slurm finalize

# pdwfs-slurm uses srun in background to execute Redis instances
# wait for background srun to complete
wait

I tried to fill the first 16 cores of the first node with Redis instances: it works with 2 Redis instances but not with more. I get the following error message with 4:

[PDWFS][init] Start central Redis instance on miriel018.plafrim.cluster:34000
Could not connect to Redis at miriel018.plafrim.cluster:34000: Connection refused
[PDWFS][init] Error: the central Redis instance is not responding
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
github.com/cea-hpc/pdwfs/redisfs.Pipe.Do(0x2ba6bc3e7880, 0xc0000109d0, 0x2ba6bc0fb6e6, 0x4, 0xc00000c340, 0x2, 0x2)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:207 +0x99
github.com/cea-hpc/pdwfs/redisfs.(*Inode).initMeta(0xc0000ce0f0, 0x180bc0fb401)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/inodes.go:61 +0x357
github.com/cea-hpc/pdwfs/redisfs.NewRedisFS(0xc00008a680, 0xc00000c280, 0xc000076d68)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/fs.go:85 +0x1bb
main.NewPdwFS(0xc000010970, 0xe)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:83 +0xf9
main.InitPdwfs(0x7ffccbadb450, 0x0, 0x400)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:158 +0x72
main._cgoexpwrap_c1e4f2bfaf13_InitPdwfs(0x7ffccbadb450, 0x0, 0x400)
    _cgo_gotypes.go:281 +0x41
/home/mhaefele/public/opt/pdwfs/bin/pdwfs : line 86 : 16433 Aborted                 (core dumped)$*
srun: error: miriel018: task 0: Exited with exit code 134
redis-cli -h  -p  --scan
Could not connect to Redis at -p:6379: Name or service not known
Could not connect to Redis at -p:6379: Name or service not known
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
srun: error: miriel018: task 0: Exited with exit code 134
github.com/cea-hpc/pdwfs/redisfs.Pipe.Do(0x2b40aa975880, 0xc0000109d0, 0x2b40aa6896e6, 0x4, 0xc00000c340, 0x2, 0x2)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:207 +0x99
github.com/cea-hpc/pdwfs/redisfs.(*Inode).initMeta(0xc0000ce0f0, 0x180aa689401)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/inodes.go:61 +0x357
github.com/cea-hpc/pdwfs/redisfs.NewRedisFS(0xc00008a680, 0xc00000c280, 0xc000076d68)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/fs.go:85 +0x1bb
main.NewPdwFS(0xc000010970, 0xe)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:83 +0xf9
main.InitPdwfs(0x7ffe11c712e0, 0x0, 0x400)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:158 +0x72
main._cgoexpwrap_c1e4f2bfaf13_InitPdwfs(0x7ffe11c712e0, 0x0, 0x400)
    _cgo_gotypes.go:281 +0x41
/home/mhaefele/public/opt/pdwfs/bin/pdwfs : line 86 : 16473 Aborted                 (core dumped)$*
[PDWFS][finalize] Error: pdwfs-slurm init command failed

Expected behavior

I would like to have a way of telling pdwfs to run on a different node than the simulation. There seem to be ways to do that with Slurm, but as everything is embedded in pdwfs-slurm, I do not know to what extent this has to be put back in the job script.
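What I have in mind is something like srun's explicit node placement, for instance a variant of the lines above (just a guess on my side, I have not checked how it interacts with what pdwfs-slurm sets up):

srun --mpi=none -N 1 -n 1 --nodelist=miriel057 $WITH_PDWFS ./simu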

Thanks for your help. Mat

JCapul commented 4 years ago

Hi Mat, thanks for the issue!

What I don't get is that, from the very first line of your first example log, it seems you are logged in on miriel056 (the central Redis instance is launched locally on the node the user is logged in on). From this node you must be running sbatch (I guess), but then everything is scheduled on this same node miriel056... that's a weird sbatch behaviour... but I have probably missed something.

mathaefele commented 4 years ago

No, no. The login node is "devel02" and mirielXXX are compute nodes. This run was allocated miriel056 and miriel057. So the central server is launched on miriel056, as are the simulation and the post-processing. And that's one of the questions...

JCapul commented 4 years ago

Ok thanks, I think I'm starting to understand a bit better.

The central Redis instance is not the instance where data are stored. It is a sort of manager instance used in the process of spawning the cluster of Redis instances that will actually stage the data. This central Redis instance is launched directly by executing the Redis binary, not through srun, while the other Redis instances are launched through srun.

The consequence is that the central Redis instance runs on the login node if one uses salloc to run the job (what I usually do for debugging), or on one of the allocated nodes in the case of sbatch (what you are doing). And by the way, I just realized salloc and sbatch behave differently in this respect, which is why I got confused initially...

So the question now is: on which node has the Redis instance used for staging data been run? Is it on miriel056 or miriel057? You should be able to check by running sacct -j your_job_id -o JobName,NodeList and looking at the node on which the job step "redis.srun" ran.
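Alternatively, while the job is still running, something like this should also show the step placement (a hedged suggestion, I have not checked it on your system):

squeue --steps --user=$USER
# the NODELIST column of the "redis.srun" step is the node hosting the staging Redis instance(s)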

As for your second issue, I will look into it.

mathaefele commented 4 years ago

mhaefele@devel03:C $ sacct -j 3781 -o JobName,NodeList
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to localhost:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

... I will contact my admins and come back to you when I have the required inputs.

mathaefele commented 4 years ago

sacct -j 3781 -o JobName,NodeList

   JobName        NodeList
---------- ---------------
pdwfs_hel+ miriel[056-057]
     batch       miriel056
    extern miriel[056-057]
redis.srun       miriel056
     pdwfs       miriel056
     pdwfs       miriel056

I am not sure I understand: I see neither my simu nor my post-processing... But they print that they are running on miriel056. So everything seems to run on miriel056...

JCapul commented 4 years ago

Ok thanks, there must be some slurm configuration magic I am not aware of...

Could you try launching your applications using the -r option of srun? It makes explicit on which node(s) you want to run your app, using a relative numbering scheme starting at 0:

srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
...
srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process

And regarding your simu and post-processing in sacct: since they are wrapped by the pdwfs command-line script, that's what Slurm records. Not very handy, I admit...
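One possible workaround (untested on my side) would be to give the job steps explicit names with srun's -J/--job-name option, so that they show up under their own names in sacct:

srun -r1 -J simu --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
srun -r1 -J post-process --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process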

mathaefele commented 4 years ago

I made some tests with the -r option, and indeed, the processes are executed on different nodes.

But the behaviour is not reproducible. The same script executed on the same nodes sometimes gives the correct result and sometimes breaks with an error very similar to the one mentioned above:

[PDWFS][init] Start central Redis instance on miriel018.plafrim.cluster:34000
Could not connect to Redis at miriel018.plafrim.cluster:34000: Connection refused
[PDWFS][init] Error: the central Redis instance is not responding
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
    /home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
....

And I tried several times this afternoon, and with the -r option it was always broken... I am groping in the dark...

mathaefele commented 4 years ago

After several trials and errors, I managed to make it work with several Redis instances and the post-processing on one node and the simulation on another!
The easiest setup is to work with an interactive job. Some still-not-understood combination of sbatch + creating and changing working directories + some bash commands failing + running on the same node as a previous failed or successful run still produces the error shown in the issue text.
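For the record, the interactive setup that worked looks roughly like this (a sketch from memory; the instance count and the -r values are only there to illustrate putting the Redis instances and post-processing on one node and the simulation on the other):

salloc -N 2 --time=0:10:00
pdwfs-slurm init -N 1 -n 2 -i ib0     # instance count illustrative
source pdwfs.session
srun -r1 --mpi=none -N 1 -n 1 pdwfs -t -p staged ./simu
srun -r0 --mpi=none -N 1 -n 1 pdwfs -t -p staged ./post-process
pdwfs-slurm finalize
wait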

I am closing this issue as it is not an issue any more. Hopefully I will come back to you with a more precise issue on this next time.