jluethi opened 7 months ago
I can now confirm that I run into this issue reproducibly when many jobs run at once, while the same workflow runs through fine when the cluster is busy and only 150-300 srun statements run concurrently. For example, with 10 jobs of 154 srun statements each, everything works if only 2 of them are submitted together, but submitting 5-8 together leads to the "Connection reset by peer" issue.
Some updates on 1536-well plate testing: adapting the server config to allow 50 SLURM jobs to run at once instead of 10 got things running for this example (without enabling the new sleep-after-sbatch mode). This led to more queueing on SLURM itself, instead of in the internal queues caused by the many simultaneous srun calls.
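For reference, a minimal sketch of the throttling idea, written as a plain bash loop rather than the actual fractal-server implementation (the batch_*.sh naming pattern and the 1-second pause are assumptions):
#!/bin/bash
# Submit every batch script, pausing briefly after each sbatch call so the
# SLURM controller is not hit by a burst of simultaneous submissions.
for SCRIPT in batch_*.sh; do
    sbatch "$SCRIPT"
    sleep 1   # arbitrary pause; tune to what the cluster tolerates
done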
Also, after the last SLURM job finished, it took about 30 s for Fractal web to mark the job as done. Given 1536 srun commands and some SLURM grace period for writing all the files, that's quite good performance!
Performance-wise, it's hard to judge with a cluster as busy as it was during those tests. Converting the 1536-well plate and running 2 Cellpose tasks at full resolution plus 2 measure-features tasks (not the fastest task, due to slow texture measurements) took on the order of 10 h. That's slower than I'd expect for 2D-only data with just 1 FOV per well, but it's full resolution with slow measurements, and a big part is due to the busy cluster (e.g. the measurement task ran once in 2.3 h and once in 35 min, depending on cluster load). Having to initialize Cellpose models for 1 image per well was certainly an inefficiency, though.
Useful scripts for testing (single_task.sh, single_submit.sh, and many_submit.sh):
#!/bin/bash
DATE=$(date)
echo "DATE=$DATE 1=$1 2=$2" >> "dummy-output/$1_$2"
sleep 1
#!/bin/sh
#SBATCH --partition=REDACTED
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=17G
#SBATCH --nodes=1
#SBATCH --job-name=test-154-srun
#SBATCH --err=slurm_%j.err
#SBATCH --out=slurm_%j.out
echo "Working directory (pwd): `pwd`"
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 0 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 1 &
...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 152 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 153 &
wait
#!/bin/bash
rm -r dummy-output
mkdir dummy-output
for INDEX in {1..10}; do
date
sbatch single_submit.sh $INDEX
done
After running many_submit.sh and waiting for the ten SLURM jobs to finish, one can count the number of files in dummy-output (a quick check is sketched after the log excerpt below). The expected number is 1540, but in some cases we observe a smaller number (e.g. 1529). In those cases, the SLURM output files include logs like the following (for SLURM v15):
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410781: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410784: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: io_init_msg_read too small
srun: error: failed reading io init message
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
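A minimal sketch of the file-count check described above, assuming many_submit.sh was run from the current directory and all ten jobs have finished:
#!/bin/bash
# Compare the number of marker files written by the dummy tasks with the
# expected total of 10 jobs x 154 srun calls = 1540.
EXPECTED=1540
FOUND=$(ls dummy-output | wc -l)
echo "Found $FOUND of $EXPECTED expected files"
if [ "$FOUND" -lt "$EXPECTED" ]; then
    echo "Some srun calls left no output; check the slurm_*.err/out files"
fi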
For the record, we hit this issue again while debugging #1507 with @mfranzon.
In that case, the SLURM job was made of tens of srun calls, and the SLURM node had very limited resources (2-4 CPUs and 2 G of memory). The simultaneous submission of many srun calls, together with their retries, led to a communication error again (something related to a socket not receiving SLURM messages).
For https://github.com/fractal-analytics-platform/fractal-server/issues/1459#issuecomment-2120584669, we dealt with it by splitting the list of srun calls into blocks, with several wait lines (see the sketch below).
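A minimal sketch of that blocked structure, reusing the single_task.sh dummy from above with an arbitrary block size of 8 concurrent srun calls:
#!/bin/sh
# Block 1: launch 8 steps in the background, then wait for all of them
# before starting the next block.
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 0 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 1 &
# ... srun lines for indices 2-6 ...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 7 &
wait
# Block 2: only started after the first block has completed.
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 8 &
# ... srun lines for indices 9-14 ...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 15 &
wait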
Noting that we observed something similar again. This was with fractal-server 2.3.0a2, with about 1100 submission scripts like this one:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=60000M
#SBATCH --job-name=XXXX
#SBATCH --err=/XXXX/10_batch_001137_slurm_%j.err
#SBATCH --out=/XXX/10_batch_001137_slurm_%j.out
#SBATCH -D XXX
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=7500MB /XXX/server/fractal-server-env/bin/python -m fractal_server.app.runner.executors.slurm.remote --input-file /XXX/server/artifacts/XXX_in_Luti7Cjy0gJMNTDDNfXU90LVhC8vvRxh.pickle --output-file /XXX/XXX_out_Luti7Cjy0gJMNTDDNfXU90LVhC8vvRxh.pickle &
...
# seven more srun statements like this
wait
for a total of around 9000 images (roughly 1100 jobs with 8 srun calls each).
At some point, at least one job failed like this:
$ cat 10_batch_000675_slurm_10448428.err
slurmstepd: slurm_receive_msg: Socket timed out on send/recv operation
I'm running a 1536-well workflow on the FMI deployment and will report some stress-testing takeaways.
This is a (sanitized) version of the workflow I'm running:
1536_well_workflow.json
Number of log files for a single (parallel) task in the user folder: 6144, i.e. 4 per parallel unit (1x metadiff.json, 1x the pickle output file, 1x .args.json, 1x .log).
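As a quick cross-check of those numbers, one can count the files per suffix in that folder (a sketch; the folder-path argument and the task index 4 are assumptions):
#!/bin/bash
# Count the per-unit artifacts of one parallel task (here task order 4) in the
# given folder; with 1536 parallel units one expects 1536 files per suffix.
DIR=${1:-.}
for SUFFIX in .metadiff.json .args.json .log; do
    echo "$SUFFIX: $(ls "$DIR"/4_par_*"$SUFFIX" 2>/dev/null | wc -l)"
done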
On the server side of the logs: I still get tons of files like 4_par_0001535.metadiff.json, but all they contain is "null" => not sure why those get written to disk (but they came from legacy tasks). Besides files like 3_par_0001204.metadiff.json, I also get 3_par_0001205.args.json server-side, but no .log files here (is this intentional?).
I sometimes get failing workflows due to job execution errors, with errors like: example_error_1536_well_plate.txt
Most likely explanation:
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
Most likely, these jobs are lost: for one example task, 116 units ran successfully and 38 failed with the Connection reset by peer error, summing up to the total of 154.
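A minimal sketch of how one could get a rough tally of those errors from the SLURM log files (the log directory and the *.err naming are assumptions, and this counts matching lines rather than lost units):
#!/bin/bash
# Rough tally of "Connection reset by peer" occurrences across the SLURM .err
# files; several lines can belong to a single failed srun, so this is only an
# indication, not an exact count of lost units.
LOGDIR=${1:-.}
grep -h "Connection reset by peer" "$LOGDIR"/*.err 2>/dev/null | sort | uniq -c | sort -rn
echo "Total matching lines: $(grep -h 'Connection reset by peer' "$LOGDIR"/*.err 2>/dev/null | wc -l)"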