jluethi opened 7 months ago
I can now confirm that I run into this issue reproducibly when many jobs run at once, while the same workflow runs through fine when the cluster is busy and only 150-300 srun statements run concurrently. For example, with 10 jobs of 154 srun statements each, everything works if only 2 of them are submitted together, but submitting 5-8 together leads to the "Connection reset by peer" issue.
Some updates on 1536-well plate testing: adapting the server config to allow 50 SLURM jobs to run at once instead of 10 got things running for this example (without enabling the new sleep-after-sbatch mode). This led to more queueing on SLURM itself, instead of in the internal queues caused by the many simultaneous srun calls.
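For reference, a minimal sketch of the throttling idea, written as a plain bash loop rather than the actual fractal-server implementation (the batch_*.sh naming pattern and the 1-second pause are assumptions):
#!/bin/bash
# Submit every batch script, pausing briefly after each sbatch call so the
# SLURM controller is not hit by a burst of simultaneous submissions.
for SCRIPT in batch_*.sh; do
    sbatch "$SCRIPT"
    sleep 1   # arbitrary pause; tune to what the cluster tolerates
done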
Also, after the last SLURM job finished, it took about 30 s for Fractal web to mark the job as done. Given 1536 srun commands and some SLURM grace period for writing all the files, that's quite good performance!
Performance-wise, it's hard to judge with a cluster as busy as it was during those tests. Converting the 1536-well plate and running 2 Cellpose tasks at full resolution plus 2 measure-features tasks (not the fastest task, due to slow texture measurements) took on the order of 10 h. That's slower than I'd expect for 2D-only data with just 1 FOV per well, but it's full resolution with slow measurements, and a big part is due to the busy cluster (e.g. the measurement task ran once in 2.3 h and once in 35 min, depending on cluster load). Having to initialize Cellpose models for 1 image per well was certainly an inefficiency, though.
Useful scripts for testing (single_task.sh, single_submit.sh, and many_submit.sh):
#!/bin/bash
DATE=$(date)
echo "DATE=$DATE 1=$1 2=$2" >> "dummy-output/$1_$2"
sleep 1
#!/bin/sh
#SBATCH --partition=REDACTED
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem=17G
#SBATCH --nodes=1
#SBATCH --job-name=test-154-srun
#SBATCH --err=slurm_%j.err
#SBATCH --out=slurm_%j.out
echo "Working directory (pwd): `pwd`"
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 0 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 1 &
...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 152 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 153 &
wait
#!/bin/bash
rm -r dummy-output
mkdir dummy-output
for INDEX in {1..10}; do
date
sbatch single_submit.sh $INDEX
done
After running many_submit.sh and waiting for the ten SLURM jobs to finish, one can count the number of files in dummy-output (a quick check is sketched after the log excerpt below). The expected number is 1540, but in some cases we observe a smaller number (e.g. 1529). In those cases, the SLURM output files include logs like the following (for SLURM v15):
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410780: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410781: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to confirm allocation for job 10410784: Transport endpoint is not connected
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
srun: error: io_init_msg_read too small
srun: error: failed reading io init message
srun: error: slurm_receive_msg: Transport endpoint is not connected
srun: error: Unable to create job step: Transport endpoint is not connected
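A minimal sketch of the file-count check described above, assuming many_submit.sh was run from the current directory and all ten jobs have finished:
#!/bin/bash
# Compare the number of marker files written by the dummy tasks with the
# expected total of 10 jobs x 154 srun calls = 1540.
EXPECTED=1540
FOUND=$(ls dummy-output | wc -l)
echo "Found $FOUND of $EXPECTED expected files"
if [ "$FOUND" -lt "$EXPECTED" ]; then
    echo "Some srun calls left no output; check the slurm_*.err/out files"
fi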
For the record, we hit this issue again while debugging #1507 with @mfranzon.
In that case, the SLURM job was made of tens of srun calls, and the SLURM node had very limited resources (2-4 CPUs and 2 G of memory). The simultaneous submission of many srun calls, together with their retries, led to a communication error again (something related to a socket not receiving SLURM messages).
For https://github.com/fractal-analytics-platform/fractal-server/issues/1459#issuecomment-2120584669, we dealt with it by splitting the list of srun calls into blocks, with several wait lines (see the sketch below).
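A minimal sketch of that blocked structure, reusing the single_task.sh dummy from above with an arbitrary block size of 8 concurrent srun calls:
#!/bin/sh
# Block 1: launch 8 steps in the background, then wait for all of them
# before starting the next block.
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 0 &
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 1 &
# ... srun lines for indices 2-6 ...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 7 &
wait
# Block 2: only started after the first block has completed.
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 8 &
# ... srun lines for indices 9-14 ...
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=1GB ./single_task.sh $1 15 &
wait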
Noting that we observed something similar again. This was with fractal-server 2.3.0a2, with about 1100 submission scripts like this one:
#!/bin/sh
#SBATCH --partition=main
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=2
#SBATCH --mem=60000M
#SBATCH --job-name=XXXX
#SBATCH --err=/XXXX/10_batch_001137_slurm_%j.err
#SBATCH --out=/XXX/10_batch_001137_slurm_%j.out
#SBATCH -D XXX
srun --ntasks=1 --cpus-per-task=$SLURM_CPUS_PER_TASK --mem=7500MB /XXX/server/fractal-server-env/bin/python -m fractal_server.app.runner.executors.slurm.remote --input-file /XXX/server/artifacts/XXX_in_Luti7Cjy0gJMNTDDNfXU90LVhC8vvRxh.pickle --output-file /XXX/XXX_out_Luti7Cjy0gJMNTDDNfXU90LVhC8vvRxh.pickle &
...
# seven more srun statements like this
wait
for a total of around 9000 images (roughly 1100 jobs with 8 srun calls each).
At some point, at least one job failed like this:
$ cat 10_batch_000675_slurm_10448428.err
slurmstepd: slurm_receive_msg: Socket timed out on send/recv operation
I'm running a 1536-well workflow on the FMI deployment and will report some stress-testing takeaways.
This is a (sanitized) version of the workflow I'm running:
1536_well_workflow.json
Number of log files for a single (parallel) task in the user folder: 6144, i.e. 4 per parallel unit (1x metadiff.json, 1x the pickle output file, 1x .args.json, 1x .log).
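As a quick cross-check of those numbers, one can count the files per suffix in that folder (a sketch; the folder-path argument and the task index 4 are assumptions):
#!/bin/bash
# Count the per-unit artifacts of one parallel task (here task order 4) in the
# given folder; with 1536 parallel units one expects 1536 files per suffix.
DIR=${1:-.}
for SUFFIX in .metadiff.json .args.json .log; do
    echo "$SUFFIX: $(ls "$DIR"/4_par_*"$SUFFIX" 2>/dev/null | wc -l)"
done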
On the server side of the logs: I still get tons of files like 4_par_0001535.metadiff.json, but all they contain is "null" => not sure why those get written to disk (but they came from legacy tasks). Besides files like 3_par_0001204.metadiff.json, I also get 3_par_0001205.args.json server-side, but no .log files here (is this intentional?).
I sometimes get failing workflows due to job execution errors, with errors like: example_error_1536_well_plate.txt
Most likely explanation:
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: error: Unable to confirm allocation for job 193231: Connection reset by peer
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 193231
Most likely, these jobs are lost: for one example task, 116 units ran successfully and 38 failed with the Connection reset by peer error, summing up to the total of 154.
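A minimal sketch of how one could get a rough tally of those errors from the SLURM log files (the log directory and the *.err naming are assumptions, and this counts matching lines rather than lost units):
#!/bin/bash
# Rough tally of "Connection reset by peer" occurrences across the SLURM .err
# files; several lines can belong to a single failed srun, so this is only an
# indication, not an exact count of lost units.
LOGDIR=${1:-.}
grep -h "Connection reset by peer" "$LOGDIR"/*.err 2>/dev/null | sort | uniq -c | sort -rn
echo "Total matching lines: $(grep -h 'Connection reset by peer' "$LOGDIR"/*.err 2>/dev/null | wc -l)"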