Hi, thanks for your report! I have reproduced it locally. It happens when you request the same resource multiple times, e.g. this also crashes:
```bash
hq submit --resource foo=1 --resource foo=2 ls
```
We will fix the input validation so that it doesn't crash so late and produces a better error message (and of course also avoid crashing the server :laughing:), but in any case, this is unsupported (the above should just be `--resource foo=3`).
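For clarity, the supported way to express the request above is a single `--resource` flag with the summed amount:

```bash
hq submit --resource foo=3 ls
```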
What submit command did you use? Did you use the CLI, the TOML file, or the Python API for submitting the job?
I used the CLI to submit the job. The command I ran is:
```bash
hq job submit --name="aiida-1638" --stdout=_scheduler-stdout.txt --stderr=_scheduler-stderr.txt --time-request=3600s --time-limit=3600s --cpus=32 --resource mem=120000 ./_aiidasubmit.sh
```
I started the server and have an automatic allocation (auto-alloc) queue configured.
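Roughly, the setup looks like this (a sketch only; the Slurm partition and time limit below are placeholders, not my exact values):

```bash
# Start the HyperQueue server on the login node
hq server start &

# Add an automatic allocation queue backed by Slurm;
# arguments after `--` are passed through to sbatch
hq alloc add slurm --time-limit 1h -- --partition=normal
```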
The script content is:
```bash
#!/bin/bash
module load cray/22.05 cpeIntel/22.05
module load QuantumESPRESSO/7.0
export OMP_NUM_THREADS=1
srun --cpu-bind=map_cpu:$HQ_CPUS '-s' '-n' '32' '--mem' '120000' '/capstor/apps/cscs/eiger/easybuild/software/QuantumESPRESSO/7.0-cpeIntel-22.05/bin/pw.x' '-npool' '4' '-in' 'aiida.in' > 'aiida.out'
```
The initial crash came from the worker not being able to connect: it uses the bare hostname to reach the login node, but at this supercomputer center machines are only reachable as `<hostname>.<domain>`. After that was resolved, I ran into the issue above. I didn't submit the job twice.
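In case it helps others hitting the same connectivity problem: if `hq server start` supports overriding the hostname it advertises to workers (I believe there is a `--host` option, but please verify with `hq server start --help` on your version), starting the server with the fully qualified name should avoid it:

```bash
# Advertise the FQDN instead of the bare hostname
# (assumes hq server start accepts a --host override)
hq server start --host="$(hostname -f)"
```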
First, thanks for the nice tool! We use it with aiida-hyperqueue. I am not sure whether it is appropriate to paste the whole error trace into the issue; I followed the instructions from the error message. Let me know if I need to provide more for debugging.
HyperQueue version: v0.18.0