cybergis / cybergis-compute-python-sdk

https://cybergis.github.io/cybergis-compute-python-sdk
Apache License 2.0

add specification for allocation and partition #88

Open JTSIV1 opened 6 months ago

JTSIV1 commented 6 months ago

Proposed Changes

Dependencies

alexandermichels commented 1 month ago

If the allocation/partition is invalid (e.g. a typo), the HPC rejects the job submission, but this error doesn't reach the user. Can we fix this?

Example from the Core logs:

hello_world_half_hour not out of date, skipping update
hello_world_half_hour not out of date, skipping update
Assertion failed: 'Variable is undefined/null when it should not be. Assertion at Error, SingularityConnector.prepare: 78'
Assertion failed: 'Variable is undefined/null when it should not be. Assertion at Error, SingularityConnector.prepare: 79'
1726252408LZLIw: [event] SLURM_UPLOAD_EXECUTABLE uploading executable folder
1726252408LZLIw: [event] SSH_UNZIP unzipping /anvil/scratch/x-cybergis/compute/cache/hello_world_half_hour.zip to /anvil/scratch/x-cybergis/compute/1726252420QkBrQ
1726252408LZLIw: [event] SLURM_CREATE_RESULT create result folder
1726252408LZLIw: [event] SSH_MKDIR removing /anvil/scratch/x-cybergis/compute/1726252422Sa9P4/slurm_log
1726252408LZLIw: [event] SSH_SCP_UPLOAD put file from /job_supervisor/data/tmp/tmp-dywyhaco3v to /anvil/scratch/x-cybergis/compute/1726252420QkBrQ/job.sbatch
1726252408LZLIw: [event] SSH_CREATE_FILE create file to /anvil/scratch/x-cybergis/compute/1726252420QkBrQ/job.json
1726252408LZLIw: [event] SSH_SCP_UPLOAD put file from /job_supervisor/data/tmp/tmp-n6hi7udypb to /anvil/scratch/x-cybergis/compute/1726252420QkBrQ/job.json
1726252408LZLIw: [event] SLURM_SUBMIT submitting slurm job
1726252408LZLIw: [event] SLURM_SUBMIT_ERROR cannot submit job 1726252408LZLIw: {"stdout":null,"stderr":"sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n"}
1726252408LZLIw: [event] JOB_RETRY job [1726252408LZLIw] encountered system error ConnectorError: cannot submit job 1726252408LZLIw: {"stdout":null,"stderr":"sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified\n"}
1726252408LZLIw: [event] JOB_FAILED initialization counter exceeds 3 counts
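
One possible way to get this error in front of the user is to pull the `stderr` text out of the `SLURM_SUBMIT_ERROR` event and show it as-is. The sketch below is only an illustration written against the log line format above; `extract_submit_error`, `is_user_fixable`, and `NON_RETRYABLE_MARKERS` are hypothetical names, not part of the SDK or Core, and it assumes the event text is available to the client as a plain string.

```python
import json
import re
from typing import Optional

# Matches the SLURM_SUBMIT_ERROR event line format seen in the Core log above.
_SUBMIT_ERROR = re.compile(
    r"^(?P<job_id>\S+): \[event\] SLURM_SUBMIT_ERROR "
    r"cannot submit job \S+: (?P<payload>\{.*\})\s*$"
)

# Hypothetical list of sbatch messages that a retry cannot fix,
# because the user has to correct the allocation/partition first.
NON_RETRYABLE_MARKERS = (
    "Invalid account or account/partition combination specified",
)


def extract_submit_error(log_line: str) -> Optional[str]:
    """Return the sbatch stderr text from a SLURM_SUBMIT_ERROR line, else None."""
    match = _SUBMIT_ERROR.match(log_line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group("payload"))
    except json.JSONDecodeError:
        return None
    stderr = payload.get("stderr") or ""
    return stderr.strip() or None


def is_user_fixable(message: str) -> bool:
    """True when the message matches a known bad allocation/partition error."""
    return any(marker in message for marker in NON_RETRYABLE_MARKERS)
```

With something like this, a submit error that `is_user_fixable` could be reported immediately instead of being retried until the initialization counter runs out.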

alexandermichels commented 1 month ago

Can we also add a tooltip or disclaimer near the allocation/partition boxes saying that most users should leave them blank, and that default options are used when they are empty?
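
A rough sketch of the "leave blank to use defaults" behaviour, in plain Python. Everything here is an assumption for illustration: the help text wording, the `slurm_overrides` helper, and the `allocation`/`partition` key names are hypothetical and not taken from the SDK. The idea is that blank boxes send nothing, so whatever defaults exist on the server side apply.

```python
from typing import Dict, Optional

# Hypothetical wording for the tooltip/disclaimer requested above.
ALLOCATION_PARTITION_HELP = (
    "Most users should leave allocation and partition blank; "
    "default options are used when they are empty."
)


def slurm_overrides(allocation: Optional[str] = None,
                    partition: Optional[str] = None) -> Dict[str, str]:
    """Return only the fields the user actually filled in.

    Blank or whitespace-only values are dropped so that no override is sent
    and the defaults apply. The key names are assumptions.
    """
    overrides: Dict[str, str] = {}
    for key, value in (("allocation", allocation), ("partition", partition)):
        cleaned = (value or "").strip()
        if cleaned:
            overrides[key] = cleaned
    return overrides
```

Stripping whitespace before the check also catches the case where a user accidentally types a space into one of the boxes.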