Another bug I just ran into: it occurs when a job is submitted that contains a single task which sets `MinCores` but not `MaxCores` (e.g. a task specification along the lines of the sketch below). Creating the job works, but submitting it on a Slurm cluster via the `SubmitJob` endpoint results in an error response `500: Problem` with the message "Problem occured! Contact the administrators."
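For concreteness, the triggering input looks roughly like this. This is a hypothetical reconstruction, not my actual request: only `MinCores` and `MaxCores` are taken from the report above, and the remaining property names are illustrative rather than the exact HEAppE request schema.

```csharp
// Hypothetical sketch of the triggering task specification (illustrative
// property names, not the exact HEAppE request schema): MinCores is set,
// MaxCores is left unset.
var task = new
{
    Name = "example-task",   // illustrative
    MinCores = 1,            // set in the request
    MaxCores = (int?)null,   // omitted in the actual request
};
```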
Checking the API logs shows:
```
INFO 2024-06-03 16:04:26 HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <username> is submitting the job with info Id 117
ERROR 2024-06-03 16:04:26 HEAppE.RestApi.ExceptionMiddleware - Sequence contains no elements
System.InvalidOperationException: Sequence contains no elements
at System.Linq.ThrowHelper.ThrowNoElementsException()
at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.PrepareNameOfNodes(ICollection`1 requestedNodeGroups, Int32 nodeCount) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 353
at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.SetRequestedResourceNumber(IEnumerable`1 requestedNodeGroups, ICollection`1 requiredNodes, String placementPolicy, IEnumerable`1 paralizationSpecs, Int32 minCores, Int32 maxCores, Int32 coresPerNode) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 262
at HEAppE.HpcConnectionFramework.SchedulerAdapters.SchedulerDataConvertor.ConvertTaskSpecificationToTask(JobSpecification jobSpecification, TaskSpecification taskSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/SchedulerDataConvertor.cs:line 91
at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmDataConvertor.ConvertJobSpecificationToJob(JobSpecification jobSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmDataConvertor.cs:line 185
at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmSchedulerAdapter.SubmitJob(Object connectorClient, JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmSchedulerAdapter.cs:line 71
at HEAppE.HpcConnectionFramework.SchedulerAdapters.RexSchedulerWrapper.SubmitJob(JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/RexSchedulerWrapper.cs:line 61
at HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic.SubmitJob(Int64 createdJobInfoId, AdaptorUser loggedUser) in /src/BusinessLogicTier/Logic/JobManagement/JobManagementLogic.cs:line 133
at HEAppE.ServiceTier.JobManagement.JobManagementService.SubmitJob(Int64 createdJobInfoId, String sessionCode) in /src/ServiceTier/JobManagement/JobManagementService.cs:line 49
at HEAppE.RestApi.Controllers.JobManagementController.SubmitJob(SubmitJobModel model) in /src/RestApi/Controllers/JobManagementController.cs:line 82
at lambda_method2342(Closure, Object, Object[])
at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.SyncActionResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeActionMethodAsync()
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeNextActionFilterAsync()
--- End of stack trace from previous location ---
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
--- End of stack trace from previous location ---
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextResourceFilter>g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeFilterPipelineAsync()
--- End of stack trace from previous location ---
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
at HEAppE.RestApi.ExceptionMiddleware.InvokeAsync(HttpContext context) in /src/RestApi/ExceptionMiddleware.cs:line 72
```
It appears that at `SlurmTaskAdapter.SetRequestedResourceNumber` lines 258/259 this results in a `nodeCount` of `0`, which causes the conditional `if (requestedNodeGroups?.Count == nodeCount)` in `SlurmTaskAdapter.PrepareNameOfNodes` to evaluate to `true` for an empty list of `requestedNodeGroups`, which in turn causes the exception when trying to access the `First` item of that collection at line 353.
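To illustrate the failure mode, here is a minimal, self-contained paraphrase of that path. The exact `nodeCount` derivation is an assumption (all that matters is that an unset `maxCores` makes it 0); the conditional is the one quoted above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NodeCountRepro
{
    static void Main()
    {
        // MaxCores was omitted from the request; assumption: the adapter
        // sees a default of 0 rather than a validated value.
        int maxCores = 0;
        int coresPerNode = 128;   // illustrative cluster value

        // Paraphrase of the nodeCount derivation around lines 258/259:
        // ceil(0 / 128) == 0.
        int nodeCount = (int)Math.Ceiling(maxCores / (double)coresPerNode);

        // No node groups were requested, so the collection is empty.
        var requestedNodeGroups = new List<string>();

        // Paraphrase of PrepareNameOfNodes (line 353): Count == 0 == nodeCount,
        // so the branch is taken and First() throws
        // System.InvalidOperationException: Sequence contains no elements.
        if (requestedNodeGroups?.Count == nodeCount)
        {
            Console.WriteLine(requestedNodeGroups.First());
        }
    }
}
```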
The following fixes would be advisable (a sketch of both options follows this list):

- `SlurmTaskAdapter.SetRequestedResourceNumber` should reject a `nodeCount` argument value of 0.
- The computation of `nodeCount` in `SlurmTaskAdapter.SetRequestedResourceNumber` should be robust to `maxCores` not being set. This probably means a better default value has to be provided by a calling method (maybe `maxCores = minCores` in this case), OR task creation should raise a validation error if `MaxCores` is not set in the request.
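A minimal sketch of what the two options could look like, assuming signatures close to the ones in the stack trace; the bodies are illustrative, not the actual HEAppE implementation:

```csharp
using System;
using System.Collections.Generic;

static class SuggestedGuards
{
    // Option 1: fail fast with a clear message instead of letting First()
    // throw "Sequence contains no elements" later.
    public static string PrepareNameOfNodes(ICollection<string> requestedNodeGroups, int nodeCount)
    {
        if (nodeCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(nodeCount),
                "nodeCount must be positive; check MinCores/MaxCores in the task specification.");
        }
        // ... existing node-name logic would follow here ...
        return string.Join(",", requestedNodeGroups);
    }

    // Option 2: default maxCores to minCores when it is not set, so the
    // derived nodeCount can never silently become 0.
    public static int ResolveMaxCores(int minCores, int maxCores)
    {
        return maxCores > 0 ? maxCores : minCores;
    }
}
```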
Finally, I don't really understand what `MinCores` and `MaxCores` actually relate to in the request. They suggest that the job can get some variable number of cores between these limits, but why? Is that based on what resources are available on the cluster at the time of submission? This does not appear to be how HPC schedulers usually work, so I find it a bit confusing. There should probably also be a list of which arguments are required and which are optional; this isn't entirely clear to me at the moment. In particular, it is unclear what `LogFile` and `ProgressFile` are for, how they differ from `StandardOutputFile`, and why they are apparently mandatory, since they don't seem to be actually created while the job runs.