It4innovations / HEAppE

GNU General Public License v3.0

Internal server error when submitting a job with a task that doesn't set `MaxCores` #3

Open lupreCSC opened 5 months ago

lupreCSC commented 5 months ago

Another bug I just ran into: it occurs when a submitted job contains a single task that sets MinCores but not MaxCores, e.g.,

{
    "Name": "heappe_job",
    "ClusterId": 1,
    "ProjectId": 1,
    "FileTransferMethodId": 1,
    "Tasks": [
        {
            "Name": "task_1",
            "MinCores": 1,
            "WalltimeLimit": 600,
            "Priority": 4,
            "StandardOutputFile": "stdout.log",
            "StandardErrorFile": "stderr.log",
            "LogFile": "stdlog",
            "ProgressFile": "progress",
            "ClusterNodeTypeId": 1,
            "CommandTemplateId": 1,
            "TemplateParameterValue": [
                {
                    "CommandParameterIdentifier": "inputParam",
                    "ParameterValue": "testValue"
                }
            ]
        }
    ]
}

Creating the job works, but submitting it on a Slurm cluster via the SubmitJob endpoint results in a 500 error response: "Problem Problem occured! Contact the administrators."
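
Until this is handled on the server side, a client could check a job specification for the condition described above before calling SubmitJob. The following is only a minimal sketch and not part of HEAppE: the function name `tasks_missing_max_cores` is hypothetical, and the only thing it relies on is the field names shown in the request above.

```python
def tasks_missing_max_cores(job_spec):
    """Return the names of tasks that set MinCores but not MaxCores."""
    missing = []
    for task in job_spec.get("Tasks", []):
        # A task is flagged when MinCores is present and MaxCores is absent or null.
        if "MinCores" in task and task.get("MaxCores") is None:
            missing.append(task.get("Name"))
    return missing


# Trimmed-down version of the job specification from this report.
job_spec = {
    "Name": "heappe_job",
    "Tasks": [{"Name": "task_1", "MinCores": 1}],
}
print(tasks_missing_max_cores(job_spec))  # prints ['task_1']
```

A non-empty result means the job would hit the code path reported here, so the caller can set MaxCores explicitly before submitting.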

Checking the API logs shows:

INFO  2024-06-03 16:04:26 HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic - User <username> is submitting the job with info Id 117 
ERROR 2024-06-03 16:04:26 HEAppE.RestApi.ExceptionMiddleware - Sequence contains no elements 
System.InvalidOperationException: Sequence contains no elements
   at System.Linq.ThrowHelper.ThrowNoElementsException()
   at System.Linq.Enumerable.First[TSource](IEnumerable`1 source)
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.PrepareNameOfNodes(ICollection`1 requestedNodeGroups, Int32 nodeCount) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 353
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.ConversionAdapter.SlurmTaskAdapter.SetRequestedResourceNumber(IEnumerable`1 requestedNodeGroups, ICollection`1 requiredNodes, String placementPolicy, IEnumerable`1 paralizationSpecs, Int32 minCores, Int32 maxCores, Int32 coresPerNode) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/ConversionAdapter/SlurmTaskAdapter.cs:line 262
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.SchedulerDataConvertor.ConvertTaskSpecificationToTask(JobSpecification jobSpecification, TaskSpecification taskSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/SchedulerDataConvertor.cs:line 91
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmDataConvertor.ConvertJobSpecificationToJob(JobSpecification jobSpecification, Object schedulerAllocationCmd) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmDataConvertor.cs:line 185
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.Slurm.Generic.SlurmSchedulerAdapter.SubmitJob(Object connectorClient, JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/Slurm/Generic/SlurmSchedulerAdapter.cs:line 71
   at HEAppE.HpcConnectionFramework.SchedulerAdapters.RexSchedulerWrapper.SubmitJob(JobSpecification jobSpecification, ClusterAuthenticationCredentials credentials) in /src/HpcConnectionFramework/SchedulerAdapters/RexSchedulerWrapper.cs:line 61
   at HEAppE.BusinessLogicTier.Logic.JobManagement.JobManagementLogic.SubmitJob(Int64 createdJobInfoId, AdaptorUser loggedUser) in /src/BusinessLogicTier/Logic/JobManagement/JobManagementLogic.cs:line 133
   at HEAppE.ServiceTier.JobManagement.JobManagementService.SubmitJob(Int64 createdJobInfoId, String sessionCode) in /src/ServiceTier/JobManagement/JobManagementService.cs:line 49
   at HEAppE.RestApi.Controllers.JobManagementController.SubmitJob(SubmitJobModel model) in /src/RestApi/Controllers/JobManagementController.cs:line 82
   at lambda_method2342(Closure, Object, Object[])
   at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.SyncActionResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeActionMethodAsync()
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeNextActionFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.InvokeInnerFilterAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeNextResourceFilter>g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeFilterPipelineAsync()
--- End of stack trace from previous location ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
   at HEAppE.RestApi.ExceptionMiddleware.InvokeAsync(HttpContext context) in /src/RestApi/ExceptionMiddleware.cs:line 72

It appears that in SlurmTaskAdapter.SetRequestedResourceNumber (lines 258/259) this results in a nodeCount of 0. That causes the conditional if (requestedNodeGroups?.Count == nodeCount) in SlurmTaskAdapter.PrepareNameOfNodes to evaluate to true for an empty list of requestedNodeGroups, which in turn causes the exception when the code tries to access the First item of that collection in line 353.
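
The failure mode can be modelled outside HEAppE. The sketch below is a Python stand-in, not the actual C# code. It rests on several assumptions: the formula in `node_count` is a guess (the issue does not show what lines 258/259 compute), an unset MaxCores is modelled as 0, the value 36 for cores per node is arbitrary, and the guarded variant is just one possible way to fail early with a clearer message.

```python
import math


def node_count(max_cores, cores_per_node):
    # ASSUMPTION: the real computation is not shown in the issue;
    # rounding up max_cores / cores_per_node is used for illustration only.
    return math.ceil(max_cores / cores_per_node)


def prepare_name_of_nodes(requested_node_groups, count):
    # Mirrors `requestedNodeGroups?.Count == nodeCount`:
    # a null collection never matches, but an empty one matches count == 0.
    if requested_node_groups is not None and len(requested_node_groups) == count:
        # Analogue of LINQ First(): raises on an empty list.
        return requested_node_groups[0]
    return None  # other branches of the real method are omitted


def prepare_name_of_nodes_guarded(requested_node_groups, count):
    # Hypothetical guard: reject a non-positive node count up front.
    if count <= 0:
        raise ValueError("node count must be positive; is MaxCores set?")
    return prepare_name_of_nodes(requested_node_groups, count)


count = node_count(max_cores=0, cores_per_node=36)  # MaxCores unset, modelled as 0

try:
    prepare_name_of_nodes([], count)
except IndexError as exc:
    print("unguarded:", type(exc).__name__)

try:
    prepare_name_of_nodes_guarded([], count)
except ValueError as exc:
    print("guarded:", exc)
```

In this model the unguarded call fails on the element access, the same way First() does on an empty sequence, while the guarded call reports the underlying cause.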

The following fixes would be advisable:

Finally, I don't really understand what MinCores and MaxCores actually relate to in the request. They suggest that the job can get a variable number of cores between these limits, but why? Is that based on which resources are available on the cluster at the time of submission? This does not appear to be how HPC schedulers usually work, so I find it a bit confusing.

There should also probably be a list of which arguments are required and which are optional, as this isn't entirely clear to me at the moment. In particular, it is unclear to me how LogFile and ProgressFile differ from StandardOutputFile and why they are apparently mandatory, as they don't seem to be actually created when the job runs.

jkonvicka commented 2 months ago

Hi @lupreCSC, we are looking into it.