Jobs that condor places on hold because of errors need to be reported to the client through the status message or email. Two options are available: (1) let the user cancel the remaining jobs and return with the logs (for a power user, or when debug is enabled); or (2) remove the held jobs from condor, include a descriptive file about them in the result, and proceed with the other jobs.
Also, jobs that are managed directly through condor (held, stopped, etc.) don't propagate their status to the DB, so EMS keeps querying them over and over, which causes a performance issue. Unless the DB is cleared periodically, this leads to condor_history using a lot of CPU. Whatever solution is chosen for the issue above should avoid this one as well. This particular issue could be fixed by modifying the process_once function (https://github.com/GRAPLE/GWS/blob/master/ems.py#L140) to also account for held jobs, for example by introducing a new experiment status such as 'held' or 'error'.
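
A rough sketch of how held jobs might be folded into the experiment status is shown here. It is not taken from ems.py: the names `classify_experiment`, `should_poll`, and `TERMINAL_STATUSES` are hypothetical, and the job ads are assumed to arrive as plain dicts carrying the standard condor ClassAd attributes `ClusterId`, `ProcId`, `JobStatus`, and `HoldReason`. The idea is that once an experiment is marked 'held' or 'error' in the DB, EMS can stop polling it.

```python
# HTCondor JobStatus codes (standard ClassAd values).
IDLE, RUNNING, REMOVED, COMPLETED, HELD, TRANSFERRING_OUTPUT, SUSPENDED = range(1, 8)

# Hypothetical set of experiment statuses that EMS would no longer poll.
TERMINAL_STATUSES = {"completed", "held", "error"}


def classify_experiment(job_ads):
    """Map the condor job ads of one experiment to an experiment status.

    job_ads is assumed to be a list of dicts, one per job.
    Returns (status, held_reasons), where held_reasons is a list of
    human-readable strings that could go into the status message or
    into a descriptive file in the result.
    """
    held_reasons = [
        "{}.{}: {}".format(
            ad.get("ClusterId"), ad.get("ProcId"), ad.get("HoldReason", "unknown")
        )
        for ad in job_ads
        if ad.get("JobStatus") == HELD
    ]
    if held_reasons:
        return "held", held_reasons
    if any(ad.get("JobStatus") == REMOVED for ad in job_ads):
        return "error", []
    if job_ads and all(ad.get("JobStatus") == COMPLETED for ad in job_ads):
        return "completed", []
    return "running", []


def should_poll(status):
    """Return True if EMS should keep querying condor for this experiment."""
    return status not in TERMINAL_STATUSES
```

In process_once, the returned status would be written to the DB, and experiments for which `should_poll` returns False would be skipped on later passes, which should keep the condor_history load from growing.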