dmwm / CRABServer


add exit code 10040 (and others) to list of recoverable errors #8593

Open belforte opened 3 months ago

belforte commented 3 months ago

Currently we only retry this: https://github.com/dmwm/CRABServer/blob/6852e951c6c661e7290ea58ed9cb03ec69c56ace/src/python/TaskWorker/Actions/RetryJob.py#L381-L384. But the twiki [1] lists many other exit codes which signal a very high likelihood of a site error and are worth retrying,

even if some of those are now obsolete (i.e. we no longer have code in our JobWrapper which could raise them).

[1] possible site-related errors are marked with the attention sign: [image]

reference: https://cms-talk.web.cern.ch/t/strange-error-on-crab-site-misconfiguration/45226/3
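A minimal sketch of what "adding 10040 to the list" could look like, assuming the retry logic keeps a collection of exit codes and raises a recoverable exception on a match. The names `RecoverableError`, `SITE_ERROR_EXIT_CODES` and `check_exit_code` are hypothetical and chosen for illustration; they are not taken from the linked `RetryJob.py` lines.

```python
# Illustrative only: all names here are assumptions, not the actual RetryJob.py code.

class RecoverableError(Exception):
    """Hypothetical exception meaning 'this job is worth resubmitting'."""


# Exit codes treated as likely site problems. Only 10040 comes from the issue
# title; further codes from the twiki would be appended here.
SITE_ERROR_EXIT_CODES = frozenset({10040})


def check_exit_code(exit_code):
    """Raise RecoverableError if exit_code points to a likely site error."""
    if exit_code in SITE_ERROR_EXIT_CODES:
        raise RecoverableError(
            f"Job failed with exit code {exit_code}: likely site error, retrying."
        )
```

Keeping the codes in one set means that adding another recoverable code is a one-line change.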

aspiringmind-code commented 3 months ago

@belforte Should I have a separate message for each of the site errors, or can I group some of them? For example, exit codes 60315, 60321 and 60311 could share the same error message: "Site Error:Stage-out related troubles".
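The grouping asked about here could be sketched as a lookup table in which several exit codes point to one shared message. This is only an illustration of the idea: `STAGEOUT_MESSAGE`, `SITE_ERROR_MESSAGES` and `site_error_message` are hypothetical names, and the message text is the one proposed in this comment.

```python
# Illustrative only: names are assumptions, not existing CRABServer code.

# Message text as proposed in the comment above.
STAGEOUT_MESSAGE = "Site Error:Stage-out related troubles"

# Several exit codes share one message; other groups could be merged in later.
SITE_ERROR_MESSAGES = {code: STAGEOUT_MESSAGE for code in (60311, 60315, 60321)}


def site_error_message(exit_code):
    """Return the shared message for a known site-error code, else None."""
    return SITE_ERROR_MESSAGES.get(exit_code)
```

A caller would treat a `None` result as "not a known site error" and fall through to the existing handling.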

belforte commented 3 months ago

For the time being you should simply add this to the list and put it at lower priority than the other things.

I need to check where we use that message!
