Open mapellidario opened 1 week ago
Thanks Dario. I do not think it is worth investing time in making the script rock solid. It will be enough to capture the "how to debug" info in the documentation. I was worried that it could be due to the new container. But a failure like this will be easy to debug if it happens again. I'd rather have it fail badly when `tasks` is empty and then figure out why it was.
> It will be enough to capture the "how to debug" info in the documentation

it's already there :) [1]
No, it does not seem to be caused by the new container, which I have not deployed to crab-prod-tw02 yet.
So, shall we keep this issue open or do you want to close it and open a new one if and when it fails again?
[1] https://cmscrab.docs.cern.ch/technical/crab-monitoring/crab-crontabs.html#how-to-debug
hmm.. I did not get the mail. But I can't review the doc in detail now. Let's leave this open for a while, in case it happens again and we can find details. I am putting this on hold until we can reproduce it :-)
> hmm.. I did not get the mail.

yeah, that's weird as well, I have no idea why
Yesterday for a few hours the monitoring script that checks tape recalls failed multiple times [1]. It started working again without any human intervention.
Every time a monitoring script fails, the stdout is saved in a file [2].
Yesterday, the script failed at the line https://github.com/dmwm/CRABServer/blob/bfdeedbebd1e34688eca0e88991a080debd83fb4/scripts/Utils/CheckTapeRecall.py#L194 likely because `x.tasks[-1]` was empty, see [3]. I have not checked what the cause could have been; maybe we can improve the script so that it prints a better message.
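A rough sketch of what "a better message" could look like, for illustration only. It still fails loudly when there are no tasks, but the error says which rule was affected. The names `last_task` and `rule_id` are hypothetical and not taken from `CheckTapeRecall.py`; the real code at that line may be structured differently.

```python
def last_task(tasks, rule_id):
    """Return the last task attached to a rule.

    Hypothetical helper: it still raises when no task is found,
    but the message names the rule so the saved stdout is easier to read.
    """
    # Assumption: `tasks` is a list-like of task names (or None when nothing matched).
    if tasks is None or len(tasks) == 0:
        raise ValueError(
            f"no tasks found for rule {rule_id}; "
            "cannot pick the most recent task"
        )
    return tasks[-1]
```

It could then be used in place of the bare `x.tasks[-1]` access, for example `last_task(x.tasks, x.id)`, assuming the row carries some rule identifier.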
[1]
[2]
[3]
cat /tmp/monit-070557ac-2eaf-417a-93b5-1214956f2a51.txt
```plaintext
[crab3@crab-prod-tw02 ~]$ cat /tmp/monit-070557ac-2eaf-417a-93b5-1214956f2a51.txt
ln: failed to create symbolic link '/data/srv/monit/logs/logs': File exists
281 rules exist for account: crab_tape_recall
state
OK             255
REPLICATING      3
STUCK           23
dtype: int64
finding tape source for all pending rules (takes some time...)
Add (DBS) dataset name ...
... and size
Done!
find tasks using these rules
Done!
Traceback (most recent call last):
  File "/data/srv/monit/CheckTapeRecall.py", line 440, in