Now that a task can have more than one DAG, long-lived tasks can easily reach the point where a subDAG finished more than 48 hours ago while the task itself is still running. And since the link between TP and DAG is now made through CRAB_Reqname instead of clusterid, as it was before, those DAGs immediately fall under this restriction and get killed.
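To make the failure mode concrete, here is a minimal sketch of the matching logic as I understand it. It is not the actual script: the record layout, the `finished_at` field, the `flagged_tps` helper and the sample values are all made up for illustration. Only the 48-hour limit and the two matching keys (clusterid and CRAB_Reqname) come from the description above.

```python
from datetime import datetime, timedelta

# The 48-hour limit mentioned above; everything else here is a hypothetical layout.
MAX_AGE = timedelta(hours=48)


def flagged_tps(dags, tps, key, now):
    """Return ids of TPs matched (via `key`) to a DAG that finished more than MAX_AGE ago."""
    stale = {
        dag[key]
        for dag in dags
        if dag["finished_at"] is not None and now - dag["finished_at"] > MAX_AGE
    }
    return [tp["id"] for tp in tps if tp[key] in stale]


now = datetime(2018, 6, 20, 12, 0)

# One long-lived task with two DAGs (illustrative values only).
dags = [
    # Main DAG, still running.
    {"clusterid": 100, "CRAB_Reqname": "task_A", "finished_at": None},
    # SubDAG that finished three days ago.
    {"clusterid": 101, "CRAB_Reqname": "task_A", "finished_at": datetime(2018, 6, 17, 12, 0)},
]
tps = [{"id": "tp_A", "clusterid": 100, "CRAB_Reqname": "task_A"}]

# Matching by clusterid ties the TP only to the DAG that is still running.
by_clusterid = flagged_tps(dags, tps, "clusterid", now)
# Matching by CRAB_Reqname also ties it to the finished subDAG.
by_reqname = flagged_tps(dags, tps, "CRAB_Reqname", now)

print(by_clusterid, by_reqname)
```

With these sample records the TP is left alone when matched by clusterid but is flagged when matched by CRAB_Reqname, even though the task is still running.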
As an example, the number of killed TPs per schedd since 10.06.2018 is in [1]. The fact that we were seeing those few tasks killed by the script was exactly the reason we left it running - just in case.
Maybe a quick review of the PRs for the automatic splitting (just as a reminder of whether we fully completed the issue) and completely disabling the script will be sufficient.