dmwm / CRABServer


timed out FTS transfers are left in status SUBMITTED in transfersdb #6572

Closed belforte closed 3 years ago

belforte commented 3 years ago

I thought we had sorted out all things in this area years ago. But investigating the current FTS mess I found that when PostJob hits the 24h timeout it fails to set the status to KILLED in transfersdb:

Wed, 28 Apr 2021 11:30:42 CEST(+0000):INFO:PostJob ====== Starting to cancel ongoing ASO transfers.
Wed, 28 Apr 2021 11:30:42 CEST(+0000):INFO:PostJob In case of cancellation failure, will retry up to 2 times.
Wed, 28 Apr 2021 11:30:42 CEST(+0000):INFO:PostJob Cancelling ASO transfer 69a29bfdb402c1d204f1d38850c97c7b2327caee1ad9c3cb90e9c281 with following reason: Cancelled ASO transfer after timeout of 86400 seconds.
Wed, 28 Apr 2021 11:30:42 CEST(+0000):ERROR:PostJob Failed to cancel 1 ASO transfers: 69a29bfdb402c1d204f1d38850c97c7b2327caee1ad9c3cb90e9c281
Wed, 28 Apr 2021 11:30:42 CEST(+0000):INFO:PostJob ====== Finished to cancel ongoing ASO transfers.
Wed, 28 Apr 2021 11:30:42 CEST(+0000):INFO:PostJob ====== Finished to monitor ASO transfers.
Wed, 28 Apr 2021 11:30:42 CEST(+0000):ERROR:PostJob Got fatal stageout exception:
Stageout failed with code 2.
Post-job timed out waiting for ASO transfers to complete.
Attempts were made to cancel the ongoing transfers, but cancellation failed for some transfers.
Considering cancellation failures as a permament stageout error.

Digging in, this is "as designed" due to https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/src/python/CRABInterface/RESTFileUserTransfers.py#L169-L185 since it only kills transfers in NEW, i.e. those for which an FTS job has not been submitted yet.
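Schematically, the behavior at those lines is the following (this is only an illustration of the logic, not the actual CRABServer code; the function and field names below are made up):

```python
# Illustrative sketch only, NOT the actual RESTFileUserTransfers code:
# a kill request flips a document to KILLED only while it is still NEW,
# so anything already handed to FTS (SUBMITTED) is silently left alone.
def kill_transfers(documents, ids_to_kill):
    killed = []
    for doc in documents:
        if doc["id"] in ids_to_kill and doc["transfer_state"] == "NEW":
            doc["transfer_state"] = "KILLED"
            killed.append(doc["id"])
        # documents in SUBMITTED are skipped, which is why PostJob's
        # cancellation after the 24h timeout fails for them
    return killed
```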

This is likely the reason for #6441 and #5891

A thorough understanding is (again) needed before changing anything, and we need to document things better. In particular I am wondering why transfer jobs get stuck for one day inside FTS, given that we submit with https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/scripts/task_process/FTS_Transfers.py#L291 and I clearly remember that when discussing this long ago with @dciangot we came to the conclusion that there is no need to kill FTS transfers since they will be gone by themselves before the 24h timeout hits. Maybe they changed something in FTS :-( Here's an example of a single-file FTS transfer job which is still in SUBMITTED after 6900 sec, in spite of the 600 sec timeout indicated in the submission:

[Screenshot from 2021-04-28 16-05-49: FTS Web Monitoring page for the job]

belforte commented 3 years ago

AAAARGHHHHH !!! From https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/easy/submit.html:

[Screenshot from 2021-04-28 16-10-34: the submit documentation for max_time_in_queue]

The timeout indicated in https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/scripts/task_process/FTS_Transfers.py#L291 is 600 hours !!!!! No wonder it is not working!
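For reference, a minimal sketch of the submission call with the fts3 easy bindings (assuming the fts-rest Python client; the endpoint and file URLs are placeholders, and the unit behavior is the one described further down by the FTS experts):

```python
# Minimal sketch, assuming the fts-rest easy bindings; endpoint and URLs are placeholders.
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3-cms.cern.ch:8446")  # credentials taken from the user proxy
transfer = fts3.new_transfer("davs://source.example/store/user/file.root",
                             "davs://dest.example/store/user/file.root")

# WRONG for our purpose: a bare integer is interpreted as HOURS,
# so passing 600 (intended as seconds) asks for 600 hours in the queue.
job = fts3.new_job([transfer], max_time_in_queue=600)

# Saner: 6 hours, as restored by the v3.210318patch1 hot fix
# (a unit suffix 's'/'m'/'h' can also be used to make the unit explicit).
job = fts3.new_job([transfer], max_time_in_queue=6)

job_id = fts3.submit(context, job)
```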

belforte commented 3 years ago

In the meanwhile I have prepared a hot fix for the production TW to set the FTS timeout back to a saner 6h: https://github.com/dmwm/CRABServer/releases/tag/v3.210318patch1

see https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2775059

belforte commented 3 years ago

I am very unhappy that I just created a hot fix to revert a hot fix done 3 years ago by @dciangot: https://github.com/dmwm/CRABServer/commit/33e5a19cf9af41d12956ee3be3914a66204a6123. But after reviewing the meeting minutes of the time, it looks like a review of the FTS submission parameters was postponed for a couple of weeks and then disappeared from the minutes. Either it was not done, or it was not documented. So I have no way to figure out what the reasoning was at the time.

belforte commented 3 years ago

keeping it open until a good understanding of how FTS works is reached. REF: https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329

belforte commented 3 years ago

Here is the description from the FTS experts, from https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329

Hello Stefano,

I believe we can fine-tune the CRAB submissions. I'll do my best to answer these questions.

  1. Jobs follow the following state machine: https://fts3-docs.web.cern.ch/fts3-docs/docs/state_machine.html

For the most part, jobs can be viewed as a collection of transfers. Some parameters are applied to each file in particular (e.g.: user-given filesize, user-given checksum, activity, file metadata), but most parameters apply job-wide (and to all files within that job), such as retry, retry delay, ipv4/ipv6 preference, bringonline timeout, overwrite flag, max_time_in_queue, etc.
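As an illustration of that split with the fts3 easy bindings (a sketch; URLs, metadata and the activity label are placeholders, and the activity keyword assumes a recent enough client):

```python
import fts3.rest.client.easy as fts3

# Per-file parameters live on the transfer itself:
# user-given checksum, filesize, activity, file metadata.
transfer = fts3.new_transfer(
    "davs://source.example/store/user/file.root",
    "davs://dest.example/store/user/file.root",
    checksum="ADLER32:12345678",
    filesize=1048576,
    metadata={"docId": "abc123"},   # placeholder metadata
    activity="Analysis",            # placeholder activity label
)

# Job-wide parameters apply to every transfer in the job:
# retry, retry delay, overwrite flag, priority, max_time_in_queue, ...
job = fts3.new_job(
    [transfer],
    overwrite=True,
    retry=3,
    retry_delay=600,
    priority=3,
)
```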

Jobs can be one of 4 types; in the FTS Web Monitoring page, there is a field showing the job type (one of N, H, R or Y).

Regarding the normal job type (which happens for the most part), scheduling is done on a per transfer basis (with the job status being more for informative purposes).

1a. A job's status moves from "in-queue" to something else when the job state moves from SUBMITTED to READY or ACTIVE. A job is finished when it reaches a terminal state: FINISHED, FAILED, FINISHEDDIRTY or CANCELED. For a job to reach a terminal state, all its transfers must be in a terminal state. A transfer may reach one of the following terminal states: FINISHED, FAILED, CANCELED.

1b. The max_time_in_queue is a timeout for how much the job can stay in a SUBMITTED, ACTIVE or STAGING state. Behind the scenes, the FTS-REST parses the max_time_in_queue and assigns a termination_timestamp to that job. There is a canceler service which scans the database for jobs in SUBMITTED/ ACTIVE / STAGING state with termination_timestamp < NOW(). When such a job is found, the job and all of its transfers are marked as CANCELED.

During submission, the argument is expressed in hours (default, for backwards compatibility). It may also be expressed with a suffix: 's', 'm', 'h' (uppercase or lowercase makes no difference). E.g.: max_time_in_queue = 6 <--> max_time_in_queue = 360m
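So, for a 6-hour queue timeout, the following values should all be equivalent at submission time (just an illustration of the parsing rules described above):

```python
# Equivalent ways to request a 6-hour max_time_in_queue
# (a bare integer defaults to hours; 's'/'m'/'h' suffixes are accepted):
max_time_in_queue = 6         # 6 hours (default unit)
max_time_in_queue = "6h"      # explicit hours
max_time_in_queue = "360m"    # 360 minutes
max_time_in_queue = "21600s"  # 21600 seconds
```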

  2. When a job's max_time_in_queue is reached, every transfer not in a terminal state goes to CANCELED. The job will also be marked as CANCELED.

  3. The best way to get rid of stuck transfers is to assign them an accurate max_time_in_queue. Given that ASO data is "transient", this is a good use case. Alternatively, there is a global timeout, configured in the FTS server per VO (default 168 hours). If a job assigned to that VO (in one of SUBMITTED / ACTIVE / STAGING state) has passed the global timeout interval from its submission timestamp, it gets canceled by the FTS server.

  4. If you have the job ids, you can send a job cancellation request. You may give the job id (cancel all files) or the job id + a list of file ids (cancel only specific files): https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/api.html#delete-jobsjobidlist

This is a synchronous operation and is executed right away.

I can point you to the fts-rest-transfer-cancel CLI tool as an example: https://gitlab.cern.ch/fts/fts-rest/-/blob/v3.10.1/src/fts3/cli/jobcanceller.py#L50
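With the easy bindings this boils down to something like the following (a sketch; the job and file ids are placeholders, and fts3.cancel is assumed to be available in the client version in use):

```python
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3-cms.cern.ch:8446")  # placeholder endpoint

# Cancel a whole job (all of its files)...
fts3.cancel(context, "0123abcd-0000-1111-2222-333344445555")

# ...or only specific files within the job, by file id.
fts3.cancel(context, "0123abcd-0000-1111-2222-333344445555", file_ids=[1001, 1002])
```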

  5. The number of retries applies to each transfer within that job. A transfer is granted the first execution + number_of_retries. E.g.: retry=3 --> first execution + 3 retries

The max_time_in_queue applies per job, not per retry. Using a timeout of 6h with 3 retries would mean an expected "average wait time" of 1.5 hours per transfer attempt. (Of course, in reality, it doesn't work that way. The transfer will stay in SUBMITTED until scheduled. If max_time_in_queue or the global timeout is reached, then the job gets cancelled.)
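The arithmetic behind that remark, spelled out (nothing beyond the numbers quoted above):

```python
# "Average wait time" per attempt with retry=3 and max_time_in_queue = 6 hours:
retry = 3                       # first execution + 3 retries
attempts = 1 + retry            # 4 attempts in total
max_time_in_queue_hours = 6
print(max_time_in_queue_hours / attempts)   # 1.5 hours per attempt, on average
```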

When a transfer fails but retries are still available, it goes back to the SUBMITTED state. You can start to intuitively see that no particular priority is given to transfers that have had retries.

  6. When using the normal job type, perhaps not so much, as transfer scheduling is done on a per-transfer basis. In this situation, the job only serves as a collection of transfers. However, in organizing your data, it may be preferable to group transfers in jobs as opposed to 1 transfer / job.

Other considerations

I. Reuse

There is the Reuse job type. If you're dealing with many small files between the same source and destination, I would recommend giving this feature a go: https://fts3-docs.web.cern.ch/fts3-docs/docs/session_reuse.html Behind the scenes, this feature leverages the gfal2_copy_bulk() functionality.

During the submission, set the reuse flag to True: https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/easy/submit.html#group-transfers-in-a-job

I suggest we do some test runs for ASO use-case on FTS3-Pilot before using this submission model in CMS production.
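A sketch of such a submission with the easy bindings (endpoints and file names are placeholders; the FTS3-Pilot endpoint reflects the test-run suggestion above):

```python
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3-pilot.cern.ch:8446")  # placeholder: pilot instance for test runs

# Many small files between the same source and destination, grouped in one job
transfers = [
    fts3.new_transfer("gsiftp://source.example/store/user/f%04d.root" % i,
                      "gsiftp://dest.example/store/user/f%04d.root" % i)
    for i in range(100)
]

# reuse=True turns this into a Reuse-type job (one session for all transfers)
job = fts3.new_job(transfers, reuse=True, retry=3, max_time_in_queue="6h")
job_id = fts3.submit(context, job)
```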

II. Priority and activity shares

The FTS Scheduler takes two parameters into account: the transfer priority and the activity shares.

The priority mechanism is rather simple. Each transfer is assigned a priority by the user at submission time. When FTS schedules jobs, it picks the ones matching the highest priority.

The activity shares are more interesting. Activity is a user-assigned label at submission time, when creating a new transfer. FTS will try to schedule jobs in a weighted share, according to the configured activity weights.

Unfortunately, the FTS3-CMS instance uses the same priority = 3 for all jobs and activity shares are not set up.
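Concretely, both knobs are set at submission time (a sketch; the activity label is a placeholder and assumes activity shares have been configured on the server side):

```python
import fts3.rest.client.easy as fts3

# Activity is a per-transfer label; the configured activity shares
# then weight how FTS schedules transfers across labels.
transfer = fts3.new_transfer("davs://source.example/f.root",
                             "davs://dest.example/f.root",
                             activity="ASO")   # placeholder activity name

# Priority is job-wide; FTS3-CMS currently submits everything with priority 3.
job = fts3.new_job([transfer], priority=3)
```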

You may assign a higher priority to ASO jobs, but it may not be seen as fair play. Also, the priority system can lead to starvation, so use with care. (However, as seen 3 weeks ago, a very large queue also leads to starvation.)

I would encourage setting up the Activity Shares system on the FTS3-CMS instance for better scheduling.

Hope this would prove useful (even if a bit overwhelming).

Cheers, Mihai

Erratum: about the reuse, the underlying mechanism is not in fact gfal2_copy_bulk(), but some particular GridFTP flags within Gfal2. This means that Reuse, in its current form, only makes sense for the GridFTP protocol.

belforte commented 3 years ago

In light of the FTS behavior explanations and of https://github.com/dmwm/CRABServer/pull/6573, this condition should not happen anymore. I will open another, less urgent issue about improving FTS use by ASO in light of the explanations from Mihai in https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329