Closed belforte closed 3 years ago
AAAARGHHHHH !!! from https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/easy/submit.html
the timeout indicated in https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/scripts/task_process/FTS_Transfers.py#L291 is 600 hours !!!!! No wonder that it is not working !
in the meanwhile I have prepared a hot fix for production TW to set back FTS timeout to a saner 6h: https://github.com/dmwm/CRABServer/releases/tag/v3.210318patch1
see https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2775059
I am very unhappy that I just created a hot fix to revert a hot fix done 3 years ago by @dciangot https://github.com/dmwm/CRABServer/commit/33e5a19cf9af41d12956ee3be3914a66204a6123 But after reviewing meeting minutes from that time, it looks like a review of the FTS submission parameters was postponed for a couple of weeks and then disappeared from the minutes. Either it was not done, or it was not documented. So I have no way to figure out what the reasoning was at the time.
keeping it open until a good understanding of how FTS works is reached. REF: https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329
here is the description from the FTS experts, from https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329
Hello Stefano,
I believe we can fine-tune the CRAB submissions. I'll do my best to answer these questions.
For the most part, jobs can be viewed as a collection of transfers. Some parameters are applied to each file in particular (e.g.: user-given filesize, user-given checksum, activity, file metadata), but most parameters apply job-wide (and to all files within that job), such as retry, retry delay, ipv4/ipv6 preference, bringonline timeout, overwrite flag, max_time_in_queue, etc.
Jobs can be one of 4 types. In the FTS Web Monitoring page, there is a field showing the job type (one of N, H, R or Y).
Regarding the normal job type (which is what happens for the most part), scheduling is done on a per-transfer basis (with the job status being more for informative purposes).
1a. A job's status moves from "in-queue" to something else when the job state moves from `SUBMITTED` to `READY` or `ACTIVE`.
A job is finished when it reaches a terminal state: `FINISHED`, `FAILED`, `FINISHEDDIRTY` or `CANCELED`.
For a job to reach a terminal state, all its transfers must be in a terminal state.
A transfer may reach one of the following terminal states: `FINISHED`, `FAILED`, `CANCELED`.
1b. The `max_time_in_queue` is a timeout for how long the job can stay in a `SUBMITTED`, `ACTIVE` or `STAGING` state.
Behind the scenes, the FTS-REST parses the `max_time_in_queue` and assigns a termination_timestamp to that job.
There is a canceler service which scans the database for jobs in `SUBMITTED`/`ACTIVE`/`STAGING` state with termination_timestamp < NOW().
When such a job is found, the job and all of its transfers are marked as `CANCELED`.
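A toy model of that canceler pass may make the mechanism concrete (field names are illustrative, not the actual FTS database schema):

```python
from datetime import datetime, timedelta

# States in which a job can still be canceled by the canceler service.
CANCELABLE_STATES = {"SUBMITTED", "ACTIVE", "STAGING"}

def canceler_pass(jobs, now):
    """One sweep of the canceler: cancel any cancelable job whose
    termination_timestamp has passed, together with all its transfers."""
    for job in jobs:
        if job["state"] in CANCELABLE_STATES and job["termination_timestamp"] < now:
            job["state"] = "CANCELED"
            for transfer in job["transfers"]:
                transfer["state"] = "CANCELED"
    return jobs
```

Note how cancellation is job-wide: a single expired termination_timestamp flips every non-terminal transfer in that job.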
During submission, the argument is expressed in hours (the default, for backwards compatibility). It may also be expressed with a suffix: 's', 'm', 'h' (uppercase or lowercase makes no difference). E.g.: max_time_in_queue = 6 <--> max_time_in_queue = 360m
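That hours-by-default parsing is exactly what bit CRAB: a bare `600` means 600 hours, not 600 seconds. A small illustrative helper (not the actual FTS-REST code) shows the rule:

```python
def max_time_in_queue_seconds(value):
    """Convert an FTS max_time_in_queue value to seconds.

    A bare number is interpreted as hours (the FTS default, for
    backwards compatibility); an 's', 'm' or 'h' suffix (either case)
    selects seconds, minutes or hours. Illustrative sketch only.
    """
    value = str(value).strip().lower()
    multipliers = {"s": 1, "m": 60, "h": 3600}
    if value and value[-1] in multipliers:
        return int(value[:-1]) * multipliers[value[-1]]
    return int(value) * 3600  # bare number: hours
```

With this rule, `"600"` parses to 600 hours (2,160,000 seconds), while `"600s"` would have given the 10 minutes presumably intended.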
When a job's max_time_in_queue is reached, every transfer not in a terminal state goes to `CANCELED`. The job will also be marked as `CANCELED`.
The best way to get rid of stuck transfers is to assign them an accurate max_time_in_queue. Given that ASO data is "transient", this is a good use case.
Alternatively, there is a global timeout, configured in the FTS server per VO (default 168 hours).
If a job assigned to that VO (in one of the `SUBMITTED`/`ACTIVE`/`STAGING` states) has passed the global timeout interval from its submission timestamp, it gets canceled by the FTS server.
If you have the job ids, you can send a job cancellation request. You may give the job id (cancel all files) or the job id + a list of file ids (cancel only specific files): https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/api.html#delete-jobsjobidlist
This is a synchronous operation and is executed right away.
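Per the API documentation linked above, the cancellation is a DELETE on `/jobs/{job_id_list}`. A sketch of how the URL is assembled (the endpoint host below is a placeholder, and this only covers the whole-job form, not the per-file variant):

```python
def cancel_url(endpoint, job_ids):
    """Build the DELETE URL for an FTS job cancellation request.

    DELETE /jobs/{job_id_list} cancels all the listed jobs; the list is
    comma-separated. Illustrative helper, not an FTS client API.
    """
    return "%s/jobs/%s" % (endpoint.rstrip("/"), ",".join(job_ids))
```

The actual request would then be issued with the user's grid credentials against the FTS REST endpoint.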
I can point you to the `fts-rest-transfer-cancel` CLI tool as an example:
https://gitlab.cern.ch/fts/fts-rest/-/blob/v3.10.1/src/fts3/cli/jobcanceller.py#L50
The `max_time_in_queue` applies per job, not per retry. Using a timeout of 6h with 3 retries would mean an expected "average wait time" of 1.5 hours per transfer attempt.
(Of course, in reality, it doesn't work that way. The transfer will stay in `SUBMITTED` until scheduled. If max_time_in_queue or the global timeout is reached, then the job gets cancelled.)
When a transfer fails but retries are still available, it goes back to the `SUBMITTED` state.
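The arithmetic behind that "average wait time", spelled out: the job-wide timeout is a budget shared by the initial attempt plus the retries.

```python
# 6h max_time_in_queue shared across 1 initial attempt + 3 retries:
timeout_hours = 6
retries = 3
attempts = 1 + retries                    # 4 attempts in total
budget_per_attempt = timeout_hours / attempts
```

So each attempt effectively gets 1.5 hours of queue budget on average, though in practice the budget is consumed unevenly.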
You can start to intuitively see that no particular priority is given to transfers that have had retries.
I. Reuse
There is the Reuse job type, which helps if you're dealing with many small files between the same source and destination. During the submission, set the `reuse` flag to True:
https://fts3-docs.web.cern.ch/fts3-docs/fts-rest/docs/easy/submit.html#group-transfers-in-a-job
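Assuming the payload shape described in that easy-submission page, a reuse job might look like the sketch below. The dict is built by hand here so the shape is visible (with the real easy bindings the flag would be passed to the submission call instead); hostnames and paths are placeholders:

```python
def build_reuse_job(transfers):
    """Assemble a submission payload with the job-wide reuse flag set.

    Sketch under the assumption that the fts3-rest easy-submission
    format is a dict of per-file entries plus job-wide params.
    """
    return {
        "files": transfers,
        "params": {"reuse": True},
    }

# Many small files between the same source and destination endpoints:
job = build_reuse_job([
    {"sources": ["gsiftp://source-host/path/file%d" % i],
     "destinations": ["gsiftp://dest-host/path/file%d" % i]}
    for i in range(3)
])
```

The point of reuse is that all these transfers ride on one session instead of paying the connection setup cost per file.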
I suggest we do some test runs for ASO use-case on FTS3-Pilot before using this submission model in CMS production.
II. Priority and activity shares
The FTS Scheduler takes two parameters into account: the transfer priority and the activity shares.
The priority mechanism is rather simple. Each transfer is assigned a priority by the user at submission time. When FTS schedules jobs, it picks the ones matching the highest priority.
The activity shares are more interesting. Activity is a user-assigned label given at submission time, when creating a new transfer. FTS will try to schedule jobs in a weighted share, according to the configured activity weights.
Unfortunately, the FTS3-CMS instance uses the same priority = 3 for all jobs, and activity shares are not set up.
You may assign a higher priority to ASO jobs, but it may not be seen as fair play. Also, the priority system can lead to starvation, so use it with care. (However, as seen 3 weeks ago, a very long queue also leads to starvation.)
I would encourage setting up the Activity Shares system on the FTS3-CMS instance for better scheduling.
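For illustration, this is roughly where the two knobs sit in a submission: activity is a per-file label, priority a job-wide parameter. Treat the field names as assumptions from the fts3-rest docs; "ASO" is an example label, not a configured share:

```python
# Per-file entry carrying an activity label (placeholder hostnames):
transfer = {
    "sources": ["gsiftp://source-host/store/user/file.root"],
    "destinations": ["gsiftp://dest-host/store/user/file.root"],
    "activity": "ASO",
}

# Job-wide parameters; 3 is the priority FTS3-CMS currently uses
# for every job, which is why priorities have no effect there today.
params = {"priority": 3}
```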
Hope this proves useful (even if a bit overwhelming).
Cheers, Mihai
erratum:
About the reuse: the underlying mechanism is not in fact `gfal2_copy_bulk()`, but some particular GridFTP flags within Gfal2. This means that Reuse, in its current form, only makes sense for the GridFTP protocol.
In light of the FTS behavior explanations and of https://github.com/dmwm/CRABServer/pull/6573, this condition should not happen anymore. I will open another, less urgent issue about improving FTS use by ASO in light of the explanations from Mihai in https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2776329
I thought we had sorted out all things in this area years ago. But investigating the current FTS mess I found that when the PostJob hits the 24h timeout it fails to set the status to KILLED in transfersdb
Digging in, this is "as designed" due to https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/src/python/CRABInterface/RESTFileUserTransfers.py#L169-L185 since it only kills transfers in `NEW` state, i.e. those for which an FTS job has not been submitted yet. This is likely the reason for #6441 and #5891
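A minimal model of why already-submitted transfers slip through (state names from the linked code; the rest is illustrative, not the actual transfersdb logic):

```python
def kill_pending(documents):
    """Mark as KILLED only the transfers for which no FTS job was
    submitted yet; anything already SUBMITTED to FTS is left alone."""
    for doc in documents:
        if doc["state"] == "NEW":
            doc["state"] = "KILLED"
    return documents

# A transfer already handed to FTS keeps its state when the PostJob
# hits the 24h timeout -- which is the behavior observed above.
docs = [{"id": 1, "state": "NEW"}, {"id": 2, "state": "SUBMITTED"}]
kill_pending(docs)
```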
A thorough understanding is (again) needed before changing anything, and we need to document things better. In particular I am wondering why transfer jobs get stuck for one day inside FTS given that we submit with https://github.com/dmwm/CRABServer/blob/8d87d579f4fa12cf460114c23a8dd3bd22945b23/scripts/task_process/FTS_Transfers.py#L291 and I clearly remember that when discussing this long ago with @dciangot we came to the conclusion that there is no need to kill FTS transfers since they will be gone by themselves before the 24h timeout hits. Maybe they changed something in FTS :-( Here's an example of a single-file FTS transfer job which is still in SUBMITTED after 6900 sec, in spite of the 600 sec timeout indicated in the submission