dmwm / AsyncStageout


Report transfers errors in files_database in couch #1113

Closed DMWMBot closed 11 years ago

DMWMBot commented 13 years ago

Currently, the reason for an FTS transfer failure can be seen only in the transfer log files. files_database reports only the status of such a transfer as "failed", without further details. It would be useful to have the failure reason in couch, so that the operator does not need to open the log files every time a transfer fails. The log files would then be needed only for deeper debugging. To address this, a new attribute, such as FailureReason or something similar, should be added to the files_database documents, with the reason for the transfer failure as its value.

Are there any comments?
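A minimal sketch of what the proposed attribute could look like on a document, written in Python since ASO is a Python project. The document is modelled as a plain dict; the `state` key and the helper name `mark_failed` are assumptions for illustration and are not taken from the ASO code, and only the `FailureReason` name comes from the proposal above.

```python
def mark_failed(doc, reason):
    """Return a copy of a files_database-style document marked as failed.

    `doc` is a plain dict standing in for a couch document. The "state"
    key is an assumed field name; "FailureReason" is the attribute name
    suggested in this ticket.
    """
    updated = dict(doc)  # shallow copy, so the caller's dict is not modified
    updated["state"] = "failed"
    updated["FailureReason"] = reason
    return updated
```

The updated dict would then be saved back to couch by whatever client the component already uses; that step is left out here on purpose.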

drsm79 commented 13 years ago

metson: I don't think this should be added in the short-medium term, for the following reasons:

Longer term I think we should look into:

I'm not going to close this ticket, because I think there's work to be done here, but I'm going to mark it very low priority - there are other more important things to do (in CRAB and AsyncStageout) first.

spigad commented 12 years ago

spiga: I'd say that Simon's comment is still valid.

DMWMBot commented 12 years ago

riahi: The ftslog file is parsed by the parse_ftscp_results method in ASO to determine whether a transfer failed or succeeded. This method can be extended to add log documents to a new ASO database (e.g. asynctransfer_logs).

One document per output transfer (not per copyjobfile) will be included in this database. The format of the ftscp output when a transfer request is submitted correctly to FTS is shown below (*). So, the basic fields of a document in asynctransfer_logs can be: Source, Destination, State, Retries, Reason, Duration. The ftslog files (logs of copyjobfile transfers) will also be uploaded to this database as attachments.

The basic index views, using CouchDB-Lucene, can be by_Reason and by_attachment.

How does this sound?

(*)
Source: srm://maite.iihe.ac.be:8443/srm/managerv2?SFN=/pnfs/iihe/cms/ph/sc4/store/temp/user/riahi/RelValProdTTbar/1328885421/v1/0000/C897A1B9-E959-E111-99B8-68B59972C1DC.root
Destination: srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/user/riahi/RelValProdTTbar/1328885421/v1/0000/C897A1B9-E959-E111-99B8-68B59972C1DC.root
State: Failed
Retries: 1
Reason: SOURCE error during TRANSFER_PREPARATION phase: [USER_ERROR] source file doesn't exist
Duration: 0
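The record format above maps fairly directly onto the proposed document fields. The following is only a sketch of how one such record could be turned into a dict, not the existing parse_ftscp_results code. The function name `parse_ftscp_record` is hypothetical, and the sketch assumes the six labels always appear in the order shown and that Retries and Duration are integers, as in the sample.

```python
import re

# Field labels, in the order they appear in the ftscp output sample above.
FIELDS = ("Source", "Destination", "State", "Retries", "Reason", "Duration")

# Capture the text between consecutive labels. Matching on the next label
# (rather than on ":") keeps colons inside the Reason text intact.
_PATTERN = re.compile(
    r"\s+".join(r"{0}:\s*(?P<{0}>.*?)".format(name) for name in FIELDS) + r"\s*$",
    re.DOTALL,
)


def parse_ftscp_record(text):
    """Parse one ftscp result record into a dict, or return None if no match.

    Hypothetical helper: assumes all six labels are present and in order.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None
    record = {name: value.strip() for name, value in match.groupdict().items()}
    # Assumption: both counters are plain integers, as in the sample.
    record["Retries"] = int(record["Retries"])
    record["Duration"] = int(record["Duration"])
    return record
```

Because the pattern only relies on whitespace between labels, it accepts the record whether the fields are on separate lines or on a single line.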

cinquo commented 12 years ago

mcinquil: I would suggest adding the timestamp (gmtime) of when the failure happened.
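One possible way to do this, as a sketch only: stamp the log document with a UTC time string built from `time.gmtime`. The `Timestamp` field name, the ISO-like format and the helper name `with_timestamp` are assumptions, not something agreed in this ticket.

```python
import time


def with_timestamp(doc, now=None):
    """Return a copy of `doc` with a UTC "Timestamp" field added.

    `now` is seconds since the epoch; it defaults to the current time.
    The field name and the string format are illustrative assumptions.
    """
    stamped = dict(doc)
    moment = time.gmtime(now) if now is not None else time.gmtime()
    stamped["Timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", moment)
    return stamped
```

Passing `now` explicitly makes it possible to record the time at which the failure was observed rather than the time at which the document is written.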

HassenRiahi commented 11 years ago

The pull request https://github.com/dmwm/AsyncStageout/pull/3938 has been created to address this ticket, namely to report transfer errors in files_database. I will open 2 new tickets from this one to:

1- address the propagation of transfer errors to end users
2- make the transfer logs easily available to DataOps in couch to follow up on specific issues (maybe by using CouchDB-Lucene)