Closed pfjacques closed 6 years ago
Hi Pieter, the Abandoning alert happens when PhEDEx fails to check the status of the submitted FTS job within a certain timeout (default 1 hour). You can adjust the timeout with the -job-awol option, see https://github.com/dmwm/PHEDEX/blob/master/Toolkit/Transfer/FileDownload#L113
But in your case it looks like all status checks are failing:
I see a lot of "Error reading the server's reply with the status of the job's files: expected value" in [*]. Does your proxy look OK? Can you check the status manually from the server node?
Natalia.
[*] https://cmsweb.cern.ch/phedex/prod/Activity::ErrorInfo?tofilter=T3_US_Rutgers&fromfilter=.&report_code=.&xfer_code=.&to_pfn=.&from_pfn=.&log_detail=.&log_validate=.&.submit=Update#
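To illustrate the timeout behaviour described above, here is a minimal Python sketch. It is not the PhEDEx implementation: the function name `is_awol` and its arguments are hypothetical, and it assumes the timeout is measured from the last successful status check (or from submission, if no check has succeeded yet).

```python
from datetime import datetime, timedelta


def is_awol(last_successful_check, now, timeout_seconds=3600):
    """Return True if no status check has succeeded within the timeout.

    Hypothetical illustration only; 3600 seconds mirrors the default
    of one hour mentioned above.
    """
    return now - last_successful_check > timedelta(seconds=timeout_seconds)


# A job submitted at 19:49:58 whose status checks all fail would be
# considered abandoned once more than one hour has passed.
submitted = datetime(2018, 1, 24, 19, 49, 58)
print(is_awol(submitted, datetime(2018, 1, 24, 20, 50, 35)))
```

Raising the timeout only delays the alert; it does not help if every status check fails.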
Looking at the FTS log for eb3ec312-0246-11e8-b31a-a0369f23cf8e, the transfer was cancelled by a timeout. Since this is an operational issue and not a development issue, please open a GGUS ticket. I am closing this.
Natalia,
I understand that the abandoning alert happens after a 1 hour timeout. Since all the status checks are failing, changing the timeout itself won't help.
I generate the proxy with voms-proxy-init -voms cms -rfc -out new777.tmp -hours 192 -valid 192:00
I generate a new proxy once a day via a cron job. The underlying certificate is my personal grid certificate, which is valid until May 8th, so I think the proxy should be OK. In any case, this is the procedure I've been using for several years now, and it has always worked OK previously.
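As a quick sanity check on the proxy lifetime, something like the following Python sketch could be run on the node. It assumes that `voms-proxy-info -file <path> -timeleft` prints the remaining lifetime in seconds, and the helper names and the 24-hour threshold are hypothetical choices for illustration.

```python
import subprocess


def parse_timeleft(output):
    """Convert the text printed by voms-proxy-info -timeleft to seconds."""
    return int(output.strip().splitlines()[-1])


def seconds_left(proxy_path):
    """Ask voms-proxy-info how long the given proxy file remains valid."""
    result = subprocess.run(
        ["voms-proxy-info", "-file", proxy_path, "-timeleft"],
        capture_output=True, text=True, check=True,
    )
    return parse_timeleft(result.stdout)


def needs_renewal(seconds, min_hours=24):
    """True if fewer than min_hours remain (threshold is an assumption)."""
    return seconds < min_hours * 3600
```

For example, `needs_renewal(seconds_left("new777.tmp"))` should be false shortly after the daily cron job has run, since a fresh 192-hour proxy has 691200 seconds left.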
I'm not sure I know how to check the status manually. Can you provide details on that?
Thanks for your assistance with this,
+-------------------------------------------------+
| Pieter F. Jacques (jacques@physics.rutgers.edu) |
| Serin Physics Laboratory, Rutgers University    |
| 136 Frelinghuysen Road                          |
| Piscataway, NJ 08854-8019 USA                   |
| Telephone: 848-445-8977                         |
+-------------------------------------------------+
On Fri, 26 Jan 2018, nataliaratnikova wrote:
> Hi Pieter,
> the Abandoning alert happens when PhEDEx fails to check the status of the submitted FTS job within a certain timeout (default 1 hour). You can adjust the timeout with the -job-awol option, see
> https://github.com/dmwm/PHEDEX/blob/master/Toolkit/Transfer/FileDownload#L113
> But in your case it looks like all status checks are failing:
> I see a lot of "Error reading the server's reply with the status of the job's files: expected value" in [*]. Does your proxy look OK? Can you check the status manually from the server node?
> Natalia.
> [*] https://cmsweb.cern.ch/phedex/prod/Activity::ErrorInfo?tofilter=T3_US_Rutgers&fromfilter=.&report_code=.&xfer_code=.&to_pfn=.&from_pfn=.&log_detail=.&log_validate=.&.submit=Update#
-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/dmwm/PHEDEX/issues/1112#issuecomment-360832623
Where do I go to open a GGUS ticket?
I just sent an email to your professional address, asking the site support team to help you open a GGUS ticket.
Hello Pieter, please go to https://ggus.eu/?mode=ticket_cms to open a GGUS ticket. For "CMS Support Unit", please select "CMS Datatransfers" to route the ticket to the team that handles and assists with transfer issues. If you don't have a GGUS account yet, please go to "Registration" in the left menu and register. Please select "support access" so you can update tickets. Thanks,
On Jan. 24 I approved PhEDEx request 1200285 for transfer of 4.5TB (2457 files) to T3_US_Rutgers. The transfers started a little later that day and were completing successfully, except that the download-srm file has many entries like this:
2018-01-24 19:49:58: FileDownload[4796]: FTS job JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e submitted
2018-01-24 19:49:58: QMon[4796]: Queueing JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e at priority 1
2018-01-24 19:49:58: FileDownload[4796]: FTS job JOBID=c18837f4-013f-11e8-b4db-a0369f23cf8e submitted
2018-01-24 19:49:58: QMon[4796]: Queueing JOBID=c18837f4-013f-11e8-b4db-a0369f23cf8e at priority 1
2018-01-24 19:50:01: QMon[4796]: alert: ListJob for c17a8726-013f-11e8-98b8-a0369f23cf8e returned error: ended with status 1
2018-01-24 19:50:08: QMon[4796]: alert: ListJob for c18837f4-013f-11e8-b4db-a0369f23cf8e returned error: ended with status 1
The "ended with status 1" message recurs for about one hour, after which:
2018-01-24 20:50:35: QMon[4796]: alert: Abandoning JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e after timeout (3600 seconds)
At that point another file transfer process runs, downloading another of the 2457 requested files.
This is a problem because the one hour delay means that files are being transferred at a very slow rate.
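To quantify the delay, the log entries can be summarized with a short script. This is a minimal sketch that assumes the download-srm log uses exactly the message formats quoted above; the function name `summarize` and the regular expressions are my own and may need adjusting for other PhEDEx versions.

```python
import re
from collections import Counter
from datetime import datetime

STAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
# One entry runs from a timestamp up to the next timestamp (or end of text),
# so entries are found whether or not they sit on separate lines.
ENTRY = re.compile(rf"({STAMP}): \w+\[\d+\]: (.*?)(?=\s+{STAMP}: |\s*\Z)", re.S)
SUBMITTED = re.compile(r"FTS job JOBID=(\S+) submitted")
LISTJOB_ERR = re.compile(r"alert: ListJob for (\S+) returned error")
ABANDONED = re.compile(r"alert: Abandoning JOBID=(\S+) after timeout")


def summarize(log_text):
    """Return (ListJob error count per job, seconds from submit to abandon)."""
    submitted = {}
    errors = Counter()
    waited = {}
    for stamp, message in ENTRY.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if m := SUBMITTED.search(message):
            submitted[m.group(1)] = when
        elif m := LISTJOB_ERR.search(message):
            errors[m.group(1)] += 1
        elif m := ABANDONED.search(message):
            job = m.group(1)
            if job in submitted:
                waited[job] = (when - submitted[job]).total_seconds()
    return errors, waited
```

Calling `summarize(open("download-srm").read())` (the file name is whatever the agent log is called locally) gives, per job, how many status checks failed and how long the job was held before being abandoned.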
I thought this might be because I was running PhEDEx version 4.2.1, so I upgraded to 4.2.2. That did not help and in fact made the situation worse: under 4.2.2, all entries in download-srm show failures with an exit status code of 1. They also contain ... detail=(Could not submit to FTS ~~ ) ...
Please advise regarding steps I should take to debug this and get PhEDEx working correctly again.