dmwm / PHEDEX

CMS data-placement suite
PhEDEx transfer issues to T3_US_Rutgers #1112

Closed pfjacques closed 6 years ago

pfjacques commented 6 years ago

On Jan. 24 I approved PhEDEx request 1200285 for transfer of 4.5TB (2457 files) to T3_US_Rutgers. The transfers started a little later that day and were completing successfully, except that the download-srm file has many entries like this:

2018-01-24 19:49:58: FileDownload[4796]: FTS job JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e submitted
2018-01-24 19:49:58: QMon[4796]: Queueing JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e at priority 1
2018-01-24 19:49:58: FileDownload[4796]: FTS job JOBID=c18837f4-013f-11e8-b4db-a0369f23cf8e submitted
2018-01-24 19:49:58: QMon[4796]: Queueing JOBID=c18837f4-013f-11e8-b4db-a0369f23cf8e at priority 1
2018-01-24 19:50:01: QMon[4796]: alert: ListJob for c17a8726-013f-11e8-98b8-a0369f23cf8e returned error: ended with status 1
2018-01-24 19:50:08: QMon[4796]: alert: ListJob for c18837f4-013f-11e8-b4db-a0369f23cf8e returned error: ended with status 1

The "ended with status 1" message recurs for about one hour, after which:

2018-01-24 20:50:35: QMon[4796]: alert: Abandoning JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e after timeout (3600 seconds)

At that point another file transfer process runs, downloading another of the 2457 requested files.

This is a problem because the one hour delay means that files are being transferred at a very slow rate.
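To quantify the delay described above, the download-srm log can be summarized per FTS job. The following is a minimal sketch, assuming the log lines keep the `timestamp: Agent[pid]: message` shape shown in the excerpts; the function name `summarize` and the sample list are illustrative and not part of PhEDEx.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line shape, taken from the excerpts above:
#   2018-01-24 19:49:58: QMon[4796]: <message>
LINE_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (\w+)\[(\d+)\]: (.*)")
# FTS job IDs in the excerpts look like UUIDs.
JOB_RE = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def summarize(lines):
    """Return per-job submit time, ListJob error count, and abandon time."""
    jobs = defaultdict(lambda: {"submitted": None, "errors": 0, "abandoned": None})
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed shape
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(4)
        job = JOB_RE.search(message)
        if not job:
            continue
        entry = jobs[job.group(0)]
        if "submitted" in message:
            entry["submitted"] = stamp
        elif "ListJob" in message and "returned error" in message:
            entry["errors"] += 1
        elif "Abandoning" in message:
            entry["abandoned"] = stamp
    return dict(jobs)


# Sample lines copied from the excerpts in this report.
SAMPLE = [
    "2018-01-24 19:49:58: FileDownload[4796]: FTS job JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e submitted",
    "2018-01-24 19:49:58: QMon[4796]: Queueing JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e at priority 1",
    "2018-01-24 19:50:01: QMon[4796]: alert: ListJob for c17a8726-013f-11e8-98b8-a0369f23cf8e returned error: ended with status 1",
    "2018-01-24 20:50:35: QMon[4796]: alert: Abandoning JOBID=c17a8726-013f-11e8-98b8-a0369f23cf8e after timeout (3600 seconds)",
]

if __name__ == "__main__":
    for job_id, info in summarize(SAMPLE).items():
        waited = None
        if info["submitted"] and info["abandoned"]:
            waited = (info["abandoned"] - info["submitted"]).total_seconds()
        print(job_id, "ListJob errors:", info["errors"], "seconds until abandoned:", waited)
```

Run against the full log, this would show how many status checks failed for each job and how long each job occupied its slot before being abandoned.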

I thought this might be because I was running PhEDEx version 4.2.1, so I upgraded to 4.2.2. That did not help and in fact made the situation worse: under 4.2.2 all entries in download-srm show failures with an exit status code of 1. They also contain ... detail=(Could not submit to FTS ~~ ) ...

Please advise regarding steps I should take to debug this and get PhEDEx working correctly again.

+-------------------------------------------------+
| Pieter F. Jacques (jacques@physics.rutgers.edu) |
| Serin Physics Laboratory, Rutgers University    |
| 136 Frelinghuysen Road                          |
| Piscataway, NJ 08854-8019 USA                   |
| Telephone: 848-445-8977                         |
+-------------------------------------------------+

nataliaratnikova commented 6 years ago

Hi Pieter, the Abandoning alert is raised when PhEDEx has still not managed to check the status of a submitted FTS job after a certain timeout (default 1 hour). You can adjust the timeout with the -job-awol option, see https://github.com/dmwm/PHEDEX/blob/master/Toolkit/Transfer/FileDownload#L113

But in your case it looks like all status checks are failing:

I see a lot of "Error reading the server's reply with the status of the job's files: expected value " in [*]. Does your proxy look OK? Can you check the status manually from the server node?

Natalia.

[*] https://cmsweb.cern.ch/phedex/prod/Activity::ErrorInfo?tofilter=T3_US_Rutgers&fromfilter=.&report_code=.&xfer_code=.&to_pfn=.&from_pfn=.&log_detail=.&log_validate=.&.submit=Update#
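A manual check of the proxy and of one job's status could be scripted roughly as follows. This is only a sketch under several assumptions: that the `voms-proxy-info` and `fts-transfer-status` command-line clients are installed on the node, that the flags shown match the installed versions, and that `FTS_ENDPOINT` is replaced with the FTS server the site's PhEDEx agents actually submit to (the URL below is a placeholder, not a real server). The proxy file name and job ID are taken from this thread.

```python
import shutil
import subprocess

# Placeholder only: substitute the FTS server your agents submit to.
FTS_ENDPOINT = "https://fts.example.org:8446"


def proxy_timeleft_cmd(proxy_file):
    """Command that prints the remaining proxy lifetime in seconds."""
    return ["voms-proxy-info", "-file", proxy_file, "-timeleft"]


def fts_status_cmd(endpoint, job_id):
    """Command that asks the FTS server for the state of one job."""
    return ["fts-transfer-status", "-s", endpoint, job_id]


def run(cmd):
    """Run cmd; return (exit code, combined output), or (None, reason) if absent."""
    if shutil.which(cmd[0]) is None:
        return None, f"{cmd[0]} not found on PATH"
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, (proc.stdout + proc.stderr).strip()


if __name__ == "__main__":
    # Proxy file name and job ID as they appear in this thread.
    checks = [
        proxy_timeleft_cmd("new777.tmp"),
        fts_status_cmd(FTS_ENDPOINT, "c17a8726-013f-11e8-98b8-a0369f23cf8e"),
    ]
    for cmd in checks:
        code, output = run(cmd)
        print(" ".join(cmd))
        print("  exit code:", code)
        print("  output:", output)
```

If the status command fails by hand with the same non-zero exit code that QMon reports, the problem lies between the node and the FTS server (proxy, client version, or server reply) rather than in the PhEDEx agent itself.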

nataliaratnikova commented 6 years ago

Looking at the FTS log for eb3ec312-0246-11e8-b31a-a0369f23cf8e, the transfer was cancelled by timeout. Since this is an operational issue and not a development issue, please open a GGUS ticket. I am closing this.

pfjacques commented 6 years ago

Natalia,

I understand that the Abandoning alert happens after a 1 hour timeout. Since all the status checks are failing, changing the timeout itself won't help.

I generate the proxy with voms-proxy-init -voms cms -rfc -out new777.tmp -hours 192 -valid 192:00

I generate a new proxy once a day via a cron job. The underlying certificate is my personal grid certificate, which is valid until May 8th, so I think the proxy should be OK. In any case, this is the procedure I've been using for several years now, and it has always worked OK previously.

I'm not sure I know how to check the status manually. Can you provide details on that?

Thanks for your assistance with this,

+-------------------------------------------------+
| Pieter F. Jacques (jacques@physics.rutgers.edu) |
| Serin Physics Laboratory, Rutgers University    |
| 136 Frelinghuysen Road                          |
| Piscataway, NJ 08854-8019 USA                   |
| Telephone: 848-445-8977                         |
+-------------------------------------------------+

pfjacques commented 6 years ago

Where do I go to open a GGUS ticket?

nataliaratnikova commented 6 years ago

I just sent an email to your professional address, asking the site support team to help you open a GGUS ticket.

stlammel commented 6 years ago

Hello Pieter, please go to https://ggus.eu/?mode=ticket_cms to open a GGUS ticket. For "CMS Support Unit" please select "CMS Datatransfers" to route the ticket to the team that handles and assists with transfer issues. In case you don't have a GGUS account yet, please go to "Registration" in the left menu and register. Please select "support access" so you can update tickets. Thanks,