dmwm / PHEDEX

CMS data-placement suite
8 stars 18 forks source link

Transfers failing because the FTS backend cannot parse submit command output #915

Open ericvaandering opened 10 years ago

ericvaandering commented 10 years ago

Original Savannah ticket 98594 reported by None on Thu Nov 1 12:43:26 2012.

Hello,

a rare transfer failure mode seen with the FileDownload agent with the FTS backend:

the FTS backend expects an FTS job ID as output of the glite-transfer-submit command, but sometimes the output is not what is expected, so the transfer fails with "Could not submit to FTS" and an undefined job ID.

An example Transfer Log is reported below:

+verbatim+ ---------- JOB-LOG ---------- 1351642390 created... backend: PHEDEX::Transfer::FTS glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1347966267.13180/copyjob JOBID=undefined, cannot monitor this job: [...] ---------- RAWOUTPUT ---------- Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srmc2ba17d2-22ef-11e2-bf5c-9605fd599dc3 -verbatim-

Looking at the RAWOUTPUT, it seems that the output of glite-transfer-submit (c2ba17d2-22ef-11e2-bf5c-9605fd599dc3) is mixed with part of the output of a glite-transfer-status command executed for a different transfer! (Source: srm://srm-cms.gridpp.rl.ac.uk:8443/srm)

Switched on POCO_DEBUG on CERN agents to try to figure out what's going on.

Cheers Nicolo'

ericvaandering commented 10 years ago

Comment by wildish on Fri Nov 2 03:37:35 2012

Hi,

I can think of two possible sources of this:

1) the agent fails to parse the jobID correctly, but doesn't then abandon the task correctly, thereby mixing things up somehow. That part of the code is perhaps less well tested than the rest.

2) I recall there was a bug in PoCo::Child that Ricky and I took ages to track down about 3-4 years ago. IIRC the wheelID was an unsigned int that was not protected against wrap-around, so when it did it managed to mix things up. I thought we had swatted that one, but maybe it has come back somehow. I'll see if I can find more details about that bug.

ericvaandering commented 10 years ago

Comment by magini on Fri Nov 2 09:59:45 2012

Hi Tony,

maybe it could be related to this ancient bug report?

https://savannah.cern.ch/bugs/?func=detailitem&item_id=45798

The symptoms today are similar: a bunch of tasks failing during submission with "Could not submit to FTS" because the agent cannot parse the glite-transfer-submit output, and at almost the same time we see alerts about jobs with "No "DESTINATION" key!", which means that the agent cannot parse the output of glite-transfer-status.

This strengthens the hypothesis that PoCo::Child is mixing the output of two different child commands.

Three years ago you put a protection in the agent to stop it from crashing, but it seems that the underlying bug that was causing the mixup was never fixed.

Let's wait until we can reproduce this issue at CERN.

In this case the issue is not critical, because the transfers will simply be resubmitted later, but I suppose that in other cases it could cause more serious issues (for example, if the agent mixes up the output of two different FileDownloadVerify commands, it could mark as 'good' the result of a 'bad' transfer).

N.

ericvaandering commented 10 years ago

Comment by magini on Mon Nov 5 12:55:06 2012

Hi,

another transfer failure like this:

+verbatim+ ---------- JOB-LOG ---------- 1352054495 created... backend: PHEDEX::Transfer::FTS glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1351790993.1594/copyjob JOBID=undefined, cannot monitor this job: job = [...] ---------- RAWOUTPUT ---------- Source: 435e45f9-26af-11e2-b29d-ffcd381ede4c -verbatim-

POE debugging information for this command, and for a previous command with the same PID:

+verbatim+ 2012-11-04 17:24:15: QMon[11366]: debug: dequeue JOBID=7338cb18-26a4-11e2-bf5c-9605fd599dc3

> POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 7338cb18-26a4-11e2-bf5c-9605fd599dc3", wheel=97282, pid=4494

2012-11-04 18:41:35: FileDownload[11366]: debug: start copy job job.1351790993.1594 > POE::Component::Child - run(): "glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -f /data/ProdNodes/Prod_T2_CH_CERN/state/download-t1/work/job.1351790993.1594/copyjob", wheel=99314, pid=4494 -verbatim-

N.

ericvaandering commented 10 years ago

Comment by magini on Wed Nov 7 09:45:19 2012

Hi,

and another one. This time I was able to identify exactly which two processes got mixed up:

glite-transfer-submit process:

+verbatim+ glite-transfer-submit -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t2/work/job.1351790899.4341/copyjob JOBID=undefined, cannot monitor this job: job = [...] ---------- RAWOUTPUT ---------- Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_Estonia/CERN/411/LoadTest07_Estdab2abb5-28da-11e2-a03a-96f506f1f878 -verbatim-

+verbatim+ 2012-11-07 12:58:40: FileDownload[22322]: debug: copy job job.1351790899.4341 status: pend=0 ready=1 done=0 lost=0 2012-11-07 12:58:40: FileDownload[22322]: debug: start copy job job.1351790899.4341 [...] > POE::Component::Child - run(): "glite-transfer-submit -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t2/work/job.1351790899.4341/copyjob", wheel=103163, pid=24705 -verbatim-

glite-transfer-status process:

+verbatim+ 2012-11-07 12:58:40: QMon[22322]: debug: dequeue JOBID=d133a3e8-28da-11e2-a03a-96f506f1f878 > POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts-t2-service.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer d133a3e8-28da-11e2-a03a-96f506f1f878", wheel=103165, pid=24709 [...] 2012-11-07 12:58:41: QMon[22322]: alert: ListJob for d133a3e8-28da-11e2-a03a-96f506f1f878 returned error: No "DESTINATION" key! : DURATION=0 , REASON=(null) , RETRIES=0 , SOURCE=srm://ganymede.hep.kbfi.ee:8888/srm/v2/server?SFN=/hdfs/cms/LoadTest07/LoadTest07_Estonia_3F onia_3F_lDaaTKuxVLWPJ2qo_411 , STATE=Active 2012-11-07 12:58:41: QMon[22322]: debug: WorkStats: class=JOBS key=d133a3e8-28da-11e2-a03a-96f506f1f878 value=undefined -verbatim-

The processes have different wheel IDs and PIDs, but they were running simultaneously.

ericvaandering commented 10 years ago

Comment by magini on Tue Dec 11 11:10:20 2012

Hi,

another case caught live, this time printing out +verbatim+ $args->{out} -verbatim- as it is in PHEDEX::Core:JobManager::_child_stdout

+verbatim+ > POE::Component::Child - run(): "glite-transfer-submit -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer -t CMS_DEFAULT -f /data/DebugNodes/Debug_T1_CH_CERN_Buffer/state/download-t1/workdir/job.1355159361.372/copyjob", wheel=14492, pid=21542 2012-12-11 07:59:00: QMon[9713]: debug: dequeue JOBID=96289d8e-4368-11e2-ae15-cf1c1ceec317 > POE::Component::Child - run(): "glite-transfer-status -l --verbose -s https://fts22-t1-import.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer 96289d8e-4368-11e2-ae15-cf1c1ceec3 17", wheel=14493, pid=21544 [...] Request ID: 96289d8e-4368-11e2-ae15-cf1c1ceec317 Status: Active Channel: PIC-CERN Client DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cassel/CN=696497/CN=Rapolas Kaselis Reason: <None> Submit time: 2012-12-11 07:58:45.492 Files: 5 Priority: 3 VOName: cms Done: 0 Active: 5 Pending: 0 Ready: 0 Canceled: 0 Failed: 0 Finishing: 0 Finished: 0 Submitted: 0 Hold: 0 Waiting: 0 Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_8D Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_8D_XCx2xKk0SMU3cXKO_1719 State: Active Retries: 0 Reason: (null) Duration: 0 Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_09 Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_09_znpj99N5wMoO4AE2_1719 State: Active Retries: 0 Reason: (null) Duration: 0 Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_0D Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_0D_FFnKKpGtUppaDBbc_1719 State: Active Retries: 0 Reason: (null) Duration: 0 Source: srm://srmcms.pic.es:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_8D Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_8D_4Lx3YgczCBewyaSu_1719 State: Active Retries: 0 Reason: (null) Duration: 0 Source: srm://srmcms.pic.e9f2b7071-4368-11e2-ae15-cf1c1ceec317 s:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_85 Destination: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/PhEDEx_LoadTest07/LoadTest07_Debug_PIC/CERN/1719/LoadTest07_PIC_85_1pOJamjZ6tYFrcWd_1719 State: Active Retries: 0 Reason: (null) Duration: 0 > POE::Component::Child - close() > POE::Component::Child - close() -verbatim-

The interesting line is this one:

+verbatim+ Source: srm://srmcms.pic.e9f2b7071-4368-11e2-ae15-cf1c1ceec317 s:8443/srm/managerv2?SFN=/pnfs/pic.es/data/cms/store/PhEDEx_LoadTest07/LoadTest07_Prod_PIC/T2_Spain/LoadTest07_PIC_85 -verbatim-

From this, we see that the output of the two commands (glite-transfer-submit and glite-transfer-status) is already mixed up when POE::Component::Child passes it to PHEDEX::Core:JobManager::_child_stdout

N.