DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0
112 stars 174 forks source link

Problems with asynchronous FTS transfers (dirac-dms-replicate-and-register-request) v6r11p12 #2095

Closed martynia closed 8 years ago

martynia commented 9 years ago

Hello, We still have problems with FTS transfers. Although many issues have been fixed thanks to help of Christophe and Andrei, the current situation is that a request stays in a 'Scheduled' state forever.

A job appears in the FTSFile DB table but is not transferred to the FTSJob table. The FTSAgent throws an exception when trying to contact the FileCatalog. The exception message is of type <class 'DIRAC.RequestManagementSystem.Client.File.File'> caught here: FTSAgent.py, line 625:

except Exception, exceptMessage: log.exception( "Exception in processRequest", exceptMessage ) finally:

The FileCatalog fails with messages like this:

9-26 12:18:09 UTC DataManagement/FileCatalog INFO: ResolvePFN : True
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: UniqueGUID : False
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: ValidFileStatus : ['AprioriGood', 'Trash', 'Removing', 'Probing'] 2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: ValidReplicaStatus : ['AprioriGood', 'Trash', 'Removing', 'Probing'] 2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: VisibleFileStatus : ['AprioriGood']
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: VisibleReplicaStatus : ['AprioriGood']
2014-09-26 12:18:09 UTC DataManagement/FileCatalog ALWAYS: Listening at dips://dirac.grid.hep.ph.ic.ac.uk:9197/DataManagement/FileCatalog 2014-09-26 12:19:54 UTC DataManagement/FileCatalog ERROR: ConnectionError ConnectionError: Error while receiving arguments (146.179.247.169:48087)[hosts:dirac.grid.hep.ph.ic.ac.uk] Exception while reading from peer: (-1, 'Unexpected EOF') 2014-09-26 12:19:54 UTC DataManagement/FileCatalog ERROR: Error processing proposal ConnectionError: Error while receiving arguments (146.179.247.169:48087)[hosts:dirac.grid.hep.ph.ic.ac.uk] Exception while reading from peer: (-1, 'Unexpected EOF')

We know that the Catalog works fine with user requests made from a UI, and also works when the dms is used to create file replicas (dirac-dms-repliacate-and-register-request).

Any help will be greatly appreciated. thanks, JM

atsareg commented 9 years ago

Is it still persisting ? Looks like a communication error in the network.

martynia commented 9 years ago

Yes, it is still persiting. And only between the fts agent and the catalog. Otherwise the catalog works ok. Cheers, Janusz Martyniak

On 6 Oct 2014, at 23:08, Andrei Tsaregorodtsev notifications@github.com wrote:

Is it still persisting ? Looks like a communication error in the network.

— Reply to this email directly or view it on GitHub.

acasajus commented 9 years ago

What does the traceback of the FTSAgent show?

acasajus commented 9 years ago

BTW Anyone having a method longer than 40 lines should be publicly scourged and posted to the wall of shame for all eternity.

martynia commented 9 years ago

Hi Adria,

Thanks for looking into this. We were debugging this with Christophe up to the point where we reached a place where an exception is thrown (see log marked CHRIS). This led us to a conclusion that the FTSAgent failed to talk to the Catalog: I can talk to the Catalog from a command line:

[martynia@heppc184 dirac]$ dirac-dms-lfn-replicas /na62.vo.gridpp.ac.uk/martynia/test_file.bin {'Failed': {}, 'Successful': {'/na62.vo.gridpp.ac.uk/martynia/test_file.bin': {'LIVERPOOL-disk': 'srm://hepgrid11.ph.liv.ac.uk:8446/srm/managerv2?SFN=/dpm/ph.liv.ac.uk/home/na62.vo.gridpp.ac.uk/martynia/test_file.bin'}}}

Now the logs (mind 2 sets of logs, the second one shows the LFC logs). And yes, I agree with a comment that a method in question is far too long and it is hard to follow it and see where it crashes):

2014-09-26 11:55:54 UTC DataManagement/FTSAgent NOTICE: ---------------------------------------- 2014-09-26 11:55:54 UTC DataManagement/FTSAgent NOTICE: Starting cycle for module DataManagement/FTSAgent 2014-09-26 11:55:54 UTC DataManagement/FTSAgent NOTICE: Remaining 498 of 500 cycles 2014-09-26 11:55:54 UTC DataManagement/FTSAgent NOTICE: ---------------------------------------- 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/Monitoring DEBUG: Adding mark to MEM 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Discovering URL for service RequestManagement/ReqManager -> dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Connecting to: dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using certificate /opt/dirac/etc/grid-security/hostcert.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using key /opt/dirac/etc/grid-security/hostkey.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Authenticated peer (/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New session connecting to server at ('dirac.grid.hep.ph.ic.ac.uk', 9140) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New connection -> 146.179.247.169:9140 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Closing socket 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: CHRIS (execute): requestNames {'OK': True, 'rpcStub': (('dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager', {'skipCACheck': False, 'keepAliveLapse': 150, 'delegatedGroup': 'hosts', 'delegatedDN': '/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk', 'timeout': 120}), 'getRequestNamesList', (['Scheduled'], 100)), 'Value': [('TEST_12', 'Scheduled', datetime.datetime(2014, 9, 26, 11, 53, 58))]} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: found 1 requests to process: 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: => from internal cache: 0 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: => new read from RMS: 1 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Discovering URL for service RequestManagement/ReqManager -> dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/RequestManagement/ReqClient DEBUG: getRequest: attempting to get request. 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Connecting to: dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using certificate /opt/dirac/etc/grid-security/hostcert.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using key /opt/dirac/etc/grid-security/hostkey.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Authenticated peer (/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New session connecting to server at ('dirac.grid.hep.ph.ic.ac.uk', 9140) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New connection -> 146.179.247.169:9140 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Closing socket 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/Subprocess VERB: shellCall: voms-proxy-info -dont-verify-ac -file /tmp/tmp4j2UAU -all 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: CHRIS(execute): request {'OK': True, 'Value': <DIRAC.RequestManagementSystem.Client.Request.Request object at 0x238e290>} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: request 'TEST_12' enqueued for execution 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest): requestName TEST_12

2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest): waiting operation {'OK': True, 'Value': <DIRAC.RequestManagementSystem.Client.Operation.Operation object at 0x238e650>} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/Monitoring DEBUG: Adding mark to RequestsAtt

2014-09-26 11:55:54 UTC DataManagement/FTSAgent/execute INFO: CHRIS (execute): about to processAllResults 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12 INFO: start processRequest 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Discovering URL for service DataManagement/FTSManager -> dips://dirac.grid.hep.ph.ic.ac.uk:9191/DataManagement/FTSManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Connecting to: dips://dirac.grid.hep.ph.ic.ac.uk:9191/DataManagement/FTSManager 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using certificate /opt/dirac/etc/grid-security/hostcert.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Using key /opt/dirac/etc/grid-security/hostkey.pem 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Authenticated peer (/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New session connecting to server at ('dirac.grid.hep.ph.ic.ac.uk', 9191) 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: New connection -> 146.179.247.169:9191 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Closing socket 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest) ftsJobs {'OK': True, 'Value': []} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest) ftsJobs after filtre [] 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12/checkReadyReplicas INFO: CHRIS (checkReadyReplicas) : before scheduledFiles
2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12/checkReadyReplicas INFO: CHRIS (checkReadyReplicas) : request {'OK': True, 'Value': {'Status': 'Scheduled', 'LastUpdate': '2014-09-26 11:53:58', '__dirty': [], 'DIRACSetup': '', 'RequestID': 23L, 'OwnerDN': '/C=UK/O=eScience/OU=Imperial/L=Physics/CN=janusz martyniak', 'CreationTime': '2014-09-26 11:53:32', 'JobID': '', 'OwnerGroup': 'na62_user', 'SubmitTime': '2014-09-26 11:53:32', 'RequestName': 'TEST_12', 'Error': '', 'Operations': [{'Status': 'Scheduled', 'LastUpdate': '2014-09-26 11:53:58', 'TargetSE': 'IMPERIAL-disk', 'RequestID': '23', 'Files': [{'Status': 'Scheduled', 'Attempt': '1', 'ChecksumType': 'ADLER32', 'Checksum': 'd9b126db', 'FileID': '23', 'LFN': '/na62.vo.gridpp.ac.uk/martynia/test_file.bin', 'PFN': '', 'Error': '', 'GUID': '19BA23BF-5F9A-8657-A1A0-9A0C3DB06C43', 'OperationID': '0', 'Size': '5091161'}], 'CreationTime': '2014-09-26 11:53:32', 'SourceSE': '', 'SubmitTime': '2014-09-26 11:53:32', 'Catalog': '', 'Arguments': '', 'Error': '', 'Type': 'ReplicateAndRegister', 'Order': '0', 'OperationID': '23'}], 'SourceComponent': ''}} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/TEST_12/checkReadyReplicas INFO: CHRIS (checkReadyReplicas) : after scheduledFiles {'/na62.vo.gridpp.ac.uk/martynia/test_file.bin': <DIRAC.RequestManagementSystem.Client.File.File object at 0x238e1d0>} 2014-09-26 11:55:54 UTC DataManagement/FTSAgent/FileCatalogFactory DEBUG: Creating FileCatalog client 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Trying to load Resources.Catalog.FileCatalogClient 2014-09-26 11:55:54 UTC DataManagement/FTSAgent DEBUG: Trying to load DIRAC.Resources.Catalog.FileCatalogClient 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/FileCatalogFactory DEBUG: Loaded module FileCatalogClient 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Discovering URL for service DataManagement/FileCatalog -> dips://dirac.grid.hep.ph.ic.ac.uk:9197/DataManagement/FileCatalog 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Connecting to: dips://dirac.grid.hep.ph.ic.ac.uk:9197/DataManagement/FileCatalog 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Using certificate /opt/dirac/etc/grid-security/hostcert.pem 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Using key /opt/dirac/etc/grid-security/hostkey.pem 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Authenticated peer (/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk) 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: New session connecting to server at ('dirac.grid.hep.ph.ic.ac.uk', 9197) 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: New connection -> 146.179.247.169:9197 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Closing socket 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest): exception <class 'DIRAC.RequestManagementSystem.Client.File.File'> 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest) finally 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Discovering URL for service RequestManagement/ReqManager -> dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Connecting to: dips://dirac.grid.hep.ph.ic.ac.uk:9140/RequestManagement/ReqManager 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Using certificate /opt/dirac/etc/grid-security/hostcert.pem 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Using key /opt/dirac/etc/grid-security/hostkey.pem 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Authenticated peer (/C=UK/O=eScience/OU=Imperial/L=Physics/CN=dirac.grid.hep.ph.ic.ac.uk) 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: New session connecting to server at ('dirac.grid.hep.ph.ic.ac.uk', 9140) 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: New connection -> 146.179.247.169:9140 2014-09-26 11:55:55 UTC DataManagement/FTSAgent DEBUG: Closing socket 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest) put {'OK': True, 'Value': None} 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/TEST_12 INFO: CHRIS (processRequest) is there ftsJobs? [] 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/execute INFO: CHRIS (execute): done processAllResults 2014-09-26 11:55:55 UTC DataManagement/FTSAgent/Monitoring DEBUG: Adding mark to CPU 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: ---------------------------------------- 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Agent module DataManagement/FTSAgent run summary 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Executed 3 times previously 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Cycle took 0.92 seconds 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Average execution time: 0.39 seconds 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Polling time: 120 seconds 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Average execution/polling time: 0.32% 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: Cycle was successful 2014-09-26 11:55:55 UTC DataManagement/FTSAgent NOTICE: ----------------------------------------

And now the Catalog:

( I restarted it but it kepis throwing those errors agai and again):

09-26 12:18:09 UTC DataManagement/FileCatalog INFO: ResolvePFN : True
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: UniqueGUID : False
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: ValidFileStatus : ['AprioriGood', 'Trash', 'Removing', 'Probing'] 2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: ValidReplicaStatus : ['AprioriGood', 'Trash', 'Removing', 'Probing'] 2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: VisibleFileStatus : ['AprioriGood']
2014-09-26 12:18:09 UTC DataManagement/FileCatalog INFO: VisibleReplicaStatus : ['AprioriGood']
2014-09-26 12:18:09 UTC DataManagement/FileCatalog ALWAYS: Listening at dips://dirac.grid.hep.ph.ic.ac.uk:9197/DataManagement/FileCatalog 2014-09-26 12:19:54 UTC DataManagement/FileCatalog ERROR: ConnectionError ConnectionError: Error while receiving arguments (146.179.247.169:48087)[hosts:dirac.grid.hep.ph.ic.ac.uk] Exception while reading from peer: (-1, 'Unexpected EOF') 2014-09-26 12:19:54 UTC DataManagement/FileCatalog ERROR: Error processing proposal ConnectionError: Error while receiving arguments (146.179.247.169:48087)[hosts:dirac.grid.hep.ph.ic.ac.uk] Exception while reading from peer: (-1, 'Unexpected EOF')

On 7 Oct 2014, at 14:16, Adrià Casajús notifications@github.com wrote:

What does the traceback of the FTSAgent show?

— Reply to this email directly or view it on GitHub.

martynia commented 9 years ago

Hello, Any news on the subject ? cheers JM

martynia commented 9 years ago

Hello Andrei,

I think the problem is more complicate in fact. I tried to add a new vo (comet) and upload a file to Imperial storage element:

dirac-dms-add-file LFN:/comet.j-parc.jp/martynia/200mb-test-file.bin 200mb-test-file.bin IMPERIAL-comet-disk

It looks like it is successful:

Uploading /comet.j-parc.jp/martynia/200mb-test-file.bin Successfully uploaded file to IMPERIAL-comet-disk

but if I try to list it, the command hangs hangs. I restart the file catalog and then:

dirac-dms-lfn-replicas LFN:/comet.j-parc.jp/martynia/200mb-test-file.bin

{'Failed': {}, 'Successful': {'/comet.j-parc.jp/martynia/200mb-test-file.bin': {}}}

As you see the dictionary is not complete (the file is on the disk, though. I checked that).

Maybe before I try to debug it in more detail, what is the recommended production version of the server now ?

I’m on v6r11p12 on the server.

It might be that my FTS transfer problems this thread relates to are actually caused by catalog itself.

    thanks in advance, JM 

On 7 Oct 2014, at 08:47, JANUSZ MARTYNIAK janusz.martyniak@googlemail.com wrote:

Yes, it is still persiting. And only between the fts agent and the catalog. Otherwise the catalog works ok. Cheers, Janusz Martyniak

On 6 Oct 2014, at 23:08, Andrei Tsaregorodtsev notifications@github.com wrote:

Is it still persisting ? Looks like a communication error in the network.

— Reply to this email directly or view it on GitHub.

fstagni commented 9 years ago

Was this fixed? (looking for possibilities to close tasks...)

chaen commented 8 years ago

No news since ages, and very old version, so please close