dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

GridFTP incompatibilities with Globus Online #3545

Open ahaupt opened 6 years ago

ahaupt commented 6 years ago

Hi,

dCache version: 2.16.47

We are suffering from a rather long-standing incompatibility with Globus Online's GridFTP implementation. In our case, IceCube suffers from this problem. GO transfers files in parallel, but as soon as one file is transferred successfully, it cancels all other ongoing transfers.

Here is an example of a finished transfer:

09.25 15:14:52 [door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain:request] ["/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=jade/jade-lta.icecube.wisc.edu":16892:248:184.73.189.163] [00009B888CD56F6447778A58209BCDFF1237,176851709905] [/pnfs/ifh.de/acs/icecube/archive/data/exp/IceCube/2015/unbiased/PFDST/0318/9e20f17c-b429-44d3-bcba-1f8e8c0480dd.zip] icecube:pfdst@osm 1810144 0 {0:""}

And here is one transfer that gets cancelled at the same moment:

09.25 15:14:52 [door:GFTP-plum15-AAVaAuwI6ng@gridftp-plum15Domain:request] ["/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=jade/jade-lta.icecube.wisc.edu":16892:248:184.73.189.163] [000013047092D1644B3CB9C32F368DC4CCDD,0] [/pnfs/ifh.de/acs/icecube/archive/data/exp/IceCube/2015/unbiased/PFDST/1211/feaab171-9c23-46ee-8a6b-f1a53bf173f3.zip] icecube:pfdst@osm 1810481 0 {451:"Aborting transfer due to session termination"}
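To pick out aborted transfers from records like the two above, a minimal Python sketch could read the trailing status block. It assumes every record ends with `{<code>:"<message>"}`, as both quoted lines do; the function names are made up for illustration, and this is not an official dCache parser.

```python
import re

# Assumption: each record ends with {<code>:"<message>"}, as in the two
# lines quoted above. The pattern is anchored at the end of the line.
STATUS_RE = re.compile(r'\{(\d+):"(.*)"\}\s*$')


def parse_status(line):
    """Return (code, message) from the trailing status block, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def is_failed(line):
    """True when the record carries a non-zero status code."""
    status = parse_status(line)
    return status is not None and status[0] != 0


# Only the tails of the two records quoted above, shortened for readability.
finished = 'icecube:pfdst@osm 1810144 0 {0:""}'
cancelled = ('icecube:pfdst@osm 1810481 0 '
             '{451:"Aborting transfer due to session termination"}')

print(parse_status(finished))
print(parse_status(cancelled))
print(is_failed(finished), is_failed(cancelled))
```

Counting the records per timestamp for which `is_failed` is true would show whether the cancellations cluster around a single successful transfer.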

I guess GO uses a GridFTP feature that dCache doesn't support (pipelining?). Any idea how to interoperate smoothly with GO?

paulmillar commented 6 years ago

Could you find the FTP client (GO) operations for the successful transfer? The access log file should contain all operations and dCache's response. A simple grep using the session (door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain) should provide the commands.
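If grep is not at hand, the same filtering can be sketched in a few lines of Python. This only does a plain substring match on the session identifier, exactly as grep would; it assumes nothing about the layout of the access log, and the sample lines and the file path in the comment are hypothetical.

```python
def lines_for_session(lines, session):
    """Keep only the lines that mention the given session identifier."""
    return [line.rstrip("\n") for line in lines if session in line]


SESSION = "door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain"

# Hypothetical sample lines; a real access log will look different.
sample = [
    "entry-1 session=door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain\n",
    "entry-2 session=door:GFTP-plum15-AAVaAuwI6ng@gridftp-plum15Domain\n",
    "entry-3 session=door:GFTP-plum15-AAVaAuwI7mA@gridftp-plum15Domain\n",
]

for line in lines_for_session(sample, SESSION):
    print(line)

# With a real file (the path is a placeholder, not a known dCache location):
# with open("/path/to/access.log") as handle:
#     matching = lines_for_session(handle, SESSION)
```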

ahaupt commented 6 years ago

Hi Paul,

I'm not so familiar with GO ... I cannot find the failed transfer I mentioned here, but this one looks identical and failed at the same time:

Error (transfer)

Endpoint: IceCube Gridftp Scratch (aec5c658-f77d-11e6-ba7f-22000b9a448b)
Server: gridftp-scratch.icecube.wisc.edu:2811
File: /mnt/tank/jade/bundles1/data/exp/IceCube/2015/unbiased/PFDST/0404/29288523-aad1-40a6-9737-e22824041da5.zip
Command: RETR /mnt/tank/jade/bundles1/data/exp/IceCube/2015/unbiased/PFDST/0404/29288523-aad1-40a6-9737-e22824041da5.zip
Message: Fatal FTP response

Details: 500-Command failed. : callback failed.\r\n500-globus_xio: System error in writev: Broken pipe\r\n500-globus_xio: A system call failed: Broken pipe\r\n500 End.\r\n

The successful attempt only mentions this:

{ "files_succeeded": 1 }

Here is the "grepped" session log from our gridftp door:

gridftp-plum15Domain.access.txt

paulmillar commented 6 years ago

Thanks for the information.

The Broken pipe message is from the Globus FTP server, not from dCache. My guess is that the FTP client (GO) disconnected from dCache, which aborted the transfer. This resulted in dCache tearing down the data connections, triggering the error message you see.

Unfortunately, the access log you found is almost certainly not from the connection that experienced the problem. Certainly, there is no indication of a problem in that access log file.

Instead, it shows the FTP client (GO) disconnecting shortly after starting a new transfer, apparently unprovoked.

I have seen this behaviour before. It comes from the recovery procedure GO uses, where it disconnects all FTP connections when there is a problem with any connection; therefore, it is quite likely that the problem was with some other FTP connection, either on the same GridFTP door or on another GridFTP door.

You could try restricting GO to making a single transfer at any time and try to recreate the problem there. This should make it easier to discover why GO is aborting.

paulmillar commented 6 years ago

Would it be possible to try the latest dCache version (3.2)? Well, it's not yet released, but we're just putting together the release notes.

This has a couple of features that GO requires (dynamic checksum calculation; command pipelining).

Perhaps you could set up a small test system just to demonstrate whether GO works better with this version of dCache.

ahaupt commented 6 years ago

Hi Paul,

Upgrading our test system once version 3.2 is released should not be much of a problem. But getting a firewall exception for that system is one ... Any idea how to simulate the GO client with e.g. globus-url-copy? GO looks like a "black-box client" to me ...

Is prometheus.desy.de public so that we could use it for compatibility tests?

Thanks! Andreas

paulmillar commented 6 years ago

If you like, you can take one of the latest 3.2 pre-release builds and try that:

https://ci.dcache.org/view/dCache%203.2/job/dCache-v3.2/

Unfortunately, I'm not sure how to emulate GO with globus-url-copy. In my experiments, I created a virtual machine and ran the GO packaged server there. However, mostly it was a case of observing what GO does when interacting with dCache, instrumenting error cases, and the occasional inspired detective work to understand what was going wrong and get it to work with dCache.

Yes, you can certainly use prometheus for testing -- that's one of its major reasons for existing. Various VOs are already authorised, but I can also create an account specifically for you (tied to your DN). Just drop me an email if that would be useful.

gonzalomerino commented 6 years ago

Hello, this is Gonzalo from IceCube @ UW-Madison.

We have a data archive service here that issues Globus Online transfers to archive data from endpoint A to endpoint B. If you think it might be useful for your testing, we could quite easily direct some arbitrary transfer load to a test endpoint of your choosing.

Gonzalo

paulmillar commented 6 years ago

Hi Gonzalo,

Sorry for the delay in getting back in touch -- what I propose is giving you an account on our test system called 'prometheus'. This would allow a much faster turn-around for getting to the bottom of any problems with GO.

Could you send me the output of

htpasswd -n -m <username> | sed 's/\$/\\\$/g'

(replacing <username> with your preferred username on the system) along with the DN of your X.509 certificate -- preferably via email.
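For anyone curious what the `sed` step does: it puts a backslash in front of every `$` in the `htpasswd` output. A rough Python equivalent is sketched below; the password entry is made up for illustration, and the sketch does not generate a hash itself.

```python
def escape_dollars(entry):
    """Prefix every '$' with a backslash, like: sed 's/\\$/\\\\\\$/g'."""
    return entry.replace("$", "\\$")


# Made-up entry in the "username:$apr1$salt$hash" shape that
# "htpasswd -n -m" prints; not a real credential.
example = "alice:$apr1$abcdefgh$0123456789abcdefghijkl"

print(escape_dollars(example))
```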

Cheers,

Paul.

gonzalomerino commented 6 years ago

Hello Paul,

Are there any updates on the debugging of this issue?

thanks! Gonzalo

paulmillar commented 6 years ago

My apologies for the delay in replying.

There was an unresolved issue with the update that prevented me from updating dCache so it supports the Globus transfer-service. It turns out the problem was not with the patch, but with the existing dCache code. That problem is now fixed, so I've deployed the patch.

Currently the patch is in our 'master' branch. This means it appears in the prometheus test system right now, so you should be able to verify that it works there.

We will back-port the fix to our stable branches, going back to dCache v3.2. It's too late to do that for this release cycle (due out tomorrow), but it should be available as part of the next release cycle (due next Tuesday: 2018-03-06).

gonzalomerino commented 6 years ago

Nice. Thanks!

gonzalomerino commented 6 years ago

Hello Paul,

No problem. Thanks for looking into this.

I tried to submit a sync transfer of a few hundred files from UW-Madison to Prometheus yesterday.

The transfer eventually completed, but a few errors appeared at around midnight, Central US Time.

I paste some of these errors below. I don't know if they are relevant, or whether you can correlate them with errors in the logs on your side. Tell me if you see something, or if you would like me to run a specific test.


2018-03-15 12:01 am unknown error

Error (transfer)
Endpoint: Prometheus DESY test server (eca64ab0-b811-11e7-b125-22000a92523b)
Server: prometheus.desy.de:2811
File: /Users/gmerino/0101/PFFilt_PhysicsFiltering_Run00129005_Subrun00000000_00000260.tar.bz2
Command: CKSM MD5 0 -1 /Users/gmerino/0101/PFFilt_PhysicsFiltering_Run00129005_Subrun00000000_00000260.tar.bz2
Message: Fatal FTP response

Details: 550 Error retrieving /Users/gmerino/0101/PFFilt_PhysicsFiltering_Run00129005_Subrun00000000_00000260.tar.bz2: Transfer was forcefully killed\r\n
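For reference, the failing command asks the server for an MD5 checksum of the file. The sketch below shows the matching calculation on the client side. It assumes the two numbers after the algorithm name are an offset and a length, with -1 meaning "to the end of the file"; the function name and the sample data are made up for illustration.

```python
import hashlib
import io


def md5_range(stream, offset=0, length=-1, chunk_size=1 << 20):
    """MD5 hex digest of `length` bytes starting at `offset`.

    Assumption: a negative length means "read to the end of the stream".
    """
    digest = hashlib.md5()
    stream.seek(offset)
    remaining = length
    while remaining != 0:
        want = chunk_size if remaining < 0 else min(chunk_size, remaining)
        chunk = stream.read(want)
        if not chunk:
            break  # reached the end of the stream
        digest.update(chunk)
        if remaining > 0:
            remaining -= len(chunk)
    return digest.hexdigest()


# Stand-in data; a real check would open the transferred file in "rb" mode.
data = io.BytesIO(b"example payload")
print(md5_range(data, 0, -1))
```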


2018-03-15 12:02 am connection failed

{
  "context": [
    {
      "endpoint": "Prometheus DESY test server (eca64ab0-b811-11e7-b125-22000a92523b)",
      "operation": "File Transfer - Capability Check"
    }
  ],
  "error": {
    "details": "Error (connect)\nEndpoint: Prometheus DESY test server (eca64ab0-b811-11e7-b125-22000a92523b)\nServer: prometheus.desy.de:2811\nMessage: Could not connect to server\n---\nDetails: globus_xio: Unable to connect to prometheus.desy.de:2811\nglobus_xio: System error in connect: Connection refused\nglobus_xio: A system call failed: Connection refused\n\n",
    "type": "GSHError"
  }
}

paulmillar commented 6 years ago

Thanks for doing this testing, Gonzalo.

Every day, at 06:00 CET/CEST, prometheus is wiped clean and reinstalled from scratch. This isn't normal dCache behaviour -- it's something special to prometheus, as it always has the latest dCache version.

I believe that (currently) this time corresponds to midnight in Central US time. Looking at the logs, I see the Globus transfer service connections (acting on your behalf), starting 2018-03-15T05:13:00.669+0100, with the last one connecting 2018-03-15T08:07:40.427+0100.

So, I believe this explains the "few errors" you described.
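The time-zone arithmetic can be checked with a short Python sketch. It assumes `Europe/Berlin` for the prometheus host and `America/Chicago` for Central US time; both zone names are my choice for illustration, and the two timestamps are the ones quoted above.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumed zone names: Europe/Berlin for prometheus, America/Chicago for
# Central US time.
CENTRAL_US = ZoneInfo("America/Chicago")


def to_central_us(naive_local, source_zone="Europe/Berlin"):
    """Interpret a naive wall-clock time in source_zone, convert to Central US."""
    aware = naive_local.replace(tzinfo=ZoneInfo(source_zone))
    return aware.astimezone(CENTRAL_US)


# The daily wipe at 06:00 local time on the day of the test.
wipe = to_central_us(datetime(2018, 3, 15, 6, 0))
print(wipe.isoformat())  # 2018-03-15T00:00:00-05:00

# First and last Globus connections, as quoted from the logs.
LOG_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"
first = datetime.strptime("2018-03-15T05:13:00.669+0100", LOG_FORMAT)
last = datetime.strptime("2018-03-15T08:07:40.427+0100", LOG_FORMAT)
print(first.astimezone(CENTRAL_US).isoformat())
print(last.astimezone(CENTRAL_US).isoformat())
```

On that date the US had already switched to daylight saving time while Germany had not, so 06:00 on prometheus falls exactly on midnight Central US time, inside the window in which the transfers were running.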

Could you retry the transfers, starting them somewhat earlier, to try and avoid midnight?

paulmillar commented 3 years ago

There doesn't seem to have been much progress on this ticket.

To be clear, I believe this problem is solved. I am able to transfer many files between two dCache instances using Globus.