dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

LinkLocal addresses used by xroot checksum redirection (failing xroot uploads) #6884

Closed vokac closed 1 year ago

vokac commented 1 year ago

Uploading a file with checksum validation over the xroot protocol fails with "Invalid redirect URL", because dCache tries to redirect the client to the link-local IPv6 address of the dCache headnode. Why does dCache even consider link-local addresses for use during transfers?

$ xrdfs root://se1.farm.particle.cz:1094 rm /dteam/test.chksum
$ XRD_LOGLEVEL=Dump xrdcopy --cksum ADLER32:print /etc/services root://se1.farm.particle.cz:1094//dteam/test.chksum
...
[2022-11-27 18:24:54.705785 +0100][Debug  ][Utility           ] Attempting checksum calculation, mode: target.
[2022-11-27 18:24:54.705828 +0100][Dump   ][Utility           ] URL: root://ds31.farm.particle.cz:22346//dteam/test.chksum?org.dcache.uuid=a72e7bd8-fb81-4e7a-966a-217982ca6ea5&org.dcache.xrootd.client=vokac.87301@ui2.farm.particle.cz&oss.asize=670293&xrd.logintoken=org.dcache.door=fe80:0:0:0:5054:ff:fef1:ef79:1094&xrdcl.requuid=9171bf49-6579-44a2-a5b2-540b411252be
...
[2022-11-27 18:24:54.705961 +0100][Dump   ][FileSystem        ] [0xcd5640@ds31.farm.particle.cz:22346] Sending kXR_query (code: kXR_Qcksum, arg length: 36)
[2022-11-27 18:24:54.705970 +0100][Dump   ][XRootD            ] [ds31.farm.particle.cz:22346] Sending message kXR_query (code: kXR_Qcksum, arg length: 36)
[2022-11-27 18:24:54.705982 +0100][Debug  ][ExDbgMsg          ] [ds31.farm.particle.cz:22346] MsgHandler created: 0xcd8170 (message: kXR_query (code: kXR_Qcksum, arg length: 36) ).
[2022-11-27 18:24:54.705995 +0100][Dump   ][PostMaster        ] [ds31.farm.particle.cz:22346] Sending message kXR_query (code: kXR_Qcksum, arg length: 36) (0xcd7a90) through substream 0 expecting answer at 0
[2022-11-27 18:24:54.706071 +0100][Dump   ][AsyncSock         ] [ds31.farm.particle.cz:22346.0] Wrote a message: kXR_query (code: kXR_Qcksum, arg length: 36) (0xcd7a90), 60 bytes
[2022-11-27 18:24:54.706110 +0100][Dump   ][AsyncSock         ] [ds31.farm.particle.cz:22346.0] Successfully sent message: kXR_query (code: kXR_Qcksum, arg length: 36) (0xcd7a90).
[2022-11-27 18:24:54.706127 +0100][Dump   ][XRootD            ] [ds31.farm.particle.cz:22346] Message kXR_query (code: kXR_Qcksum, arg length: 36) has been successfully sent.
[2022-11-27 18:24:54.706135 +0100][Debug  ][ExDbgMsg          ] [ds31.farm.particle.cz:22346] Moving MsgHandler: 0xcd8170 (message: kXR_query (code: kXR_Qcksum, arg length: 36) ) from out-queu to in-queue.
[2022-11-27 18:24:54.706143 +0100][Dump   ][PostMaster        ] [ds31.farm.particle.cz:22346.0] All messages consumed, disable uplink
[2022-11-27 18:24:54.706476 +0100][Dump   ][XRootDTransport   ] [msg: 0xfc000960] Expecting 32 bytes of message body
[2022-11-27 18:24:54.706497 +0100][Dump   ][AsyncSock         ] [ds31.farm.particle.cz:22346.0] Received message header for 0xfc000960 size: 8
[2022-11-27 18:24:54.706505 +0100][Debug  ][ExDbgMsg          ] [msg: 0xfc000960] Assigned MsgHandler: 0xcd8170.
[2022-11-27 18:24:54.706512 +0100][Debug  ][ExDbgMsg          ] [handler: 0xcd8170] Removed MsgHandler: 0xcd8170 from the in-queue.
[2022-11-27 18:24:54.706524 +0100][Dump   ][AsyncSock         ] [ds31.farm.particle.cz:22346.0] Received message 0xfc000960 of 40 bytes
[2022-11-27 18:24:54.706532 +0100][Dump   ][PostMaster        ] [ds31.farm.particle.cz:22346] Handling received message: 0xfc000960.
[2022-11-27 18:24:54.706623 +0100][Dump   ][XRootD            ] [ds31.farm.particle.cz:22346] Got kXR_redirect response to message kXR_query (code: kXR_Qcksum, arg length: 36): fe80:0:0:0:5054:ff:fef1:ef79, port 1094
[2022-11-27 18:24:54.706683 +0100][Error  ][XRootD            ] [ds31.farm.particle.cz:22346] Got invalid redirection URL: fe80:0:0:0:5054:ff:fef1:ef79
[2022-11-27 18:24:54.706703 +0100][Debug  ][ExDbgMsg          ] [ds31.farm.particle.cz:22346] Calling MsgHandler: 0xcd8170 (message: kXR_query (code: kXR_Qcksum, arg length: 36) ) with status: [ERROR] Invalid redirect URL.
...
Run: [ERROR] Invalid redirect URL:  Got an error while querying the checksum! (destination)
...
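
A side note on the door value in the log: fe80:0:0:0:5054:ff:fef1:ef79:1094 joins an unbracketed IPv6 literal and a port with a colon, so a parser cannot tell where the address ends. URI syntax (RFC 3986) expects IPv6 literals to be enclosed in brackets. The Java sketch below shows the unambiguous form. `HostPort.format` is a hypothetical helper written for illustration, not dCache or XRootD code, and nothing in this log establishes that bracketing alone would satisfy the client; a link-local address would stay unreachable from other networks in any case.

```java
final class HostPort {

    // Joins host and port; a bare IPv6 literal (contains ':' and is not
    // already bracketed) is wrapped in [] so the port separator is unambiguous.
    static String format(String host, int port) {
        boolean bareIpv6 = host.indexOf(':') >= 0 && !host.startsWith("[");
        return (bareIpv6 ? "[" + host + "]" : host) + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(format("fe80:0:0:0:5054:ff:fef1:ef79", 1094));
        System.out.println(format("se1.farm.particle.cz", 1094));
    }
}
```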

It seems to me that dCache uses the first discovered interface address, which is an IPv6 link-local address in our case. For example, with

import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

public class App {
    public static void main(String[] args) throws Exception {
        Enumeration<NetworkInterface> interfaces =
        NetworkInterface.getNetworkInterfaces();
        while (interfaces.hasMoreElements()) {
            NetworkInterface i = interfaces.nextElement();
            try {
                if (i.isUp() && !i.isLoopback()) {
                    Enumeration<InetAddress> e = i.getInetAddresses();
                    while (e.hasMoreElements()) {
                        InetAddress address = e.nextElement();
                        String name = address.getCanonicalHostName();
                        if (address instanceof Inet6Address) {
                            int ii = name.indexOf('%');
                            if (ii > 0) {
                                name = name.substring(0, ii);
                            }
                        }
                        System.out.println(address);
                        System.out.printf("  name: %s\n", name);
                        System.out.printf("  addr: %s\n", InetAddress.getByAddress(name, address.getAddress()));
                        System.out.printf("  linklocal: %b\n", address.isLinkLocalAddress());
                    }
                }
            } catch (SocketException e) {
                System.out.printf("Not publishing NIC %s: %s%n", i.getName(), e.getMessage());
            }
        }
    }
}

I get the following output on se1.farm.particle.cz:

/fe80:0:0:0:5054:ff:fef1:ef79%eth1
  name: fe80:0:0:0:5054:ff:fef1:ef79
  addr: fe80:0:0:0:5054:ff:fef1:ef79/fe80:0:0:0:5054:ff:fef1:ef79
  linklocal: true
/2001:718:401:6017:2:0:0:1000%eth1
  name: se1.farm.particle.cz
  addr: se1.farm.particle.cz/2001:718:401:6017:2:0:0:1000
  linklocal: false
/172.16.0.100
  name: se1.farm.particle.cz
  addr: se1.farm.particle.cz/172.16.0.100
  linklocal: false
/fe80:0:0:0:5054:ff:fe8e:c89a%eth0
  name: fe80:0:0:0:5054:ff:fe8e:c89a
  addr: fe80:0:0:0:5054:ff:fe8e:c89a/fe80:0:0:0:5054:ff:fe8e:c89a
  linklocal: true
/147.231.25.100
  name: se1.farm.particle.cz
  addr: se1.farm.particle.cz/147.231.25.100
  linklocal: false
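
To illustrate what I would expect instead, here is a minimal sketch that skips loopback, wildcard and link-local addresses before taking the first candidate. The class and method names (`AdvertisedAddress`, `firstUsable`, `literal`) are made up for this example and are not taken from the dCache code base; only the `java.net.InetAddress` predicates are real API.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Optional;

final class AdvertisedAddress {

    // Parses an IP literal; for a valid literal no DNS lookup is performed.
    static InetAddress literal(String ip) {
        try {
            return InetAddress.getByName(ip);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + ip, e);
        }
    }

    // Returns the first candidate that is neither loopback, wildcard nor link-local.
    static Optional<InetAddress> firstUsable(List<InetAddress> candidates) {
        return candidates.stream()
                .filter(a -> !a.isLoopbackAddress())
                .filter(a -> !a.isAnyLocalAddress())
                .filter(a -> !a.isLinkLocalAddress())
                .findFirst();
    }

    public static void main(String[] args) {
        // Addresses of eth1 in the order Java reported them above.
        List<InetAddress> eth1 = List.of(
                literal("fe80:0:0:0:5054:ff:fef1:ef79"),
                literal("2001:718:401:6017:2:0:0:1000"),
                literal("172.16.0.100"));
        System.out.println(firstUsable(eth1));
    }
}
```

With the eth1 addresses listed above, the link-local entry is skipped and the global IPv6 address is the first one left.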

DmitryLitvintsev commented 1 year ago

Hi Petr,

Just for kicks (not as a real solution): what happens if you specify an IPv6-to-host mapping in /etc/hosts on the door node?

Thanks, Dmitry

vokac commented 1 year ago

No difference.

I guess Java also uses the normal NETLINK interface on Linux, but Java returns interfaces/addresses in reverse order compared to this getifaddrs example:

[root@se1.farm.particle.cz /tmp]# ./a.out 
Internet Address:  (null) 
LineDescription :  lo 
Broadcast Addr  :  (null) 

Internet Address:  (null) 
LineDescription :  eth0 
Broadcast Addr  :  (null) 

Internet Address:  (null) 
LineDescription :  eth1 
Broadcast Addr  :  (null) 

Internet Address:  127.0.0.1 
LineDescription :  lo 
Netmask         :  255.0.0.0 
Broadcast Addr  :  127.0.0.1 

Internet Address:  147.231.25.100 
LineDescription :  eth0 
Netmask         :  255.255.255.0 
Broadcast Addr  :  147.231.25.255 

Internet Address:  172.16.0.100 
LineDescription :  eth1 
Netmask         :  255.255.0.0 
Broadcast Addr  :  172.16.255.255 

Internet Address:  ::1 
LineDescription :  lo 
Netmask         :  ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff 

Internet Address:  fe80::5054:ff:fe8e:c89a 
LineDescription :  eth0 
Netmask         :  ffff:ffff:ffff:ffff:: 

Internet Address:  2001:718:401:6017:2::1000 
LineDescription :  eth1 
Netmask         :  ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff 

Internet Address:  fe80::5054:ff:fef1:ef79 
LineDescription :  eth1 
Netmask         :  ffff:ffff:ffff:ffff:: 
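
Since the enumeration order evidently differs between APIs, a selection that depends on "first address wins" is fragile. A sketch of an order-independent alternative is to rank the addresses by scope and sort them. The ranking below (global, then site-local, then link-local, then loopback) and the names `AddressOrder`, `rank` and `sorted` are assumptions for illustration only, not what dCache does.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AddressOrder {

    // Parses an IP literal; for a valid literal no DNS lookup is performed.
    static InetAddress literal(String ip) {
        try {
            return InetAddress.getByName(ip);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP literal: " + ip, e);
        }
    }

    // Lower rank means more suitable to hand out to a remote client.
    static int rank(InetAddress a) {
        if (a.isLoopbackAddress()) return 3;
        if (a.isLinkLocalAddress()) return 2;
        if (a.isSiteLocalAddress()) return 1;
        return 0;
    }

    // Sorts by rank, then by textual address so the result does not depend
    // on the order in which the operating system or JVM listed the addresses.
    static List<InetAddress> sorted(List<InetAddress> addresses) {
        List<InetAddress> out = new ArrayList<>(addresses);
        out.sort(Comparator.comparingInt(AddressOrder::rank)
                .thenComparing(InetAddress::getHostAddress));
        return out;
    }

    public static void main(String[] args) {
        List<InetAddress> javaOrder = List.of(
                literal("fe80:0:0:0:5054:ff:fef1:ef79"),
                literal("2001:718:401:6017:2:0:0:1000"),
                literal("172.16.0.100"),
                literal("fe80:0:0:0:5054:ff:fe8e:c89a"),
                literal("147.231.25.100"));
        sorted(javaOrder).forEach(a -> System.out.println(rank(a) + " " + a.getHostAddress()));
    }
}
```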

DmitryLitvintsev commented 1 year ago

Another question - if checksum verification is not requested - does it work? (just narrowing the space).

vokac commented 1 year ago

Yes, a transfer without checksum verification works. In fact, the file is transferred even in the failing case, and the checksum query also works when it is executed as a separate command, e.g.

$ xrdfs root://se1.farm.particle.cz:1094 rm /dteam/test.chksum

$ xrdcopy --cksum ADLER32:print /etc/services root://se1.farm.particle.cz:1094//dteam/test.chksum
[654.6kB/654.6kB][100%][==================================================][654.6kB/s]  
Run: [ERROR] Invalid redirect URL:  Got an error while querying the checksum! (destination)

$ xrdfs root://se1.farm.particle.cz:1094 query checksum /dteam/test.chksum
adler32 6408a0a8

Fortunately, a Rucio file upload doesn't ask for a checksum during the copy operation; it runs two independent gfal2 operations (copy, then checksum), so this issue doesn't affect our production workflows.
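
For anyone who wants to cross-check the value returned by the separate `query checksum` call against the local file, here is a small sketch that computes ADLER32 with the JDK's `java.util.zip.Adler32` and prints it as eight lowercase hex digits, the same form xrdfs shows. The class name `Adler32Hex` is just for this example, and the default input string is only a demo value used when no file path is given.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Adler32;

final class Adler32Hex {

    // Returns the ADLER32 checksum of the data as 8 lowercase hex digits.
    static String adler32Hex(byte[] data) {
        Adler32 sum = new Adler32();
        sum.update(data, 0, data.length);
        return String.format("%08x", sum.getValue());
    }

    public static void main(String[] args) throws IOException {
        // Pass a file path (e.g. /etc/services) to checksum that file;
        // without arguments a short demo string is used instead.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "Wikipedia".getBytes(StandardCharsets.US_ASCII);
        System.out.println("adler32 " + adler32Hex(data));
    }
}
```

This reads the whole file into memory, which is fine for a small test file such as /etc/services but not for large data files.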

alrossi commented 1 year ago

Hi Petr,

Might I ask you what version of the xrootd client you are using?

Thanks, Al


Albert L. Rossi Senior Software Developer Scientific Computing Division, Scientific Data Services, Distributed Data Development WH 566 Fermi National Accelerator Laboratory Batavia, IL 60510 (630) 840-3023



vokac commented 1 year ago

Latest version from EPEL7

$ rpm -qa xrootd-client
xrootd-client-5.5.1-1.el7.x86_64

alrossi commented 1 year ago

Petr, I have a potential fix.

If I send you a link to the .jar file would you be able to deploy and test?

do:

rm -f /usr/share/dcache/classes//dcache-xrootd-*.jar

drop the new .jar into that directory

and restart the domain.

Thanks - Al

alrossi commented 1 year ago

What version of dcache are you running?

vokac commented 1 year ago

Sure, I can try your update. 8.2.2 on the headnode (se1.farm.particle.cz) and 8.2.0 on door/pool nodes

alrossi commented 1 year ago

Sent you the link from my Google drive.

vokac commented 1 year ago

With your dcache-xrootd-8.2.6-SNAPSHOT.jar, the xroot transfer with checksum works. Details:

$ XRD_LOGLEVEL=Dump xrdcopy --cksum ADLER32:print /etc/services root://se1.farm.particle.cz:1094//dteam/test.chksum
[2022-11-28 23:00:56.011449 +0100][Debug  ][Utility           ] Initializing xrootd client version: v5.5.1
...
[2022-11-28 23:00:57.024942 +0100][Debug  ][Utility           ] Attempting checksum calculation, mode: target.
[2022-11-28 23:00:57.025007 +0100][Dump   ][Utility           ] URL: root://dpmpool21.farm.particle.cz:23398//dteam/test.chksum?org.dcache.uuid=790b66ef-ba1b-499d-a386-eacfd7768319&org.dcache.xrootd.client=vokac.23233@ui2.farm.particle.cz&oss.asize=670293&xrd.logintoken=org.dcache.door=se1.farm.particle.cz:1094&xrdcl.requuid=36048a63-fb7e-4b1c-aa7e-f55668c78782
...
[2022-11-28 23:00:57.035275 +0100][Dump   ][XRootD            ] [dpmpool21.farm.particle.cz:23398] Got kXR_redirect response to message kXR_query (code: kXR_Qcksum, arg length: 36): se1.farm.particle.cz, port 1094
...
[2022-11-28 23:00:57.044068 +0100][Dump   ][FileSystem        ] [0x948ea0@dpmpool21.farm.particle.cz:23398] Assigning dpmpool21.farm.particle.cz:23398 as last URL
[2022-11-28 23:00:57.044115 +0100][Debug  ][XRootD            ] Redirect trace-back:
[2022-11-28 23:00:57.044115 +0100][Debug  ][XRootD            ]         0. Redirected from: root://dpmpool21.farm.particle.cz:23398//dteam/test.chksum to: root://se1.farm.particle.cz:1094/
[2022-11-28 23:00:57.044130 +0100][Debug  ][ExDbgMsg          ] [se1.farm.particle.cz:1094] Destroying MsgHandler: 0x94bd80.
[2022-11-28 23:00:57.044153 +0100][Dump   ][Utility           ] Checksum for /dteam/test.chksum checksum: adler32:6408a0a8
[2022-11-28 23:00:57.044294 +0100][Debug  ][File              ] [0x947da0@file://localhost/etc/services?xrdcl.requuid=817c273c-1794-4645-baef-6dd14bcbbba4] Sending a close command for handle 0xe to localhost
[2022-11-28 23:00:57.044397 +0100][Debug  ][File              ] [0x947da0@file://localhost/etc/services?xrdcl.requuid=817c273c-1794-4645-baef-6dd14bcbbba4] Close returned from localhost with: [SUCCESS] 
[2022-11-28 23:00:57.044422 +0100][Dump   ][File              ] [0x947da0@file://localhost/etc/services?xrdcl.requuid=817c273c-1794-4645-baef-6dd14bcbbba4] Items in the fly 0, queued for recovery 0
adler32: 6408a0a8 root://se1.farm.particle.cz:1094/dteam/test.chksum 670293
...

alrossi commented 1 year ago

Great. Thanks Petr.

(I'll be looking at the other problem with xroots / ls soon).

alrossi commented 1 year ago

https://rb.dcache.org/r/13798/