dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogeneous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

XRootD IPV6 on proxy mode #6875

Closed cfgamboa closed 1 year ago

cfgamboa commented 2 years ago

Dear all,

It seems that XRootD proxy mode is not redirecting IPv6 client requests via the xrootd.net.internal interface.

I am using dCache 8.2.4

The request works if I specify the IPv4 address:
[cgamboa@lxplus732 ~]$ xrdcp root://192.12.15.15:1096/pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3 /tmp/test.1 -f
[4.738MB/4.738MB][100%][==================================================][1.579MB

If IPv6 is used to issue the request, the transfer fails:

[cgamboa@lxplus732 ~]$ xrdcp root://dcqos002.usatlas.bnl.gov:1096/pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3 /tmp/test.1 -f
[0B/0B][100%][==================================================][0B/s]  
Run: [ERROR] Server responded with an error: [3012] Failed to open file (General problem: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f [666])
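
For what it's worth, the client can presumably also be pinned to IPv4 without hard-coding the address (assuming a recent XRootD client, which honors the XRD_NETWORKSTACK environment variable):

XRD_NETWORKSTACK=IPv4 xrdcp root://dcqos002.usatlas.bnl.gov:1096/pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3 /tmp/test.1 -f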

An extract from the xroot door log:

21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] XrootdSessionHandler.getResponse: Request 3010
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] [id: 0xda694968, L:/2620:0:210:1:0:0:0:f:1096 - R:/2001:1458:d00:1:0:0:100:42f:50906] READ COMPLETE
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] open[0,1104,pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3,] -- not a third-party request.
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Opening pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3 for read
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] open flags: options to apply for open path (raw=1104 ): kXR_async kXR_open_read kXR_retstat
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] mode to apply to open path: --- --- ---
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] OPAQUE : {}
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] tried null, triedrc null, ignoring.
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Status: PnfsManager: Fetching storage info
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Status: PoolManager: Selecting pool
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Status: Pool PoolName=dcqos005_4 PoolAddress=dcqos005_4@dcqos005fourDomain: Creating mover
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Status: Mover PoolName=dcqos005_4 PoolAddress=dcqos005_4@dcqos005fourDomain/134217778: Waiting for redirect
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg dcqos005_4 DoorTransferFinished 000055A20830F08D410CABBBA29EC974CC27] Transfer 000055A20830F08D410CABBBA29EC974CC27@PoolName=dcqos005_4 PoolAddress=dcqos005_4@dcqos005fourDomain failed: General problem: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f (error code=666)
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] Xrootd-Error-Response: [session B83FE96F5CF41DE8BB9EE417872BCE3E][connection [id: 0xda694968, L:/2620:0:210:1:0:0:0:f:1096 - R:/2001:1458:d00:1:0:0:100:42f:50906]][request 3010 kXR_open](error 3012, kXR_ServerError, Failed to open file (General problem: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f [666])).
21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg] [id: 0xda694968, L:/2620:0:210:1:0:0:0:f:1096 - R:/2001:1458:d00:1:0:0:100:42f:50906] WRITE: error[3012,Failed to open file (General problem: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f [666])]

The pool sees the external client IP, not the proxy door IP; see below:

21 Nov 2022 14:32:53 (dcqos005_4) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg Xrootd-dcqos002-proxy PoolDeliverFile 000055A20830F08D410CABBBA29EC974CC27] Transfer failed: java.net.SocketException: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f

The transfer should use the internal interface, configured as:

[xrootd-${host.name}Domain]
[xrootd-${host.name}Domain/xrootd]
xrootd.cell.name=Xrootd-${host.name}-proxy
xrootd.net.port=1096
xrootd.net.proxy-transfers=true
xrootd.net.internal=10.42.38.49

Could you please advise?

All the best, Carlos

alrossi commented 2 years ago

Hi Carlos,

I am currently deep into solving a set of issues which are rather urgent in an area completely unrelated to xrootd, so I cannot address this quickly. I will try to get to it as soon as I can.

Al

cfgamboa commented 2 years ago

Thank you Al.

alrossi commented 2 years ago

If you could, in the meantime, open an RT ticket for this with the full client -d3 output.
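
For example, something like this (assuming xrdcp's -d debug flag, reusing the failing transfer from above):

xrdcp -d 3 -f root://dcqos002.usatlas.bnl.gov:1096/pnfs/usatlas.bnl.gov/qostape/testbatchtest/test_carlos_new_archiver_pool_old_pool_3 /tmp/test.1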

Thanks.

Al

DmitryLitvintsev commented 2 years ago

Hi Carlos,

Real quick question: do WebDAV and GridFTP work in the same situation?

Dmitry

cfgamboa commented 2 years ago

No, WebDAV works fine.

alrossi commented 2 years ago

Carlos, still working on these other unrelated issues, so I wanted to tell you I will probably not have time to look at this until next week.

In the meantime: if you could collect that client debug log I asked for, that would be helpful.

Also, when you say you are using 8.2.4, did you ever use a prior version of 8.2 (like 8.2.1) which worked for this? I'm surprised you are discovering this now, when I thought the initial testing was positive. Or did you switch network configurations? I just want to know whether this worked before, or whether it has never worked.

cfgamboa commented 2 years ago

Hello. The use case here is for a pool that does not support IPv6. Is there any change in dCache from 8.2.2 to 8.2.4 that could contribute to this?

All the best, Carlos

alrossi commented 2 years ago

Certainly no change to xroot; whether there was some other change remains to be seen. Did this work for 8.2.2?

Al

cfgamboa commented 1 year ago

The test and deployment of 8.2.2 was for a dCache instance with a different use case. There the pools have a dual IPv4/IPv6 stack.

alrossi commented 1 year ago

What did you set

xrootd.net.internal=

to?

alrossi commented 1 year ago

21 Nov 2022 14:32:53 (Xrootd-dcqos002-proxy) [door:Xrootd-dcqos002-proxy@xrootd-dcqos002Domain:AAXuAB76Cfg dcqos005_4 DoorTransferFinished 000055A20830F08D410CABBBA29EC974CC27] Transfer 000055A20830F08D410CABBBA29EC974CC27@PoolName=dcqos005_4 PoolAddress=dcqos005_4@dcqos005fourDomain failed: General problem: Unable to find address that faces lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f (error code=666)

I'm confused. This looks like the PoolManager is trying to match the pool to the client. The client should not be connecting to the pool; the choice of pool should be made on the basis of the door, not the client. I am rather surprised this is happening.

It really would be helpful if I could see the client logs and also your door configuration.

Thanks.

cfgamboa commented 1 year ago

Al

Please take a look at the configuration posted before:

[xrootd-${host.name}Domain]
[xrootd-${host.name}Domain/xrootd]
xrootd.cell.name=Xrootd-${host.name}-proxy
xrootd.net.port=1096
xrootd.net.proxy-transfers=true
xrootd.net.internal=10.42.38.49

alrossi commented 1 year ago

Ah yes, sorry, lost in the delay.

cfgamboa commented 1 year ago

The pool is running on IPv4 only:

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 10.42.64.26  netmask 255.255.254.0  broadcast 10.42.65.255
        ether b8:59:9f:3a:38:34  txqueuelen 1000  (Ethernet)
        RX packets 4177676  bytes 1167537379 (1.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8953448  bytes 8660230360 (8.0 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

cfgamboa commented 1 year ago

The door interfaces are shown below; it appears that the door is using its IPv6 address to interact with the internal pool, which is IPv4 only.

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 192.12.15.15  netmask 255.255.255.0  broadcast 192.12.15.255
        inet6 fe80::9a03:9bff:fe89:e1fe  prefixlen 64  scopeid 0x20<link>
        inet6 2620:0:210:1::f  prefixlen 64  scopeid 0x0<global>
        ether 98:03:9b:89:e1:fe  txqueuelen 1000  (Ethernet)
        RX packets 47225761  bytes 4285494676 (3.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1326621  bytes 712014232 (679.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
        inet 10.42.38.49  netmask 255.255.255.0  broadcast 10.42.38.255
        inet6 2620:0:210:8803::49  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::9a03:9bff:fe04:736  prefixlen 64  scopeid 0x20<link>
        ether 98:03:9b:04:07:36  txqueuelen 1000  (Ethernet)
        RX packets 161305582  bytes 212950321177 (198.3 GiB)
        RX errors 0  dropped 2  overruns 0  frame 0
        TX packets 30543840  bytes 20273395118 (18.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

alrossi commented 1 year ago

Yes, I understand that. What I don't understand is why the PoolManager is using the client's address to match the pool.

Maybe there is an edge case here which was neglected and disguised in the case of dual stack.

alrossi commented 1 year ago

lxplus732.cern.ch/2001:1458:d00:1:0:0:100:42f is the client, right?

alrossi commented 1 year ago

I think I see what the problem is here. I'll get back to you.

cfgamboa commented 1 year ago

Yes

cfgamboa commented 1 year ago

Thank you.

alrossi commented 1 year ago

Carlos,

I have figured out what the issue is. Let me explain it to you so you can follow why I will need to discuss this with the group tomorrow before taking action.

When the transfer comes into the door, the door has to do, among other things, the following:

  1. ask the PoolManager to select a pool
  2. start the mover on the pool and return the redirect to the client

Now, when we added the internal address, it is that address we use to select the pool in (1) for the proxy. However, given the current state of the code, which is shared across various doors (not just xroot), we decided to continue passing the original client address for (2): when the mover is started, billing is updated, and we wanted the billing entry to reflect the actual user/external client, not the proxy/door.

When the pools are dual stack, this is not a problem. But if the pool does not support IPv6, the mover start fails because it thinks it needs to connect to the client.
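
To illustrate that failure mode, here is a minimal sketch (an illustration only, not the actual dCache code): the pool searches its local interfaces for a non-loopback address in the same protocol family as the peer it believes it must face, and an IPv6 client address on an IPv4-only host leaves no candidate, producing the SocketException from the pool log above.

import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

public class FacingAddress {
    // Return a non-loopback local address in the same protocol family
    // as the peer, mimicking the "address that faces" lookup.
    static InetAddress findAddressFacing(InetAddress peer) throws SocketException {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            for (InetAddress local : Collections.list(nic.getInetAddresses())) {
                boolean sameFamily =
                        (peer instanceof Inet6Address && local instanceof Inet6Address)
                     || (peer instanceof Inet4Address && local instanceof Inet4Address);
                if (sameFamily && !local.isLoopbackAddress()) {
                    return local;
                }
            }
        }
        throw new SocketException("Unable to find address that faces " + peer);
    }

    public static void main(String[] args) throws Exception {
        // The client address from the log; on an IPv4-only host this
        // throws the same SocketException seen on the pool.
        InetAddress client = InetAddress.getByName("2001:1458:d00:1:0:0:100:42f");
        System.out.println(findAddressFacing(client));
    }
}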

There are two possible solutions here:

  1. Pass the internal (door) address to the pool when starting the mover as well. This is a quick fix, but the billing entry would then record the door rather than the original client.
  2. Keep the original client address for the billing entry, but change the shared door code so that the address used to start the mover is decoupled from it.

The second solution is better, of course, but it entails more changes, which are potentially disruptive to other protocols.

Before taking the second solution, however, I would like to get the opinion of the rest of the team.

Can you live with the delay, or do you prefer a quick fix (which will scramble your billing records) and then update to the better fix when it is provided (which would undoubtedly mean moving to the 9.0+ version)?

Cheers, Al

alrossi commented 1 year ago

Carlos,

The team consensus is that we go with the first solution. It is not crucial that the billing record reflect the original client IP: that data can be obtained from the door record, which can be associated with the billing record through door.transaction=billing.initiator.
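
On the billing database, that association could be expressed as a join along these lines (a sketch only; the table and column names here are hypothetical and depend on the billing schema in use):

-- Hypothetical table/column names; adjust to the actual billing schema.
SELECT d.client, b.*
FROM billinginfo b
JOIN doorinfo d ON d.transaction = b.initiator;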

I will be posting a patch soon.

Al

alrossi commented 1 year ago

https://rb.dcache.org/r/13807/