Closed DmitryLitvintsev closed 1 year ago
I might have understood the problem. I enabled DEBUG and I see:
14 Apr 2023 16:24:26 (WebDAV00-fndcadoor01) [] [Producer clientId=WebDAV00-fndcadoor01@webdav-fndcadoor01Domain-1] Connection with lssrv06.fnal.gov/2620:6a:
0:8430:f0:0:240:103 disconnected
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:224)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:526)
at org.apache.kafka.common.network.Selector.poll(Selector.java:481)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:328)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:243)
at java.base/java.lang.Thread.run(Thread.java:829)
so connection fails. Why?
# telnet lssrv06 9092
Trying 2620:6a:0:8430:f0:0:240:103...
telnet: connect to address 2620:6a:0:8430:f0:0:240:103: Connection refused
Trying 131.225.240.103...
Connected to lssrv06.fnal.gov.
Escape character is '^]'.
Looks like IPv6 firewall is blocking port 9092 and since 8.2 dCache prefers IPV6. I just made this hack:
[root@fndcadoor01 ~]# !d
diff -uN /usr/share/dcache/defaults/dcache.properties /usr/share/dcache/defaults/dcache.properties.orig
--- /usr/share/dcache/defaults/dcache.properties 2023-04-14 16:28:07.965725909 -0500
+++ /usr/share/dcache/defaults/dcache.properties.orig 2023-04-14 16:27:58.498739815 -0500
@@ -237,6 +237,7 @@
-XX:+StartAttachListener \
-XX:+UseCompressedOops \
-javaagent:${dcache.paths.classes}/aspectjweaver-1.9.2.jar \
+ -Djava.net.preferIPv6Addresses=system \
${dcache.java.options.common} \
${dcache.java.options.extra}=
restarted and I am getting events into Kafka...
da994201 (Paul Millar 2022-02-15 12:07:33 +0100 200) -Djava.net.preferIPv6Addresses=system \
commit da994201f1e90191030a631c54241c8bbe240565
Author: Paul Millar <paul.millar@desy.de>
Date: Tue Feb 15 12:07:33 2022 +0100
scripts: configure JVM to use OS-supplied IP address order
Motivation:
WLCG wishes to see HTTP-TPC transfers use IPv6 addresses when
transferring data between two dual-stack machines.
Modification:
Update the `java` command-line to specify the system configuration
property `java.net.preferIPv6Addresses`. This property controls in
which order addresses are returned when resolving a hostname. This, in
turn, controls which protocol version (IPv4 or IPv6) is used
preferentially.
There are three valid values: `false`, `true` and `system`. By
selecting `system`, the script has Java honour the order provided by the
OS. The GNU resolver library adopts RFC 3484, and provides admins with
the ability to tweak the behaviour (as needed).
This patch also reverts commit 755ee365e5. That patch provided a
solution to this problem that was safer to back-port. With this patch,
there is no longer any need to have a custom DnsResolver instance.
Result:
dCache now uses the system-supplied ordering of IP addresses when
resolving a host name. On GNU-based systems (e.g., Linux systems), the
order follows RFC 3484 and may be tweaked via the `/etc/gai.conf` file.
This provides admins with better control over which address is used when
dCache establishes network connections with other services; e.g.,
HTTP-TPC pull transfers. In particular, using IPv6 (instead of IPv4)
where possible.
Target: master
Requires-notes: true
Requires-book: no
Patch: https://rb.dcache.org/r/13443/
Acked-by: Tigran Mkrtchyan
After we upgraded to 8.2 we no longer are getting events into Kafka. We have 3 dCache instances. One 7.2 remaining still publishing to Kafka with no issue.
I have tried:
And also I have tried to comment all of the above and use defaults - same results:
I must add that the decision to sprinkle Kafka interaction across services was not the best idea. Trying different configs I have to restart doors and pools ..... Unacceptable in production. If instead we limited Kafka interactions to billing cell, no transfers would be affected while debugging
Also note, that the version we used has print format error that ends up dumping metric ton of data into log files which are filling up log partitions.
Another bit:
Just doing random restores one at a time on ITB, I see events going to Kafka. Production seem to report some of the events. So is not a 100% fail. But it mostly fails.
Why is this bad? We want to rely on storage events for data handling. Right now this is completely broken.
This is 8.2.13
I am yet to deploy format fix that Tigran addd recently.
Correction: No. I see no events from any 8.2 installations except one domain - NFS domain. That one successfully reports to Kafka. No config differences between NFS domain and other doors/pools with regard to Kafka.