deeplearning4j / deeplearning4j-examples

Deeplearning4j Examples (DL4J, DL4J Spark, DataVec)
http://deeplearning4j.konduit.ai

DL4J uses SharedTrainingMaster on spark and reports "ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000" #1060

Closed: byanjie closed this issue 2 years ago

byanjie commented 2 years ago

Issue Description

The nodes have the IPs 10.0.6.201 to 10.0.6.204. In VoidConfiguration I set networkMask=10.0.6.0/16 and the UDP port to 40123. When the job is deployed in Spark standalone mode, it keeps failing with: "Caused by: io.aeron.exceptions.ConfigurationException: ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000". The firewalls on all Spark cluster servers are turned off.

Version Information

Please indicate relevant versions, including, if relevant:

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!

agibsonccc commented 2 years ago

@byanjie tweak the liveness configuration in Aeron itself. The error is right there: "Caused by: io.aeron.exceptions.ConfigurationException: ERROR - publicationUnblockTimeoutNs=15000000000 <= clientLivenessTimeoutNs=30000000000"

You may find all of the relevant aeron overrides here: https://github.com/real-logic/aeron/blob/master/aeron-driver/src/main/java/io/aeron/driver/Configuration.java
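As a rough illustration of what the error is checking: the message indicates that Aeron rejects a configuration where the publication unblock timeout is less than or equal to the client liveness timeout. The sketch below reproduces only that comparison, using the two values from the error message. The class and method names (`AeronTimeouts`, `isValid`, `readNs`) are hypothetical helpers for this example and are not part of Aeron or DL4J. The property names are the ones defined in the linked `Configuration.java`. Where DL4J actually creates the media driver is not shown in this thread, so the sketch assumes the override has to be set in the same JVM before that happens.

```java
public class AeronTimeouts {
    // System property names as defined in Aeron's driver Configuration class.
    public static final String UNBLOCK_PROP = "aeron.publication.unblock.timeout";
    public static final String LIVENESS_PROP = "aeron.client.liveness.timeout";

    /** Mirrors the condition in the error: the unblock timeout must exceed the liveness timeout. */
    public static boolean isValid(long unblockNs, long livenessNs) {
        return unblockNs > livenessNs;
    }

    /** Hypothetical helper: reads a nanosecond value from system properties with a fallback. */
    public static long readNs(String prop, long defaultNs) {
        return Long.getLong(prop, defaultNs);
    }

    public static void main(String[] args) {
        // Values taken from the error message in this issue.
        long reportedUnblockNs = 15_000_000_000L;   // 15 s
        long reportedLivenessNs = 30_000_000_000L;  // 30 s
        System.out.println("reported config valid: "
                + isValid(reportedUnblockNs, reportedLivenessNs)); // false

        // Assumption: the override is set in this JVM before the media driver starts.
        System.setProperty(UNBLOCK_PROP, "60000000000");
        long unblockNs = readNs(UNBLOCK_PROP, reportedUnblockNs);
        System.out.println("after override valid: "
                + isValid(unblockNs, reportedLivenessNs)); // true
    }
}
```

Either raising the unblock timeout above 30 s or lowering the liveness timeout below 15 s would satisfy this comparison. Which one is appropriate for a given cluster is a judgment call this sketch does not settle.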

If you want further support please post over on the community forums: https://community.konduit.ai/ - this repo is not monitored that much.

byanjie commented 2 years ago

But I have already set "setProperty("aeron.publication.unblock.timeout", "60000000000");", and the error is still reported in Spark cluster mode.
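One possible explanation, which this thread does not confirm, is that the property was set only in the driver program. `System.setProperty` affects only the JVM it runs in, so executor JVMs on other nodes would still see the old value of 15000000000. Spark's `spark.executor.extraJavaOptions` setting is one way to pass `-D` flags to executor JVMs. The sketch below only builds that flag string. `ExecutorJavaOptions` and `toJavaOptions` are hypothetical names for this example, and the `SparkConf` call appears in a comment so the snippet runs without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ExecutorJavaOptions {
    /** Renders the given system properties as space-separated -Dkey=value flags. */
    public static String toJavaOptions(Map<String, String> props) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> e : props.entrySet()) {
            joiner.add("-D" + e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the flags in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("aeron.publication.unblock.timeout", "60000000000");

        String opts = toJavaOptions(props);
        // Assumed usage with Spark (not executed here):
        //   sparkConf.set("spark.executor.extraJavaOptions", opts);
        System.out.println(opts); // -Daeron.publication.unblock.timeout=60000000000
    }
}
```

If the error persists with the flag present on the executors, the cause lies elsewhere and the community forum is the better place to follow up.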

agibsonccc commented 2 years ago

@byanjie then please post more details on the community forums. If you want help and are asking for our time here, the least you can do is go where other people can benefit from our discussion. Thanks.