OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0

UDP transport: SAFplus never comes up after being killed #172

Closed hungta closed 8 years ago

hungta commented 8 years ago

Configure the model to use UDP transport and start 2 SC nodes. Both nodes came up. I then killed the safplus_amf process on the active node, and SAFplus never came back up after that.

hoangle commented 8 years ago

Did you enable heartbeat in clTransport.xml? Please take a look at this example: https://github.com/OpenClovis/SAFplus-Availability-Scalability-Platform/tree/master/src/examples/cloud_example

hungta commented 8 years ago

Yes, the heartbeat was defined in clTransport.xml: ` <?xml version="1.0" encoding="UTF-8" standalone="no"?>

UDP libClUDP.so 3000 3000 3

`

hoangle commented 8 years ago

The clTransport.xml is not correct:

  1. Multicast should not be enabled if peerAddrs exist.
  2. Make sure the peer address nodes are able to establish a connection with each other.

Please take a look at the cloud_example model.

hungta commented 8 years ago

I removed the tag and retested. The first time, 2 nodes were up and roles were correct:

On SCNodeI0:

    [aspinfo@SCNodeI0]==> nodes
    NODE      CLASS  AS  CAS  PS  OS  INSTANTIABLE  CLUSTER-MEMBER  ISU  ASU
    SCNodeI0  B      UL  UL   I   E   Y             Y               1    1
    SCNodeI1  B      UL  UL   I   E   Y             Y               1    1
    [aspinfo@SCNodeI0]==> cluster
    NODE-NAME  NODE-TYPE   HA-STATE  NODE-ADDR
    SCNodeI0   controller  active    1          <-- this node
    SCNodeI1   controller  standby   2

On SCNodeI1:

    [aspinfo@SCNodeI1]==> nodes
    NODE      CLASS  AS  CAS  PS  OS  INSTANTIABLE  CLUSTER-MEMBER  ISU  ASU
    SCNodeI0  B      UL  UL   I   E   Y             Y               1    1
    SCNodeI1  B      UL  UL   I   E   Y             Y               1    1
    [aspinfo@SCNodeI1]==> cluster
    NODE-NAME  NODE-TYPE   HA-STATE  NODE-ADDR
    SCNodeI0   controller  active    1
    SCNodeI1   controller  standby   2          <-- this node

Then I killed the AMF process on SCNodeI0. A few moments later, SCNodeI0 came back up, but it seems the 2 nodes did not talk to each other:

On SCNodeI0:

    [aspinfo@SCNodeI0]==> nodes
    NODE      CLASS  AS  CAS  PS  OS  INSTANTIABLE  CLUSTER-MEMBER  ISU  ASU
    SCNodeI0  B      UL  UL   I   E   Y             Y               1    1
    SCNodeI1  B      UL  UL   U   D   N             N               0    0
    [aspinfo@SCNodeI0]==> cluster
    NODE-NAME  NODE-TYPE   HA-STATE  NODE-ADDR
    SCNodeI0   controller  active    1          <-- this node

On SCNodeI1:

    [aspinfo@SCNodeI1]==> nodes
    NODE      CLASS  AS  CAS  PS  OS  INSTANTIABLE  CLUSTER-MEMBER  ISU  ASU
    SCNodeI0  B      UL  UL   U   D   N             N               0    0
    SCNodeI1  B      UL  UL   I   E   Y             Y               1    1
    [aspinfo@SCNodeI1]==> cluster
    NODE-NAME  NODE-TYPE   HA-STATE  NODE-ADDR
    SCNodeI1   controller  active    2          <-- this node
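The two `cluster` listings above each report a different active controller, which is the signature of a split cluster. A minimal sketch of how that comparison could be automated is shown here; the function names `active_controllers` and `is_split_brain` are hypothetical helpers, not part of SAFplus, and the parsing assumes the whitespace-separated column layout seen in this thread.

```python
from typing import Dict, List


def active_controllers(cluster_output: str) -> List[str]:
    """Return the node names that one `cluster` listing reports as active."""
    active = []
    for line in cluster_output.splitlines():
        fields = line.split()
        # Data rows: NODE-NAME NODE-TYPE HA-STATE NODE-ADDR [<-- this node]
        # The header row is skipped because its third field is "HA-STATE".
        if len(fields) >= 4 and fields[2] == "active":
            active.append(fields[0])
    return active


def is_split_brain(views: Dict[str, str]) -> bool:
    """True if the collected views name more than one active controller."""
    actives = {name for out in views.values() for name in active_controllers(out)}
    return len(actives) > 1


# Views copied from the output posted after killing AMF on SCNodeI0.
views = {
    "SCNodeI0": "NODE-NAME NODE-TYPE HA-STATE NODE-ADDR\n"
                "SCNodeI0 controller active 1 <-- this node",
    "SCNodeI1": "NODE-NAME NODE-TYPE HA-STATE NODE-ADDR\n"
                "SCNodeI1 controller active 2 <-- this node",
}
print(is_split_brain(views))
```

With the healthy listings from the first run (both nodes reporting SCNodeI0 as active), the same check would report no split.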

hoangle commented 8 years ago

Please check the configuration: `

  <peerAddresses port="6799">
     <peer addr="192.168.56.143"/>
     <peer addr="192.168.6.144"/>
  </peerAddresses>

` 192.168.56.143 and 192.168.6.144: both have to be able to communicate with each other.

hungta commented 8 years ago

They can communicate with each other, of course.

hoangle commented 8 years ago

Please attach asp.conf and the ifconfig output from both nodes. I guess the issue is related to the configuration, since I don't see the issue on my side.

hoangle commented 8 years ago

Also, please try reducing the heartbeat interval value to 1000 (1 second).

hungta commented 8 years ago

SCNodeI0:

    root@ubuntu:~/udptest# ifconfig
    eth0      Link encap:Ethernet  HWaddr 08:00:27:ce:ff:a4
              inet addr:10.20.18.108  Bcast:10.20.18.255  Mask:255.255.255.0
              inet6 addr: fe80::a00:27ff:fece:ffa4/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:125519 errors:0 dropped:0 overruns:0 frame:0
              TX packets:4487 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:18259196 (18.2 MB)  TX bytes:495011 (495.0 KB)

    eth1      Link encap:Ethernet  HWaddr 08:00:27:f3:05:e0
              inet addr:192.168.56.143  Bcast:192.168.56.255  Mask:255.255.255.0
              inet6 addr: fe80::a00:27ff:fef3:5e0/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:183645 errors:0 dropped:0 overruns:0 frame:0
              TX packets:125656 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:121421145 (121.4 MB)  TX bytes:22219556 (22.2 MB)

    eth1:11   Link encap:Ethernet  HWaddr 08:00:27:f3:05:e0
              inet addr:169.254.100.1  Bcast:169.254.100.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:94656 errors:0 dropped:0 overruns:0 frame:0
              TX packets:94656 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:15790963 (15.7 MB)  TX bytes:15790963 (15.7 MB)

asp.conf:

    # OpenClovis Version 4.2.0

    # NODENAME - This specifies the name of the current node.
    export NODENAME=SCNodeI0

    # DEFAULT_NODEADDR - This is the slot number that will be used for this node.
    # It comes from the SLOT_... definitions made in the target.conf. This is
    # used unless the AUTO_ASSIGN_NODEADDR is enabled.
    export DEFAULT_NODEADDR=1

    # AUTO_ASSIGN_NODEADDR - This can be enabled if you want the slot number
    # to come from the chassis, based on where it is physically located
    # instead of having it preassigned.
    #
    # To disable this feature, it can be undefined, or defined as "disable" or
    # "no". To enable this, it can be set as "physical-slot". If enabled, it
    # will attempt to retrieve the value using the IPMI driver. If this
    # attempt fails, either due to an IPMI error or the system is not chassis
    # based, the node address will fall back to the definition in DEFAULT_NODEADDR.
    export AUTO_ASSIGN_NODEADDR=

    # SAHPI_UNSPECIFIED_DOMAIN_ID -
    export SAHPI_UNSPECIFIED_DOMAIN_ID=UNDEFINED

    # OPENHPI_CONF - This specifies the location of openhpi.conf, and
    # is necessary for chassis management to function
    export OPENHPI_CONF="${ASP_DIR}/etc/openhpi.conf"

    # MIBDIRS - This specifies the location of the standard and custom
    # SNMP MIBs.
    export MIBDIRS="${ASP_DIR}/share/snmp/mibs"

    # SNMP_TRAP_ADDR - This is the IP address of the network management station.
    # This is important if you are using SNMP traps to send alarms to a
    # management station. This value is originally set based on the TRAP_IP value
    # in the target.conf file.
    export SNMP_TRAP_ADDR=127.0.0.1

    # LINK_NAME - This is the name of network device to be used by the cluster.
    # It will likely be eth0 or eth1 in linux. This value is originally set
    # based on the value in the target.conf file.
    export LINK_NAME=eth1

    # TIPC_NETID - The netid is used by tipc to keep different clusters separate.
    # If you have multiple models running on the same network, they should
    # each have a unique TIPC_NETID number. This value is originally set based
    # on the value in the target.conf file.
    export TIPC_NETID=1340

    # ASP_SIMULATION - This is a boolean value (0/1) specifying whether you are
    # running an cluster in a simulation or not. Setting this to 1 allows you
    # to run multiple nodes on the same system.
    #
    # For this to work, there must be a simulated ethernet device for each node
    # before the nodes are brought up. For example, on a two node cluster, you
    # could run these commands to prepare the ethernet devices:
    #     ifconfig eth0:1 10.0.0.1
    #     ifconfig eth0:2 10.0.0.2
    #
    # The numbers appended to the eth0: (eg eth0:X ) correlate to the slot
    # numbers assigned to those nodes. So, if your two node cluster had one node
    # in slot one, and the other node in slot 3, you would need to type this
    # instead:
    #
    #     ifconfig eth0:1 10.0.0.1
    #     ifconfig eth0:3 10.0.0.2
    #
    # This flag can be turned on while running the sdk-4.2/src/SAFplus/configure
    # script with the --with-asp-simulation flag, or by setting it directly here.
    export ASP_SIMULATION=0

    # SYSTEM_CONTROLLER - This is set to 1 when the current node is a system
    # controller node. For all others, this is set to 0. A system controller
    # node runs a few extra services, such as snmpd, and chassis management.
    export SYSTEM_CONTROLLER=1

    # Run with the following valgrind command.
    export ASP_VALGRIND_CMD=""

    export BUILD_TIPC=1

    export CL_LOG_STREAM_ENABLE=DEBUG
    export ASP_UDP_USE_EXISTING_IP=true
    export ASP_UDP_LINK_NAME=eth1
    export ASP_UDP_SUBNET=192.168.56.0/24

SCNodeI1:

    root@ubuntu:~/udptest# ifconfig
    eth0      Link encap:Ethernet  HWaddr 08:00:27:ad:7a:aa
              inet addr:10.20.18.96  Bcast:10.20.18.255  Mask:255.255.255.0
              inet6 addr: fe80::a00:27ff:fead:7aaa/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:126195 errors:0 dropped:0 overruns:0 frame:0
              TX packets:3689 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:18323178 (18.3 MB)  TX bytes:420944 (420.9 KB)

    eth1      Link encap:Ethernet  HWaddr 08:00:27:c6:d6:cf
              inet addr:192.168.56.144  Bcast:192.168.56.255  Mask:255.255.255.0
              inet6 addr: fe80::a00:27ff:fec6:d6cf/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:184072 errors:0 dropped:95 overruns:0 frame:0
              TX packets:102100 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:119676854 (119.6 MB)  TX bytes:20467196 (20.4 MB)

    eth1:12   Link encap:Ethernet  HWaddr 08:00:27:c6:d6:cf
              inet addr:169.254.100.2  Bcast:169.254.100.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:65536  Metric:1
              RX packets:83087 errors:0 dropped:0 overruns:0 frame:0
              TX packets:83087 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:17083232 (17.0 MB)  TX bytes:17083232 (17.0 MB)

asp.conf:

    # OpenClovis Version 4.2.0

    # NODENAME - This specifies the name of the current node.
    export NODENAME=SCNodeI1

    # DEFAULT_NODEADDR - This is the slot number that will be used for this node.
    # It comes from the SLOT_... definitions made in the target.conf. This is
    # used unless the AUTO_ASSIGN_NODEADDR is enabled.
    export DEFAULT_NODEADDR=2

    # AUTO_ASSIGN_NODEADDR - This can be enabled if you want the slot number
    # to come from the chassis, based on where it is physically located
    # instead of having it preassigned.
    #
    # To disable this feature, it can be undefined, or defined as "disable" or
    # "no". To enable this, it can be set as "physical-slot". If enabled, it
    # will attempt to retrieve the value using the IPMI driver. If this
    # attempt fails, either due to an IPMI error or the system is not chassis
    # based, the node address will fall back to the definition in DEFAULT_NODEADDR.
    export AUTO_ASSIGN_NODEADDR=

    # SAHPI_UNSPECIFIED_DOMAIN_ID -
    export SAHPI_UNSPECIFIED_DOMAIN_ID=UNDEFINED

    # OPENHPI_CONF - This specifies the location of openhpi.conf, and
    # is necessary for chassis management to function
    export OPENHPI_CONF="${ASP_DIR}/etc/openhpi.conf"

    # MIBDIRS - This specifies the location of the standard and custom
    # SNMP MIBs.
    export MIBDIRS="${ASP_DIR}/share/snmp/mibs"

    # SNMP_TRAP_ADDR - This is the IP address of the network management station.
    # This is important if you are using SNMP traps to send alarms to a
    # management station. This value is originally set based on the TRAP_IP value
    # in the target.conf file.
    export SNMP_TRAP_ADDR=127.0.0.1

    # LINK_NAME - This is the name of network device to be used by the cluster.
    # It will likely be eth0 or eth1 in linux. This value is originally set
    # based on the value in the target.conf file.
    export LINK_NAME=eth1

    # TIPC_NETID - The netid is used by tipc to keep different clusters separate.
    # If you have multiple models running on the same network, they should
    # each have a unique TIPC_NETID number. This value is originally set based
    # on the value in the target.conf file.
    export TIPC_NETID=1340

    # ASP_SIMULATION - This is a boolean value (0/1) specifying whether you are
    # running an cluster in a simulation or not. Setting this to 1 allows you
    # to run multiple nodes on the same system.
    #
    # For this to work, there must be a simulated ethernet device for each node
    # before the nodes are brought up. For example, on a two node cluster, you
    # could run these commands to prepare the ethernet devices:
    #     ifconfig eth0:1 10.0.0.1
    #     ifconfig eth0:2 10.0.0.2
    #
    # The numbers appended to the eth0: (eg eth0:X ) correlate to the slot
    # numbers assigned to those nodes. So, if your two node cluster had one node
    # in slot one, and the other node in slot 3, you would need to type this
    # instead:
    #
    #     ifconfig eth0:1 10.0.0.1
    #     ifconfig eth0:3 10.0.0.2
    #
    # This flag can be turned on while running the sdk-4.2/src/SAFplus/configure
    # script with the --with-asp-simulation flag, or by setting it directly here.
    export ASP_SIMULATION=0

    # SYSTEM_CONTROLLER - This is set to 1 when the current node is a system
    # controller node. For all others, this is set to 0. A system controller
    # node runs a few extra services, such as snmpd, and chassis management.
    export SYSTEM_CONTROLLER=1

    # Run with the following valgrind command.
    export ASP_VALGRIND_CMD=""

    export BUILD_TIPC=1

    export CL_LOG_STREAM_ENABLE=DEBUG
    export ASP_UDP_USE_EXISTING_IP=true
    export ASP_UDP_LINK_NAME=eth1
    export ASP_UDP_SUBNET=192.168.56.0/24
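Both asp.conf files set `ASP_UDP_LINK_NAME=eth1` and `ASP_UDP_SUBNET=192.168.56.0/24`, and the ifconfig output shows eth1 at 192.168.56.143 and 192.168.56.144. As a rough cross-check of those two pieces of evidence, the sketch below reads the `export` lines and tests whether an interface address falls inside the configured subnet. The helper names `parse_exports` and `address_in_udp_subnet` are hypothetical, and how SAFplus itself interprets these variables is not confirmed by this thread.

```python
import ipaddress
import re


def parse_exports(conf_text: str) -> dict:
    """Collect `export KEY=VALUE` assignments from asp.conf-style text."""
    pairs = re.findall(r"export\s+(\w+)=(\S*)", conf_text)
    # Strip surrounding double quotes such as in ASP_VALGRIND_CMD="".
    return {key: value.strip('"') for key, value in pairs}


def address_in_udp_subnet(conf_text: str, link_addr: str) -> bool:
    """True if link_addr lies inside the ASP_UDP_SUBNET named in the config."""
    env = parse_exports(conf_text)
    subnet = ipaddress.ip_network(env["ASP_UDP_SUBNET"])
    return ipaddress.ip_address(link_addr) in subnet


# Relevant lines copied from the asp.conf files posted above.
conf = """
export LINK_NAME=eth1
export ASP_UDP_USE_EXISTING_IP=true
export ASP_UDP_LINK_NAME=eth1
export ASP_UDP_SUBNET=192.168.56.0/24
"""

for addr in ("192.168.56.143", "192.168.56.144", "10.20.18.108"):
    print(addr, address_in_udp_subnet(conf, addr))
```

The eth1 addresses of both nodes pass this check, while the eth0 address does not, which suggests the node-side configuration is consistent and points the search elsewhere.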

hoangle commented 8 years ago

Yeah, that's why I asked you to check the configuration:

    <peerAddresses port="6799">
      <peer addr="192.168.56.143"/>
      <peer addr="192.168.6.144"/>
    </peerAddresses>

It should be changed to:

    <peerAddresses port="6799">
      <peer addr="192.168.56.143"/>
      <peer addr="192.168.56.144"/>
    </peerAddresses>
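A typo like 192.168.6.144 is easy to miss by eye. One way to catch it, sketched here under the assumption that all peers are expected to sit in the subnet given by `ASP_UDP_SUBNET`, is to parse the `peerAddresses` fragment and list any peer outside that subnet. The function name `peers_outside_subnet` is hypothetical, and sharing a subnet is only a heuristic: routed peers on other subnets could be legitimate in a different deployment.

```python
import ipaddress
import xml.etree.ElementTree as ET


def peers_outside_subnet(fragment: str, subnet: str) -> list:
    """Return peer addresses in the XML fragment that are not inside subnet."""
    net = ipaddress.ip_network(subnet)
    root = ET.fromstring(fragment)
    return [
        peer.get("addr")
        for peer in root.iter("peer")
        if ipaddress.ip_address(peer.get("addr")) not in net
    ]


# The fragment as originally configured, and as corrected in this thread.
broken = (
    '<peerAddresses port="6799">'
    '<peer addr="192.168.56.143"/>'
    '<peer addr="192.168.6.144"/>'
    "</peerAddresses>"
)
fixed = broken.replace("192.168.6.144", "192.168.56.144")

print(peers_outside_subnet(broken, "192.168.56.0/24"))
print(peers_outside_subnet(fixed, "192.168.56.0/24"))
```

Run against the original fragment, the check flags the mistyped peer; the corrected fragment yields an empty list.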

hungta commented 8 years ago

Yes, it was a fault in the UDP configuration. Additionally, reducing the heartbeat interval makes the failover go faster. Thanks.
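To illustrate why a shorter heartbeat interval speeds up failover, here is a back-of-the-envelope sketch. It assumes a simple model in which a peer is declared failed after a fixed number of consecutive missed heartbeats; that model, the function name `worst_case_detection_ms`, and the reading of the posted values 3000 and 3 as interval (ms) and retry count are all assumptions, not confirmed SAFplus behavior.

```python
def worst_case_detection_ms(interval_ms: int, retries: int) -> int:
    """Upper bound on detection time if failure needs `retries` missed beats."""
    return interval_ms * retries


# 3000 ms is the value from the posted clTransport.xml; 1000 ms is the
# value suggested in this thread. The retry count of 3 is an assumption.
for interval_ms in (3000, 1000):
    print(interval_ms, worst_case_detection_ms(interval_ms, retries=3))
```

Under this model, dropping the interval from 3000 ms to 1000 ms cuts the detection bound from 9 seconds to 3 seconds, at the cost of more heartbeat traffic.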