clustervision / trinityX

TrinityX is the new generation of ClusterVision's open-source HPC, A/I and cloudbursting platform. It is designed from the ground up to provide all services required in a modern HPC and A/I system, and to allow full customization of the installation.
GNU General Public License v3.0

New cluster install, unable to configure BMC #444

Open wccropper opened 4 days ago

wccropper commented 4 days ago

I have just installed a new controller node and I am unable to access the dashboard on port 8080. I can access the nginx page when accessing plain https without the port. When I curl port 8080, it returns:

titan 10:14:48 [root@titan-master1 ~]# curl https://titan-master1.global.internal:8080
curl: (60) SSL: no alternative certificate subject name matches target host name 'titan-master1.global.internal'
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
titan 10:15:26 [root@titan-master1 ~]# curl -k https://titan-master1.global.internal:8080
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://titan-master1.gloabl.internal:8080/">here</a>.</p>
</body></html>

Below are my current configs. I used the main branch and the INSTALL.sh script.

titan 10:02:49 [root@titan-master1 ~]# luna cluster
+---------------------------------------------------------------------------------+
|                                 Cluster => titan                                |
+---------------------+-----------------------------------------------------------+
| name                | titan                                                     |
| controller          | hostname = titan-master1                                  |
|                     | status = None                                             |
|                     | vendor = None                                             |
|                     | serverport = 7050                                         |
|                     | ipaddress = 192.168.213.254                               |
|                     | luna_config = /trinity/local/luna/daemon/config/luna.ini  |
| technical_contacts  | root@localhost                                            |
| provision_method    | torrent                                                   |
| provision_fallback  | http                                                      |
| nameserver_ip       | 192.168.213.254                                           |
| forwardserver_ip    | 192.168.202.91,192.168.202.92                             |
| domain_search       | cluster,ib,ipmi,global.internal                           |
| ntp_server          | 192.168.213.254                                           |
| security            | False                                                     |
| createnode_ondemand | True                                                      |
| user                | None                                                      |
| debug               | False                                                     |
| packing_bootpause   | True                                                      |
| nextnode_discover   | False                                                     |
+---------------------+-----------------------------------------------------------+
titan 10:06:44 [root@titan-master1 ~]# luna network list
+--------------------------------------------------------------------------------------------+
|                                        << Network >>                                       |
+---+---------+--------------------+------------+-------+------------------+-----------------+
| # | name    | network            | type       | dhcp  | dhcp_range_begin | dhcp_range_end  |
+---+---------+--------------------+------------+-------+------------------+-----------------+
| 1 | cluster | 192.168.213.128/25 | ethernet   | True  | 192.168.213.129  | 192.168.213.169 |
|   |         |                    |            |       |                  |                 |
| 2 | ipmi    | 192.168.213.0/25   | ethernet   | False | --NA--           | --NA--          |
|   |         |                    |            |       |                  |                 |
| 3 | ib      | 10.149.0.0/16      | infiniband | False | --NA--           | --NA--          |
+---+---------+--------------------+------------+-------+------------------+-----------------+
titan 10:06:51 [root@titan-master1 ~]# luna group list
+--------------------------------------------------------------------+
|                             << Group >>                            |
+---+---------+--------------+---------+-------+---------------------+
| # | name    | bmcsetupname | osimage | roles | interfaces          |
+---+---------+--------------+---------+-------+---------------------+
| 1 | compute | compute      | compute | None  | interface = BOOTIF  |
|   |         |              |         |       | network = cluster   |
|   |         |              |         |       | interface = BMC     |
|   |         |              |         |       | network = ipmi      |
| 2 | ubuntu  | --NA--       | ubuntu  | None  | interface = BOOTIF  |
|   |         |              |         |       | network = cluster   |
+---+---------+--------------+---------+-------+---------------------+
titan 10:10:58 [root@titan-master1 ~]# luna node list
+------------------------------------------------------------------------------+
|                                  << Node >>                                  |
+---+---------+---------+---------+----------+----------+--------+-------------+
| # |   name  |  group  | osimage | setupbmc | bmcsetup | status | tpm_present |
+---+---------+---------+---------+----------+----------+--------+-------------+
| 1 | node001 | compute | compute |   True   | compute  |  None  |    False    |
| 2 | node002 | compute | compute |   True   | compute  |  None  |    False    |
| 3 | node003 | compute | compute |   True   | compute  |  None  |    False    |
| 4 | node004 | compute | compute |   True   | compute  |  None  |    False    |
+---+---------+---------+---------+----------+----------+--------+-------------+
titan 10:11:02 [root@titan-master1 ~]# firewall-cmd --list-all
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: bond0 eno1 eno2 eno4
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 3389/tcp 22/tcp 443/tcp 8080/tcp 9090/tcp 9093/tcp 3000/tcp
  protocols:
  forward: yes
  masquerade: yes
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

wccropper commented 4 days ago

I found two issues so far. The first was an ACL on the network switch, which has been fixed to allow 8080. The second was a typo in the external FQDN. I have fixed this in /etc/httpd/conf.d/ood-portal.conf and restarted httpd. I now get the following redirect URL and error:

https://titan-master1.global.internal:8080/pun/sys/dashboard

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator at root@localhost to inform them of the time this error occurred, and the actions you performed just before this error.

More information about this error may be available in the server error log.

wccropper commented 4 days ago

I have searched all files in /etc, /trinity and /opt and fixed the "gloabl" to "global" typo. How do I recreate the SSL certs? I re-ran ansible-playbook controller.yml (with the typo corrected), but this did not update the certs.

wccropper commented 4 days ago

I ended up reinstalling completely (no typo). I now have a semi-working cluster and can access the dashboard, but I am not able to get a node to PXE boot. I am receiving this error on the node: [image]

wccropper commented 4 days ago

I was able to fix this by adding 192.168.213.0/24 to the sources of the trusted zone. The node was set up to use BOOTIF 192.168.213.71 and BMC 192.168.213.71, but it changed the IP to 169.254.0.2. I usually use a dedicated iDRAC, not a shared one, and have each node configured so I can access it in case of issues. Does this not control the IP assigned?

[root@titan-master1 trinityX]# luna network list
+--------------------------------------------------------------------------------------------+
|                                        << Network >>                                       |
+---+---------+--------------------+------------+-------+------------------+-----------------+
| # | name    | network            | type       | dhcp  | dhcp_range_begin | dhcp_range_end  |
+---+---------+--------------------+------------+-------+------------------+-----------------+
| 1 | cluster | 192.168.213.128/25 | ethernet   | True  | 192.168.213.129  | 192.168.213.169 |
|   |         |                    |            |       |                  |                 |
| 2 | ipmi    | 192.168.213.0/25   | ethernet   | False | --NA--           | --NA--          |
|   |         |                    |            |       |                  |                 |
| 3 | ib      | 10.149.0.0/16      | infiniband | False | --NA--           | --NA--          |
+---+---------+--------------------+------------+-------+------------------+-----------------+

wccropper commented 4 days ago

I was able to get iDRAC access back using the racadm tool. The gateway had been wiped. luna/lpower are unable to communicate with the iDRAC. For now I have disabled it from being managed on the nodes, but left it enabled on the group. Any assistance here is appreciated.

aphmschonewille commented 4 days ago

Hi wccropper.

As you have found out (and as is mentioned in several places), trix_external_fqdn must match how the controller is resolved from the outside, which is typically the name you connect to when opening the dashboard on port 8080. The internal server error is a strong indication that the certificate(s) don't match how the server was reached. Since we use the certificate in more places, simply re-running the playbook won't solve this. We can help by explaining how to recreate the certificates; this involves a few steps to make sure other things, like OpenLDAP, don't break. However, I assume you have this sorted, based on your answer above?
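
To illustrate why a one-letter typo in the FQDN produces the curl error quoted earlier, here is a minimal Python sketch of hostname-versus-certificate-name matching. It is a simplified approximation of what TLS clients do, not TrinityX or curl code. The function name `hostname_matches` is hypothetical, and the assumption that the certificate carried the mistyped name "gloabl" is inferred from the redirect URL, not confirmed by the logs.

```python
def hostname_matches(hostname: str, san_dns_names: list[str]) -> bool:
    """Return True if hostname matches any certificate DNS name.

    Simplified rule: labels are compared case-insensitively, and a '*'
    is only honoured as the entire leftmost label of a certificate name.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for name in san_dns_names:
        cert_labels = name.lower().rstrip(".").split(".")
        if len(cert_labels) != len(host_labels):
            continue
        # Every label after the first must match exactly.
        if cert_labels[1:] != host_labels[1:]:
            continue
        if cert_labels[0] == "*" or cert_labels[0] == host_labels[0]:
            return True
    return False


# Assumption: the certificate was issued for the mistyped FQDN.
mistyped_cert_names = ["titan-master1.gloabl.internal"]
corrected_cert_names = ["titan-master1.global.internal"]

requested = "titan-master1.global.internal"
print(hostname_matches(requested, mistyped_cert_names))
print(hostname_matches(requested, corrected_cert_names))
```

Under that assumption, the name the client asks for never appears in the certificate, so verification fails until the certificate itself is reissued. Editing config files alone does not change the names embedded in it.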

It's not clear why your machines don't PXE boot properly. The screenshot shows the error, but we may need logs (e.g. /var/log/luna/luna2-daemon.log) and the group_vars/all.yml to give us a bit more insight. The output of 'lexport -c -e' is also helpful, as it tells us how the IPs were assigned, how the networks are configured, and so on.
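
As a quick sanity check on the addresses posted in this thread, the sketch below uses Python's standard `ipaddress` module to show which of the listed luna networks each address falls in. The network values are copied from the `luna network list` output above. The helper name `networks_containing` is hypothetical, and this only illustrates subnet membership; it does not diagnose the PXE or BMC behaviour.

```python
import ipaddress

# Networks as reported by `luna network list` in this thread.
NETWORKS = {
    "cluster": ipaddress.ip_network("192.168.213.128/25"),
    "ipmi": ipaddress.ip_network("192.168.213.0/25"),
    "ib": ipaddress.ip_network("10.149.0.0/16"),
}


def networks_containing(address: str) -> list[str]:
    """Return the names of the networks that contain the given address."""
    ip = ipaddress.ip_address(address)
    return [name for name, net in NETWORKS.items() if ip in net]


for addr in ("192.168.213.71", "192.168.213.254", "169.254.0.2"):
    ip = ipaddress.ip_address(addr)
    print(addr, networks_containing(addr), "link-local:", ip.is_link_local)

# The /24 added to the trusted zone spans both /25 networks.
trusted = ipaddress.ip_network("192.168.213.0/24")
print(all(NETWORKS[n].subnet_of(trusted) for n in ("cluster", "ipmi")))
```

With these values, 192.168.213.71 lies in the ipmi range and not in the cluster range, while 169.254.0.2 is a link-local address outside all three networks.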

Lastly, the BMC plugin should work with IPMI-compliant machines, so iDRAC shouldn't be a problem. If the gateway was lost, then most likely no gateway was defined for the ipmi network?

In short, I see quite a few messages and am trying to understand what's happening. Can you share logs/data here or by other means?

With kind regards, Antoine