aristanetworks / atd-public


Datacenter-nocvp topology returns 502 error when accessing lab interface #226

Closed grybak-arista closed 3 years ago

grybak-arista commented 3 years ago

Describe the bug The topologies/datacenter-latest/Datacenter-nocvp.yml topology does not complete the ATD post-deployment setup process. Accessing the lab module page results in a 502 Bad Gateway error.

To Reproduce

  1. Deploy topologies/datacenter-latest/Datacenter-nocvp.yml (NOT from the vATD public module page, but with cloud-deploy)
  2. Attempt to access a lab module using the FQDN or IP address, e.g.
    http://topo-dff80b9dccb65e42.atd.arista.com/module?lab=ucn-mlag
    http://54.176.206.212/module?lab=ucn-mlag

Expected behavior Expected to see the normal UCN lab module landing page.


Additional context This issue is relevant to the upcoming vATD live public release: the current setup deploys topologies/datacenter-latest/Datacenter.yml, which includes an unused CVP node, in order to allow the UCN labs to run. To reduce cost, we would like to avoid deploying the CVP node for the UCN labs.

From the information provided below, this appears to be an issue with the ATD services on startup. The cloud-deploy process completes successfully, and all nodes are directly reachable over the CLI; only the lab landing pages return the error. The sslUpdater and labModule services appear to be reporting a problem.

Logging in to the jump host and checking the status of the ATD services yields the following output:

arista@jump-host:~$ sudo service atdServiceUpdater status
[sudo] password for arista:
● atdServiceUpdater.service - Automatically checks ATD Repo and updates any specified changed ATD services
   Loaded: loaded (/lib/systemd/system/atdServiceUpdater.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-11-12 22:54:18 UTC; 10min ago

Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Has not changed sslUpdater.py
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Has not changed sslUpdater.service
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Restarted systemctl daemon
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Enabled sslUpdater
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Has not changed labModule.service
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Has not changed labModule.py
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Restarted systemctl daemon
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Enabled labModule
Nov 12 22:54:18 ip-10-33-6-59 atdServiceUpdater.py[1146]: [OK]      Complete!
Nov 12 22:54:18 ip-10-33-6-59 systemd[1]: Started Automatically checks ATD Repo and updates any specified changed ATD services.

arista@jump-host:~$ sudo service atdFiles status
● atdFiles.service - Automatically downloads updated ATD files and documentation
   Loaded: loaded (/lib/systemd/system/atdFiles.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-11-12 22:57:56 UTC; 7min ago
 Main PID: 4369 (node)
    Tasks: 14 (limit: 4682)
   CGroup: /system.slice/atdFiles.service
           ├─4369 /usr/bin/node /usr/lib/node_modules/forever/bin/monitor /opt/webssh2/app/index.js
           └─4413 /usr/bin/node /opt/webssh2/app/index.js

Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: ype1/urw/helvetic/uhvbo8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/urw/ti
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: mes/utmb8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmr8a.pfb>
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: </usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmri8a.pfb>
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: Output written on ATD.pdf (189 pages, 9034487 bytes).
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: Transcript written on ATD.log.
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: Latexmk: Index file 'ATD.idx' was written
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: Latexmk: Log file says output to 'ATD.pdf'
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: Latexmk: All targets (ATD.pdf) are up-to-date
Nov 12 22:57:56 ip-10-33-6-59 atdFiles.sh[1126]: make[1]: Leaving directory '/tmp/atd/topologies/datacenter-latest/labguides/build/latex'
Nov 12 22:57:56 ip-10-33-6-59 systemd[1]: Started Automatically downloads updated ATD files and documentation.

arista@jump-host:~$ sudo service gitConfigletSync status
● gitConfigletSync.service - Automatically updates configlets
   Loaded: loaded (/lib/systemd/system/gitConfigletSync.service; disabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-11-12 22:57:57 UTC; 7min ago
  Process: 4465 ExecStart=/usr/local/bin/gitConfigletSync.py (code=killed, signal=TERM)

Nov 12 22:57:56 ip-10-33-6-59 systemd[1]: Starting Automatically updates configlets...
Nov 12 22:57:56 ip-10-33-6-59 /gitConfigletSync.py[4465]: [OK]      Starting...
Nov 12 22:57:56 ip-10-33-6-59 /gitConfigletSync.py[4465]: [INFO]    CVP is not present in this topology, disabling gitConfigletSync
Nov 12 22:57:56 ip-10-33-6-59 gitConfigletSync.py[4465]: Removed /etc/systemd/system/multi-user.target.wants/gitConfigletSync.service.
Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Stopped Automatically updates configlets.

arista@jump-host:~$ sudo service cvpUpdater status
● cvpUpdater.service - Automatically configures CVP with EOS devices depending on the topology
   Loaded: loaded (/lib/systemd/system/cvpUpdater.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-11-12 22:57:57 UTC; 7min ago
  Process: 4466 ExecStart=/usr/local/bin/cvpUpdater.py (code=exited, status=0/SUCCESS)

Nov 12 22:57:56 ip-10-33-6-59 systemd[1]: Starting Automatically configures CVP with EOS devices depending on the topology...
Nov 12 22:57:56 ip-10-33-6-59 /cvpUpdater.py[4466]: [OK]      Starting...
Nov 12 22:57:56 ip-10-33-6-59 /cvpUpdater.py[4466]: [INFO]    CVP is not present in this topology, preventing future run of cvpUpdater
Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Started Automatically configures CVP with EOS devices depending on the topology.

arista@jump-host:~$ sudo service sslUpdater status
● sslUpdater.service - Automatically updates the CVP self-signed cert
   Loaded: loaded (/lib/systemd/system/sslUpdater.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2020-11-12 23:04:58 UTC; 18s ago
  Process: 5291 ExecStart=/usr/local/bin/sslUpdater.py (code=exited, status=1/FAILURE)

arista@jump-host:~$ sudo service labModule status
● labModule.service - Creates a new lab UI for modules
   Loaded: loaded (/lib/systemd/system/labModule.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-11-12 22:57:57 UTC; 7min ago
  Process: 4541 ExecStart=/usr/local/bin/labModule.py (code=exited, status=0/SUCCESS)

Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Starting Creates a new lab UI for modules...
Nov 12 22:57:57 ip-10-33-6-59 /labModule.py[4541]: [OK]      Starting...
Nov 12 22:57:57 ip-10-33-6-59 /labModule.py[4541]: [OK]      The default_lab parameter was not found in ACCESS_INFO.yaml, exiting...
Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Started Creates a new lab UI for modules.
networkRob commented 3 years ago

@grybak-arista that page is only for the vATDs and not standard ATDs. This is expected behavior.

It does not run because there is no app: parameter in ACCESS_INFO.yaml, which is required to run that application.

grybak-arista commented 3 years ago

@networkRob The issue we are seeing in the Datacenter-nocvp.yml topology is affecting vATD. When we allow our cache pool to deploy the -nocvp topology it is unreachable from the vATD frontend. When we manually access the page by entering the expected address in a browser, it returns the 502 error.

At the moment, we are using the Datacenter.yml topology for all deployments for the vATD pool, though we do not need the CVP host. So testing from the vATD frontend is giving a false success because it is not actually using the -nocvp topology right now.

Since vATD is going live very soon, we need to try to get this worked out as soon as possible.

grybak-arista commented 3 years ago

As an aside but in reference to the -nocvp topology, I would suggest changing the topology name in Datacenter-nocvp.yml to distinguish it from the Datacenter.yml topology name. Something like datacenter-latest-nocvp should be fine.

networkRob commented 3 years ago

@grybak-arista Based on this log you provided:

● labModule.service - Creates a new lab UI for modules
   Loaded: loaded (/lib/systemd/system/labModule.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Thu 2020-11-12 22:57:57 UTC; 7min ago
  Process: 4541 ExecStart=/usr/local/bin/labModule.py (code=exited, status=0/SUCCESS)

Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Starting Creates a new lab UI for modules...
Nov 12 22:57:57 ip-10-33-6-59 /labModule.py[4541]: [OK]      Starting...
Nov 12 22:57:57 ip-10-33-6-59 /labModule.py[4541]: [OK]      The default_lab parameter was not found in ACCESS_INFO.yaml, exiting...
Nov 12 22:57:57 ip-10-33-6-59 systemd[1]: Started Creates a new lab UI for modules.

The deployment server is not providing the app: <lab_type> parameter, e.g. app: ucn-mlag. This parameter is needed for labModule to run successfully; labModule configures all the EOS nodes in place of CVP when CVP is not present in the topology. Also, if the app: parameter is there, it will start the labUI application, which is the endpoint application for http://<fqdn_or_ip>/module?lab=<lab_module>
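The gating described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual labModule.py: the decision is driven by whether the deployment server wrote an app: entry into ACCESS_INFO.yaml (passed here as an already-parsed dict), and the "labUI" service name is a stand-in taken from the comment above.

```python
def lab_services_to_start(access_info):
    """Decide which lab services can run, given the parsed contents of
    ACCESS_INFO.yaml (passed here as a plain dict for illustration)."""
    services = []
    # labModule only proceeds when the deployment server provided an
    # app: <lab_type> entry, e.g. app: ucn-mlag
    lab_type = access_info.get("app")
    if lab_type is None:
        # Mirrors the observed behavior: the service exits cleanly and
        # nothing serves /module?lab=..., so nginx returns 502.
        return services
    # With app: present, labModule configures the EOS nodes in place of
    # CVP and starts the labUI endpoint that serves
    # http://<fqdn_or_ip>/module?lab=<lab_module>
    services.append(("labUI", lab_type))
    return services

# A topology deployed without app: -> nothing starts, the module page 502s
print(lab_services_to_start({"topology": "datacenter-latest"}))
# With app: ucn-mlag -> the labUI endpoint is started for that lab
print(lab_services_to_start({"topology": "datacenter-latest", "app": "ucn-mlag"}))
```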

Do you have an active topo for me to investigate with?

networkRob commented 3 years ago

Ran a test on a standard datacenter-latest topology with the following actions:

  1. Removed any reference to CVP in ACCESS_INFO.yaml
  2. Added the following key/value to ACCESS_INFO.yaml: app: ucn-mlag
  3. Rebooted the jumphost to simulate a topology startup

Everything worked as expected.

grybak-arista commented 3 years ago

Running some more tests on my side just to make sure.

grybak-arista commented 3 years ago

We had to do some extra parsing due to the shared topology_name in the two topologies. When I tried simply changing the topology_name in the -nocvp version, the 502 error showed up. There must be something in the scripts that looks for datacenter-latest, and datacenter-latest-nocvp throws it off.

Added some extra handling in the lab pool code to make sure the nocvp version is used when the lab is a UCN lab. Leaving the nocvp topology_version as datacenter-latest for now.
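The lab pool handling could look roughly like the sketch below. All names here are hypothetical (the pool code lives outside this repo); the only grounded detail is that UCN labs should select Datacenter-nocvp.yml while the shared topology_name stays datacenter-latest.

```python
# Hypothetical sketch of the lab pool selection logic: UCN labs get the
# -nocvp topology file, everything else keeps the CVP-enabled one.
UCN_LABS = {"ucn-mlag"}  # assumption: the set of UCN lab module names

def topology_file_for(lab):
    """Pick the topology file to deploy for a given lab module."""
    if lab in UCN_LABS:
        return "topologies/datacenter-latest/Datacenter-nocvp.yml"
    return "topologies/datacenter-latest/Datacenter.yml"

print(topology_file_for("ucn-mlag"))
```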

Going to close this issue since it seems to be working with the changes to the lab pool.

networkRob commented 3 years ago

Since the datacenter-latest topologies with and without CVP leverage the same content, they both target the topologies/datacenter-latest directory. This reduces the overhead on our end.