STARTcloud / hcl_domino_standalone_provisioner

This spins up a standalone Domino instance with additional HCL services like Traveler, NomadWeb, etc.
Apache License 2.0

v0.1.22 - Timeout for hcl_domino_vagrant_rest_api #4

Closed: Aszusz closed this 1 month ago

Aszusz commented 2 months ago

Tested Provisioners v 0.1.22 on Windows 11 using:

The server does not finish provisioning. Last console output is:

TASK [hcl_domino_vagrant_rest_api : Checking Vagrant CRUD Rest API is listening on port 8080] ***
fatal: [9275--mercury.planets.com]: FAILED! => changed=false 
  elapsed: 60
  msg: Timeout waiting for 8080 to respond
[9275--mercury.planets.com] TASK: hcl_domino_vagrant_rest_api : Checking Vagrant CRUD Rest API is listening on port 8080 (debug)> Use exit() or Ctrl-D (i.e. EOF) to exit
[9275--mercury.planets.com] TASK: hcl_domino_vagrant_rest_api : Checking Vagrant CRUD Rest API is listening on port 8080 (debug)> User interrupted execution
Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.
[SHI]: 'vagrant up' stopped with exit code: 1, elapsed time: 00:00:06
[SHI]: Server destroyed

Attaching full server instance (minus installers) and a complete console log: 8760.zip console-log.txt

JoelProminic commented 2 months ago

Samuel sent some instructions for checking the Vagrant REST interface logs in the SHI instances. (I'd like to add this to a wiki, but I'm not sure which repository would be best.)

Yesterday we had some issues finding the REST-Interface logs on a VM provisioned by SHI/demo-tasks. By default the REST-Interface writes logs to the console; in the Vagrant-REST-interface project we redirect that output to the destination log file (logs/output.log). This is different in demo-tasks, where we use a systemd service to run the REST-interface, which seems better.

So, on the demo-tasks (SHI) VM, you can tail the REST-interface logs using: sudo journalctl -u restapi -f

To set the logging level to debug you can update the systemd unit at: /etc/systemd/system/restapi.service

Set the logging.level parameter like this: -Dlogging.level.root=DEBUG
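For example, this could be done with a systemd drop-in override (a sketch; the actual ExecStart line and jar path in restapi.service may differ, so adjust to match the installed unit):

```ini
# /etc/systemd/system/restapi.service.d/debug.conf
[Service]
# Clear the inherited ExecStart, then restate it with the debug flag added
ExecStart=
ExecStart=/usr/bin/java -Dlogging.level.root=DEBUG -jar /opt/rest-interface/rest-interface.jar
```

After editing, apply it with `sudo systemctl daemon-reload && sudo systemctl restart restapi`.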

Making the REST-interface write logs to a file by default: we could also update the REST-Interface scripts to write logs to a file by default, or change the systemd unit definition to redirect the output to a file. Please let me know if I should make these updates.

JoelProminic commented 2 months ago

We confirmed that the Vagrant REST API could start on a Windows VM in the meeting today. I'm thinking this is an issue with the timeout. We are doing a test now with the timeout significantly increased. Is there a better way to handle this check in Ansible?
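One option is to replace the single long wait with a retried HTTP check that fails with the actual response error rather than a bare port timeout. A sketch (the task name and port are from the log above; the retry counts are assumptions, not the role's current values):

```yaml
- name: Checking Vagrant CRUD Rest API is listening on port 8080
  ansible.builtin.uri:
    url: http://127.0.0.1:8080
    status_code: 200
  register: restapi_check
  # Retry for up to 30 * 10 = 300 seconds, reporting the HTTP error on failure
  retries: 30
  delay: 10
  until: restapi_check is succeeded
```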

Aszusz commented 2 months ago

This is still failing, regardless of the timeout.

JoelProminic commented 2 months ago

If we don't have a chance to review this in a meeting, please do some debugging on the failing VM instance. It would be better to run this immediately after the error, but it should still be useful if you haven't restarted yet.

# Check service
service restapi status
# check port
telnet 127.0.0.1 8080
# Check for valid HTML response
wget -O - http://127.0.0.1:8080
# check log
sudo journalctl -u restapi -f

The first two commands check whether the Vagrant REST API is running, the wget call verifies that it returns a valid response, and the last command shows the log (see above) so that we can check for errors.
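The port check above is essentially what the Ansible task does; for scripted debugging it can be reproduced in a few lines of Python (wait_for_port is a hypothetical helper for illustration, not part of the provisioner):

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll until a TCP port accepts connections, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A completed TCP handshake means something is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    raise TimeoutError(f"{host}:{port} did not respond within {timeout}s")
```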

Aszusz commented 2 months ago

After our meeting the reprovisioned machine failed with the same error as before:

TASK [hcl_domino_config : Waiting until the file certstore.nsf is present before continuing] ***
fatal: [8760--venus.planets.com]: FAILED! => changed=false
  elapsed: 300
  msg: Timeout when waiting for file /local/notesdata/certstore.nsf
[8760--venus.planets.com] TASK: hcl_domino_config : Waiting until the file certstore.nsf is present before continuing (debug)> Use exit() or Ctrl-D (i.e. EOF) to exit
[8760--venus.planets.com] TASK: hcl_domino_config : Waiting until the file certstore.nsf is present before continuing (debug)> User interrupted execution
Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.
JoelProminic commented 2 months ago

Did we see this error before? It is not the same as the error reported originally in this issue. In this case, the timeout is in the hcl_domino_config role.

To debug this case, you can try:

# Check that the Domino server is running
sudo su -
service domino status
# Domino console - look for errors
domino log
# check the database
ls -lh /local/notesdata/certstore.nsf 
# check the load
htop
top

Unfortunately, I think this further supports the idea that this is a performance/load issue.

JoelProminic commented 2 months ago

@piotrzarzycki21 and @Aszusz clarified today that this error is triggering on macOS as well. I did a test with Super.Human.Installer Development v0.9.7 today, and confirmed that I got the same error:

I checked the log and confirmed that the service failed to start:

startcloud@test-097:~$ sudo journalctl -u restapi -f
Apr 30 15:48:04 test-097.shi.com java[79961]:         at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:88)
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Main process exited, code=exited, status=1/FAILURE
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Failed with result 'exit-code'.
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Consumed 2.624s CPU time.
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Scheduled restart job, restart counter is at 26.
Apr 30 15:48:04 test-097.shi.com systemd[1]: Stopped restapi.service - Vagrant Crud Rest API.
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Consumed 2.624s CPU time.
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Start request repeated too quickly.
Apr 30 15:48:04 test-097.shi.com systemd[1]: restapi.service: Failed with result 'exit-code'.
Apr 30 15:48:04 test-097.shi.com systemd[1]: Failed to start restapi.service - Vagrant Crud Rest API.

I needed to increase the line count to see the actual error (sudo journalctl -u restapi -f -n 100):

Apr 30 15:48:03 test-097.shi.com java[79961]: 15:48:03,830 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@57855c9a - Registering current configuration as safe fallback point
Apr 30 15:48:04 test-097.shi.com java[79961]: 2024-04-30 15:48:04.359 ERROR --- [           main] o.s.boot.SpringApplication               : Application run failed
Apr 30 15:48:04 test-097.shi.com java[79961]: org.yaml.snakeyaml.scanner.ScannerException: while scanning for the next token
Apr 30 15:48:04 test-097.shi.com java[79961]: found character '\t(TAB)' that cannot start any token. (Do not use \t(TAB) for indentation)
Apr 30 15:48:04 test-097.shi.com java[79961]:  in 'reader', line 12, column 5:
Apr 30 15:48:04 test-097.shi.com java[79961]:                         serverName: test-097/TEST097
Apr 30 15:48:04 test-097.shi.com java[79961]:         ^
Apr 30 15:48:04 test-097.shi.com java[79961]:         at org.yaml.snakeyaml.scanner.ScannerImpl.fetchMoreTokens(ScannerImpl.java:439)

It looks like the problem is that these updates (related to this comment) had tabs instead of spaces.
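YAML forbids tabs in indentation, so this class of failure can be caught before the service even restarts. A small sketch (a hypothetical lint helper in pure Python, so it needs no PyYAML):

```python
def find_tab_indentation(text):
    """Return (line_no, col) pairs where a line's leading whitespace contains a tab."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-space character
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            hits.append((line_no, indent.index("\t") + 1))
    return hits
```

Running this over application.yml (or any templated YAML) before restarting restapi would have pointed straight at line 12, column 5.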

I made this fix locally, and got a different error:

Apr 30 16:57:45 test-097.shi.com java[80019]: 2024-04-30 16:57:45.318 ERROR --- [           main] o.s.boot.SpringApplication               : Application run failed
Apr 30 16:57:45 test-097.shi.com java[80019]: org.yaml.snakeyaml.parser.ParserException: while parsing a block collection
Apr 30 16:57:45 test-097.shi.com java[80019]:  in 'reader', line 8, column 9:
Apr 30 16:57:45 test-097.shi.com java[80019]:             - upload-database
Apr 30 16:57:45 test-097.shi.com java[80019]:             ^
Apr 30 16:57:45 test-097.shi.com java[80019]: expected <block end>, but found '?'
Apr 30 16:57:45 test-097.shi.com java[80019]:  in 'reader', line 12, column 9:
Apr 30 16:57:45 test-097.shi.com java[80019]:             serverName: test-097/TEST097
Apr 30 16:57:45 test-097.shi.com java[80019]:             ^
Apr 30 16:57:45 test-097.shi.com java[80019]:         at org.yaml.snakeyaml.parser.ParserImpl$ParseBlockSequenceEntryKey.produce(ParserImpl.java:549)
Apr 30 16:57:45 test-097.shi.com java[80019]:         at org.yaml.snakeyaml.parser.ParserImpl.peekEvent(ParserImpl.java:165)

I found there were problems with the alignment:

java_user@test-097:~$ cat /opt/rest-interface/config/application.yml 
grails:
    controllers:
        upload:
            maxFileSize: 104857600
            maxRequestSize: 104857600
restinterface:
    capabilities:
        - upload-database
        - upload-html
        - java-domino-gradle
        - nsfodp
        serverName: test-097/TEST097
      baseURL: https://127.0.0.1:8080
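For reference, the structure the parser expects would have serverName and baseURL as keys under restinterface, aligned with capabilities, rather than indented into the list. This is my reconstruction of the intent, not the committed fix:

```yaml
restinterface:
    capabilities:
        - upload-database
        - upload-html
        - java-domino-gradle
        - nsfodp
    serverName: test-097/TEST097
    baseURL: https://127.0.0.1:8080
```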

I made some updates to the position of the conditions. This cleared the hcl_domino_vagrant_rest_api timeout. I'll check back later to see if I get the certstore.nsf error or some other problem.

This doesn't explain why the REST API started up properly when the VM was restarted, though, so I'm wondering if there are more problems here.

JoelProminic commented 2 months ago

I got the certstore.nsf error as well. I confirmed that the database was missing with the above commands in vagrant ssh. I did not see any unusual load on the server.

My theory is that this is related to the setup.json.j2 updates for Domino 12.0.1. I did a comparison of the generated setup.json with an old, working Domino 12.0.1 setup.json. I see some differences, but I don't see any obvious problems. I'll plan to resume investigation on this tomorrow.

JoelProminic commented 2 months ago

I checked the Domino install log like this:

***** /local/notesdata/IBM_TECHNICAL_SUPPORT/autoconfigure.log  - 04/30/2024 05:39:11 PM **************************************************************************** 
Loading AutoConfiguration parameters from the JSON file setup.json                                                                                                    [REGParameterDrivenSetup:4039]
Error processing JSON config file setup.json:                                                                                                                         [REGParameterDrivenSetup:4046]
Missing a name for object member. at offset 3872 ["ServerName":"CN=test-097/O=TEST097",]                                                                              [REGParameterDrivenSetup:4051]
AutoConfiguration error: Validation failed.                                                                                                                           [REGParameterDrivenSetup:4234]

I found that the problem was a trailing comma on these "ServerName" lines.

I also found that we should avoid leading whitespace on the {%... lines - this whitespace gets added to the beginning of the next line and makes the generated file look messy. I made a quick update for this in setup.json.j2.
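Trailing commas are invalid JSON, so the generated setup.json can be sanity-checked before Domino reads it. A sketch (find_json_error is a hypothetical helper; pass it the rendered template output):

```python
import json

def find_json_error(text):
    """Return a human-readable location for the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        # e.lineno/e.colno point at the offending character, e.g. the '}' after a trailing comma
        return f"line {e.lineno}, col {e.colno}: {e.msg}"
    return None
```

Something like `find_json_error(open("/local/notesdata/setup.json").read())` run as a provisioning step would have surfaced this before the certstore.nsf timeout.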

@Aszusz, @piotrzarzycki21, please give this a test when you have a chance.

JoelProminic commented 2 months ago

The updates are included in SHI Development 0.9.8. I ran a quick test, and the install completed successfully. I am seeing some errors when I try to test the FormBuilder and Domino Visual Editor actions, but I think that is probably more relevant for https://github.com/Moonshine-IDE/Super.Human.Installer/issues/99 or https://github.com/STARTcloud/hcl_roles/issues/5

UPDATE: The errors I saw ended up being because I used the 12.0.1 installer with the 12.0.2 fixpack. I restarted my test and this is working well.

Aszusz commented 2 months ago

This is finally working for me on Windows, but with some caveats:

MarkProminic commented 1 month ago

The code that caused this issue (even though the immediate failure turned out to be a trailing comma) is being removed, now that this issue is confirmed working: https://github.com/STARTcloud/hcl_roles/issues/5