NASA-AMMOS / AIT-DSN

Fix round-robin connections for SLE interfaces #163

Closed nttoole closed 2 years ago

nttoole commented 2 years ago

A user reported that round-robin connection attempts fail for every remaining hostname once the first connection attempt fails.

https://github.com/NASA-AMMOS/AIT-DSN/blob/1856df4bfd7469c0f0cbfd1a16ce7b546fb71f1c/ait/dsn/sle/common.py#L281

We should consider closing and creating a new Socket for each attempt.
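A minimal sketch of that idea, using only the standard-library `socket` module. The function name `connect_round_robin`, its parameters, and the plain blocking socket are assumptions made for illustration, not the actual code in `ait/dsn/sle/common.py`. The point is that each attempt gets its own socket, and a socket whose `connect()` failed is closed rather than reused, since POSIX leaves the state of such a socket unspecified.

```python
import logging
import socket

log = logging.getLogger(__name__)


def connect_round_robin(hostnames, port, timeout=5.0):
    """Try each hostname in order, using a brand-new socket per attempt.

    Hypothetical helper: returns (sock, hostname) for the first host that
    accepts the TCP connection, or raises ConnectionError if none do.
    """
    for hostname in hostnames:
        # Create a fresh socket for every attempt instead of reusing one.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)
        try:
            sock.connect((hostname, port))
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            # Close the failed socket so it is never reused.
            sock.close()
            log.info("Failed to connect to DSN at %s. Trying next hostname.", hostname)
            continue
        log.info("Connection to DSN successful through %s.", hostname)
        return sock, hostname
    raise ConnectionError("Connection failure with DSN. Aborting ...")
```

With this shape, a failure on the first entry of `hostnames` cannot affect later attempts, because no socket state is carried over between iterations.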

nttoole commented 2 years ago

Testing notes


The current code works fine for the original test, which uses made-up hostnames as failure cases.

Config:

               hostnames:
                    - example.hostname.1
                    - atb-ocio-sspsim.jpl.nasa.gov
                    - example.hostname.2

Results:

2022-03-28T16:55:04.015 | INFO     | Failed to connect to DSN at example.hostname.1. Trying next hostname.
2022-03-28T16:55:04.052 | INFO     | Connection to DSN successful through atb-ocio-sspsim.jpl.nasa.gov.
2022-03-28T16:55:04.054 | INFO     | Configuring SLE connection...
2022-03-28T16:55:04.055 | INFO     | SLE connection configuration successful
2022-03-28T16:55:06.058 | INFO     | Sending Bind request ...
2022-03-28T16:55:06.146 | INFO     | Bind successful
2022-03-28T16:55:08.062 | INFO     | Sending data start invocation ...
2022-03-28T16:55:08.163 | INFO     | Start successful
2022-03-28T16:55:08.264 | INFO     | Production Status Report: running

However, if we test with a real host that does not provide an SLE service (e.g. www.google.com), we see the user-reported error: every later hostname fails too, including the valid one.

Config:

               hostnames:
                    - www.google.com
                    - example.hostname.1
                    - atb-ocio-sspsim.jpl.nasa.gov
                    - example.hostname.2

Results:

2022-03-28T16:57:59.750 | INFO     | Failed to connect to DSN at www.google.com: Trying next hostname.
2022-03-28T16:57:59.759 | INFO     | Failed to connect to DSN at example.hostname.1. Trying next hostname.
2022-03-28T16:57:59.802 | INFO     | Failed to connect to DSN at atb-ocio-sspsim.jpl.nasa.gov. Trying next hostname.
2022-03-28T16:57:59.805 | INFO     | Failed to connect to DSN at example.hostname.2. Trying next hostname.
2022-03-28T16:57:59.806 | ERROR    | Connection failure with DSN. Aborting ...

nttoole commented 2 years ago

After applying the patch, re-running with this config:

               hostnames:
                    - www.google.com
                    - example.hostname.1
                    - atb-ocio-sspsim.jpl.nasa.gov
                    - example.hostname.2

...resulted in a successful connection through the third entry of hostnames:

2022-03-29T11:51:14.498 | INFO     | Failed to connect to DSN at www.google.com. Trying next hostname.
2022-03-29T11:51:14.505 | INFO     | Failed to connect to DSN at example.hostname. Trying next hostname.
2022-03-29T11:51:14.663 | INFO     | Connection to DSN successful through atb-ocio-sspsim.jpl.nasa.gov.
2022-03-29T11:51:14.665 | INFO     | Configuring SLE connection...
2022-03-29T11:51:14.666 | INFO     | SLE connection configuration successful
2022-03-29T11:51:16.667 | INFO     | Sending Bind request ...
2022-03-29T11:51:16.775 | INFO     | Bind successful
2022-03-29T11:51:18.669 | INFO     | Sending data start invocation ...
2022-03-29T11:51:18.735 | INFO     | Start successful
2022-03-29T11:51:18.847 | INFO     | Production Status Report: running

nttoole commented 2 years ago

Testing involved running: ait/dsn/bin/examples/raf_api_test

with the SSP service setup detailed here: https://github.com/NASA-AMMOS/AIT-DSN/blob/017854200dc51929dff7c7662bc2f7f52dd8eb34/ait/dsn/bin/examples/raf_api_test.py#L17