No such file or directory: 'mount' #107

Closed ghost closed 3 years ago

In GitLab by @infinitewarp on Aug 10, 2020, 10:23

Summary

Sometimes houndigrade raises FileNotFoundError unexpectedly because the mount command appears to have gone missing. It is unclear how this is possible since mount should always be part of the base image.

Steps to Reproduce

Generate activity such that houndigrade runs to inspect an AMI.
Unknown... :confused:

Expected Result

Inspection processes normally.
A message for each image is posted to the configured SQS "inspection results" queue.
houndigrade ECS scales down to 0.

Actual Result

Inspection blows up unexpectedly.
No message is posted to SQS.
houndigrade ECS stays scaled up to 1 forever (or until the reaper forces it down).

Additional context

QE will, at a minimum, run regression tests around the inspection process to verify that we did not regress any previously working functionality.
If we find reliable steps to reproduce the problem, devs will discuss with QE during development to determine whether it's possible and reasonable for QE to build a new test for that problem.
Is there a way to put the inspection startup in a loop and capture the logs to see if it's possible to recreate and observe an actual error run like this?
- @infinitewarp thinks this would be difficult with AWS due to how we configure the ECS task to delete volumes upon completion, but it's worth taking a few minutes to investigate and try. Don't get stuck on this if it's not straightforward.
What should we do to better handle this if we can't find the cause?
Should we catch this exception, report something interesting to stdout so it's logged, and report the images back to SQS with "error" state so houndigrade can cleanly scale down?
Should we have a new/different message format for the inspection results queue to indicate that houndigrade failed to run but the images are neither error nor inspected?
- Yes, let's do this. At the very least, it gets us more information that we desperately need.
Should we simply check that mount exists and is executable before we call it?
- This should not be necessary since the image should be 100% static, but checking before would allow us to report an error and cleanly exit.
- If we do this, we need to put some kind of message on the results SQS queue to tell cloudigrade that houndigrade encountered an unrecoverable error and could not inspect the listed image.
Should we think of some new way to back off and retry running the houndigrade task?
- No, we don't want to back off and retry.
CloudWatch logs in our production AWS account indicate a houndigrade run at 2020-08-07T18:49:54.269Z for image ami-0ff6d47107892dd85 failed with this error, but the next run at 2020-08-07T19:49:25.510Z for images ami-00ff566775b0b66e1 and ami-0ff6d47107892dd85 did not fail.
Note that the second run included the same image as the first run.
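
To make the "check that mount exists" and "catch the exception and report it" ideas above more concrete, here is a minimal sketch. It is not houndigrade's actual code: the function names, the `houndigrade_failed` status value, and the message fields are all hypothetical, and actually posting the message to the results SQS queue is left out. It only shows that the missing executable can be detected up front with `shutil.which`, while still catching `OSError` (the parent of `FileNotFoundError`) in case the binary disappears between the check and the call.

```python
import json
import shutil
import subprocess


def build_failure_message(image_ids, reason):
    """Build a JSON payload saying houndigrade could not inspect these images.

    The field names and the "houndigrade_failed" status are hypothetical;
    the real results-queue format would need to be agreed with cloudigrade.
    """
    return json.dumps(
        {
            "status": "houndigrade_failed",
            "images": list(image_ids),
            "reason": reason,
        }
    )


def run_or_report(command, image_ids):
    """Run a command; return None on success or a failure message on error."""
    # shutil.which returns None unless the file exists and is executable.
    executable = shutil.which(command[0])
    if executable is None:
        reason = f"executable not found: {command[0]!r}"
        print(reason)  # stdout ends up in the task's logs
        return build_failure_message(image_ids, reason)
    try:
        subprocess.run([executable, *command[1:]], check=True)
    except (OSError, subprocess.CalledProcessError) as error:
        # OSError covers FileNotFoundError if the binary vanished after the check.
        print(f"command failed: {error}")
        return build_failure_message(image_ids, str(error))
    return None
```

A caller could pass whatever mount command it currently builds as `command`, and post any non-None return value to the results queue before exiting so that the task can scale down cleanly.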

Log of first run that raised this exception:

--------------------------------------------------------------------------------------------------------
|   timestamp   |                                       message                                        |
|---------------|--------------------------------------------------------------------------------------|
| 1596826194262 | Provided cloud: aws                                                                  |
| 1596826194262 | Provided drive(s) to inspect: (('ami-0ff6d47107892dd85', '/dev/xvdba'),)             |
| 1596826194262 | Checking drive /dev/xvdba                                                            |
| 1596826194263 | Checking partition /dev/xvdba1 for image ami-0ff6d47107892dd85                       |
| 1596826194269 | Traceback (most recent call last):                                                   |
| 1596826194269 |   File "cli.py", line 626, in <module>                                               |
| 1596826194269 |     main()                                                                           |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 829, in __call__ |
| 1596826194269 |     return self.main(*args, **kwargs)                                                |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 782, in main     |
| 1596826194269 |     rv = self.invoke(ctx)                                                            |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke  |
| 1596826194269 |     return ctx.invoke(self.callback, **ctx.params)                                   |
| 1596826194269 |   File "/usr/local/lib/python3.8/site-packages/click/core.py", line 610, in invoke   |
| 1596826194269 |     return callback(*args, **kwargs)                                                 |
| 1596826194269 |   File "cli.py", line 72, in main                                                    |
| 1596826194269 |     mount_and_inspect(drive, image_id, results, debug)                               |
| 1596826194269 |   File "cli.py", line 133, in mount_and_inspect                                      |
| 1596826194269 |     with mount(partition, INSPECT_PATH):                                             |
| 1596826194269 |   File "/usr/lib64/python3.8/contextlib.py", line 113, in __enter__                  |
| 1596826194269 |     return next(self.gen)                                                            |
| 1596826194269 |   File "cli.py", line 216, in mount                                                  |
| 1596826194269 |     mount_result = subprocess.run(                                                   |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 489, in run                        |
| 1596826194269 |     with Popen(*popenargs, **kwargs) as process:                                     |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 854, in __init__                   |
| 1596826194269 |     self._execute_child(args, executable, preexec_fn, close_fds,                     |
| 1596826194269 |   File "/usr/lib64/python3.8/subprocess.py", line 1702, in _execute_child            |
| 1596826194269 |     raise child_exception_type(errno_num, err_msg, err_filename)                     |
| 1596826194269 | FileNotFoundError: [Errno 2] No such file or directory: 'mount'                      |
--------------------------------------------------------------------------------------------------------

Log of subsequent run that did not raise exception:

------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                     message                                                      |
|---------------|------------------------------------------------------------------------------------------------------------------|
| 1596829764189 | Provided cloud: aws                                                                                              |
| 1596829764189 | Provided drive(s) to inspect: (('ami-00ff566775b0b66e1', '/dev/xvdba'), ('ami-0ff6d47107892dd85', '/dev/xvdbb')) |
| 1596829764189 | Checking drive /dev/xvdba                                                                                        |
| 1596829764190 | Checking partition /dev/xvdba1                                                                                   |
| 1596829764420 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764421 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764421 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764422 | RHEL not found via release file on: /dev/xvdba1                                                                  |
| 1596829764423 | RHEL not found via product certificate on: /dev/xvdba1                                                           |
| 1596829764435 | RHEL not found via enabled repos on: /dev/xvdba1                                                                 |
| 1596829764960 | RHEL not found via signed packages on: /dev/xvdba1                                                               |
| 1596829764960 | No syspurpose.json file found on: /dev/xvdba1                                                                    |
| 1596829764961 | RHEL not found on: ami-00ff566775b0b66e1                                                                         |
| 1596829764994 | Checking drive /dev/xvdbb                                                                                        |
| 1596829764995 | Checking partition /dev/xvdbb1                                                                                   |
| 1596829765048 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765049 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765049 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765050 | RHEL found via release file on: /dev/xvdbb1                                                                      |
| 1596829765051 | RHEL found via product certificate on: /dev/xvdbb1                                                               |
| 1596829765060 | RHEL not found via enabled repos on: /dev/xvdbb1                                                                 |
| 1596829765508 | RHEL not found via signed packages on: /dev/xvdbb1                                                               |
| 1596829765510 | RHEL (version 8.0) found on: ami-0ff6d47107892dd85                                                               |
------------------------------------------------------------------------------------------------------------------------------------
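
For anyone correlating these tables with the ISO timestamps quoted above: the `timestamp` column is epoch milliseconds. A small helper like the following (the function name is just for illustration) converts one to the other.

```python
from datetime import datetime, timedelta, timezone


def cloudwatch_ms_to_iso(ms):
    """Convert an epoch-milliseconds timestamp to an ISO 8601 UTC string."""
    # Split into whole seconds and milliseconds to avoid float rounding.
    moment = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    moment += timedelta(milliseconds=ms % 1000)
    return moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")


# Traceback line of the first run:
print(cloudwatch_ms_to_iso(1596826194269))  # 2020-08-07T18:49:54.269Z
# Last line of the subsequent run:
print(cloudwatch_ms_to_iso(1596829765510))  # 2020-08-07T19:49:25.510Z
```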

In GitLab by @infinitewarp on Aug 10, 2020, 10:29

changed the description

In GitLab by @infinitewarp on Aug 10, 2020, 10:30

changed the description

In GitLab by @infinitewarp on Aug 12, 2020, 13:52

changed the description

In GitLab by @infinitewarp on Aug 12, 2020, 13:55

changed the description

In GitLab by @infinitewarp on Aug 12, 2020, 13:59

changed the description

In GitLab by @infinitewarp on Aug 13, 2020, 13:30

assigned to @katherine-black

In GitLab by @pakamble on Aug 24, 2020, 10:18

assigned to @pakamble

In GitLab by @katherine-black on Sep 1, 2020, 15:40

mentioned in merge request !86

In GitLab by @pakamble on Sep 9, 2020, 04:34

Multiple iterations of the image inspection test case have been run to validate this issue. None of the inspection processes failed.
