SeattleTestbed / nodemanager

Remote control server for SeattleTestbed nodes
MIT License

Unit tests systematically failing on Linux VirtualBox guest? #114

Closed. aaaaalbert closed this issue 9 years ago

aaaaalbert commented 9 years ago

I have a Linux (Ubuntu 14.04 LTS) guest running Python 2.7.6 in VirtualBox on a Mac OS X 10.6.8 host, and the nodemanager unit tests fail because nmmain.py does not start in --test-mode. It will run just fine when started from the console, but not when called through this Popen statement, https://github.com/SeattleTestbed/nodemanager/blob/master/tests/ut_nm_subprocess.py#L9
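
(For context, that line boils down to a subprocess.Popen call without the --foreground flag, roughly like this; paraphrased, not the verbatim source:)

nmproc = subprocess.Popen([sys.executable, "nmmain.py", "--test-mode"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)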

On the other hand, the "bare-metal" Ubuntu 12.04.3 box with Python 2.7.3 runs the tests just fine!

I'm not sure right now if virtualization, the Python version, or the Linux version is the culprit. I'll run a few tests in the VBox using virtualenv to check the Python versions first.

aaaaalbert commented 9 years ago

The "systematic" bit is that all of the unit tests fail because nmmain doesn't start correctly.

vladimir-v-diaz commented 9 years ago

Testing on Ubuntu 14.04 LTS (non-virtualized) and Python 2.7.6. Not sure if these test failures are expected (you mentioned issues with PlanetLab), but I'm still getting "invalid ip: ..." exceptions.

For example:

Exception (with class '.NMClientException'): RepyArgumentError("Provided destip is not valid! IP: 'f1f891e5db2edeaf18be8708b4ec8d2a825503e4'",)
---

..............................Expected..............................
None
--------------------------------------------------------------------------------
Now stopping subprocess: ut_nm_subprocess.py
Traceback (most recent call last):
  File "ut_nm_subprocess.py", line 35, in <module>
    pidportion = firstline.split(':')[1]
IndexError: list index out of range

I get these test failures whether I run the unit tests individually or with the -a and -m options:

$ python utf.py -a
$ python utf.py -m nm
$ python utf.py -f ut_nm_addfiletovessel.r2py

Result of running an individual unit test:

$ python utf.py -f ut_nm_addfiletovessel.r2py
Testing module: nm
Now starting subprocess: ut_nm_subprocess.py
Generating user keys...
Generating key...
Writing config file...
Writing vessel dictionary...
    Running: ut_nm_addfiletovessel.r2py                         [ FAIL ]
--------------------------------------------------------------------------------
Standard error :
..............................Produced..............................
---
Uncaught exception!
---
Following is a full traceback, and a user traceback.
The user traceback excludes non-user modules. The most recent call is displayed last.

Full debugging traceback:
  "repy.py", line 154, in execute_namespace_until_completion
  "/home/vlad/projects/seattletestbed/nodemanager/RUNNABLE/virtual_namespace.py", line 117, in evaluate
  "/home/vlad/projects/seattletestbed/nodemanager/RUNNABLE/safe.py", line 588, in safe_run
  "dylink.r2py", line 546, in <module>
  "dylink.r2py", line 407, in dylink_dispatch
  "dylink.r2py", line 520, in evaluate
  "/home/vlad/projects/seattletestbed/nodemanager/RUNNABLE/virtual_namespace.py", line 117, in evaluate
  "/home/vlad/projects/seattletestbed/nodemanager/RUNNABLE/safe.py", line 588, in safe_run
  "ut_nm_addfiletovessel.r2py", line 36, in <module>
  "nmclient.r2py", line 298, in nmclient_createhandle

User traceback:
  "dylink.r2py", line 546, in <module>
  "dylink.r2py", line 407, in dylink_dispatch
  "dylink.r2py", line 520, in evaluate
  "ut_nm_addfiletovessel.r2py", line 36, in <module>
  "nmclient.r2py", line 298, in nmclient_createhandle

Exception (with class '.NMClientException'): RepyArgumentError("Provided destip is not valid! IP: 'f1f891e5db2edeaf18be8708b4ec8d2a825503e4'",)
---

..............................Expected..............................
None
--------------------------------------------------------------------------------
Now stopping subprocess: ut_nm_subprocess.py
Traceback (most recent call last):
  File "ut_nm_subprocess.py", line 35, in <module>
    pidportion = firstline.split(':')[1]
IndexError: list index out of range
choksi81 commented 9 years ago

Do you see a repyV2 directory created inside your RUNNABLE directory, and does it contain the bunch of files necessary for the repy runtime? Do you get the same errors when running on Mac OS X or any other OS? I tried running the tests on my OS X again, and they run fine.

vladimir-v-diaz commented 9 years ago

The RUNNABLE/repyV2 subdirectory does exist. It contains 135 modules. I'll also run the unit tests on the Windows machine in the lab.


vladimir-v-diaz commented 9 years ago

The unit tests do pass on Windows 7 (non-virtualized) and Python 2.7.5.

C:\Users\vlad\projects\nodemanager\RUNNABLE>python utf.py -f ut_nm_addfiletovessel.r2py
Testing module: nm
Now starting subprocess: ut_nm_subprocess.py
Generating user keys...
Generating key...
Writing config file...
Writing vessel dictionary...
        Running: ut_nm_addfiletovessel.r2py                         [ PASS ]
Now stopping subprocess: ut_nm_subprocess.py
vladimir-v-diaz commented 9 years ago

Note: Running python utf.py -m nm on Windows 7 (non-virtualized) and Python 2.7.5, all the unit tests pass except ut_nm_fastclient.py

        Running: ut_nm_fastclient.py                                [ FAIL ]
--------------------------------------------------------------------------------
Standard error :
..............................Produced..............................
Traceback (most recent call last):
  File "ut_nm_fastclient.py", line 41, in <module>
    nmhandle = fastnmclient.nmclient_createhandle(getmyip(), 1224)
  File "C:\Users\vlad\projects\nodemanager\RUNNABLE\fastnmclient.py", line 275, in nmclient_createhandle
    raise NMClientException, e
fastnmclient.NMClientException: The connection was refused!

..............................Expected..............................
None
--------------------------------------------------------------------------------
choksi81 commented 9 years ago

Correct! Because of the Affix overriding issue, as noted by Albert, only ut_nm_fastclient.py fails on all platforms. The others pass.

vladimir-v-diaz commented 9 years ago

Addressing the issue with the Nodemanager unit test failures on Linux and Python 2.7.6...

I discovered that starting nmmain.py with both the --test-mode and --foreground command-line options avoids the "systematic" failure. That is, line 9 in ut_nm_subprocess.py was changed to:

nmproc = subprocess.Popen([sys.executable, "nmmain.py", "--test-mode", "--foreground"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

The culprit might be the daemon.daemonize() call (--foreground is false by default, so an attempt to daemonize in nmmain.py is made): https://github.com/SeattleTestbed/nodemanager/blob/master/nmmain.py#L481-L483

On Windows, this function simply returns (essentially --foreground is true): https://github.com/SeattleTestbed/nodemanager/blob/master/daemon.py#L34-L36

On Linux and Python 2.7.6, one of the child processes gets stuck in an infinite loop here: https://github.com/SeattleTestbed/nodemanager/blob/master/daemon.py#L47-L49 The process id of the parent process does not change. If I remove the cause of the infinite loop in daemon.py, the tests pass on Linux.

On other non-Windows systems, something different happens. This module (borrowed code from 2001) needs to be carefully reviewed.
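
For illustration, the pattern described above boils down to roughly the following (a simplified sketch; the real daemon.py differs in details, and the fork structure here is only approximate):

import os
import sys
import time

def daemonize_sketch():
    # First fork: the parent exits so that the child can be adopted by
    # init once the parent is gone.
    if os.fork() > 0:
        sys.exit(0)

    # Child: the old recipe busy-waits until its parent PID becomes
    # exactly 1, taking that as proof that init has adopted it.
    # The loop only exits if getppid() ever returns 1.
    while os.getppid() != 1:
        time.sleep(0.1)

    # ... then detach from the terminal, fork again, close fds, etc.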

JustinCappos commented 9 years ago

Who is the best person to review this?


vladimir-v-diaz commented 9 years ago

I can help with reviews and diagnosing problems (still need to focus on TUF).

Another issue I noticed on Ubuntu 14.04 LTS (non-virtualized) and Python 2.7.6:

nmmain.py would often fail to start the nodemanager on the default port 1224. If port 1224 failed, it would try the next available port in configurations['ports'] and eventually find one to listen on. Unfortunately, the unit tests would then fail to create a handle to these non-1224 ports, because they only ever attempt to create a handle to port 1224: https://github.com/SeattleTestbed/nodemanager/blob/master/tests/ut_nm_addfiletovessel.r2py#L34
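
To illustrate the mismatch, here is a stand-alone sketch of that fallback behavior using plain sockets (not the actual nmmain.py code; the port values are made up):

import socket

def find_listen_port(candidate_ports):
    # Try the default port first, then fall through the remaining
    # configured ports until one can be bound.
    for port in candidate_ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(5)
            return port, sock
        except socket.error:
            sock.close()
    raise RuntimeError("none of the configured ports is available")

# Illustrative values only, not the real configuration.
port, listener = find_listen_port([1224, 2888, 9625])
print("nodemanager would listen on port", port)
# A test that always connects to 1224 fails whenever port != 1224.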

This problematic setup results in unit test failures that show up as Exception (with class '.NMClientException'): RepyArgumentError("Provided destip is not valid! IP: 'f1f891e5db2edeaf18be8708b4ec8d2a825503e4'",), with the nodemanager.old log file indicating that the nodemanager could not listen on port 1224 (and on other ports, in certain cases).

Specifying that ut_nm_subprocess start nmmain.py with --test-mode and --foreground, and using build.py's -r random-ports option (which chooses random nodemanager ports in the 53000-54000 range), results in unit tests that consistently pass on Ubuntu 14.04 and Python 2.7.6.

$ python build.py -t -r
Building into /home/vlad/projects/seattletestbed/nodemanager/RUNNABLE

[ Randomports option was chosen ]
--------------------------------------------------
Randomly chosen ports:  ['52041', '52620', '52302', '52857', '52625']
Chosen random nodemanager port: 53225
--------------------------------------------------

Done building!
$ 

However, the ut_nm_validate_connection_timeout.r2py and ut_nm_validate_connection_limit.r2py unit tests do complain about ports in the 53000-54000 range. All the other unit tests appear to consistently pass.

User traceback:
  "dylink.r2py", line 546, in <module>
  "dylink.r2py", line 407, in dylink_dispatch
  "dylink.r2py", line 520, in evaluate
  "ut_nm_validate_connection_limit.r2py", line 50, in <module>
  "coordinationaffix.r2py", line 190, in openconnection
  "tcprelayaffix.r2py", line 155, in openconnection
  "sockettimeout.r2py", line 301, in timeout_openconnection

Exception (with class 'exception_hierarchy.ResourceForbiddenError'): Provided localport is not allowed! Port: 52728
aaaaalbert commented 9 years ago

Guess what, the init process is to blame! On my Ubuntu 14.04.1 VM, I have one "classical" init process with PID 1, and another one running as init --user that becomes the parent of everything spawned from the GUI session. This second init has a PID other than 1. Therefore, daemon.py's check that the parent PID has become 1, as the indicator that the child process has been adopted by init, never succeeds.

As a workaround, we could treat any change in the ppid (away from the original parent's PID, to whatever new value, not necessarily 1) as the starting signal for the child process; see the sketch below.
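
A minimal sketch of that workaround, just to illustrate the idea (not a patch against daemon.py):

import os
import time

def wait_until_adopted(poll_interval=0.1):
    # Remember who our parent is right after the fork, then wait for
    # that PID to change instead of waiting for it to become exactly 1.
    # This also covers adoption by 'init --user' or any other subreaper.
    original_ppid = os.getppid()
    while os.getppid() == original_ppid:
        time.sleep(poll_interval)

# Note: there is a small race if the parent exits before the first
# getppid() call; a real fix would pass the parent's PID to the child
# explicitly (e.g. recorded before forking).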


However, this issue doesn't seem to be related to VM-versus-bare-metal after all. I'm closing it and will open a new issue with a more accurate description.

aaaaalbert commented 9 years ago

See #115 for an updated writeup of the mechanisms leading to this error, and a fix!