box / ClusterRunner

ClusterRunner makes it easy to parallelize test suites across your infrastructure in the fastest and most efficient way possible.
https://clusterrunner.com
Apache License 2.0

ClusterRunner slave died with MemoryError #346

Open pwsouth opened 7 years ago

pwsouth commented 7 years ago

Command: export ATOM_ID="0"; export PROJECT_DIR="/tmp/clusterrunner_build_symlinks/80313cfa-576c-430a-b92c-16597aed1619"; export BUILD_EXECUTOR_INDEX="138"; export EXECUTOR_INDEX="8"; export ARTIFACT_DIR="/home/jenkins/.clusterrunner/artifacts/1516/artifact_27_0"; export MACHINE_EXECUTOR_INDEX="8"; export TESTPATH="$PROJECT_DIR/test/php/integration/modular/file-system/operation/legacy/src/Services/File_System/FolderOperation/Iterator/ItemsForUserCrawlerTest.php"; cd $PROJECT_DIR && PHPUNIT_THREAD_INDEX=$EXECUTOR_INDEX $PROJECT_DIR/ci_phpunit --log-junit $ARTIFACT_DIR/result.xml $TESTPATH && test -f $ARTIFACT_DIR/result.xml && xmllint --noout $ARTIFACT_DIR/result.xml
Exit code: 255
Console output: Mon May 22 11:22:04 PDT 2017
timeout 3600 vendor/phpunit/phpunit/phpunit --log-junit /home/jenkins/.clusterrunner/artifacts/1516/artifact_27_0/result.xml /tmp/clusterrunner_build_symlinks/80313cfa-576c-430a-b92c-16597aed1619/test/php/integration/modular/file-system/operation/legacy/src/Services/File... (total output length: 4374603)

[2017-05-22 11:26:26.799] 20917 INFO    Bld1516-Sub27   cluster_slave   Build 1516, Subjob 27 completed and sent results to master.
[2017-05-22 12:25:00.961] 20917 ERROR   Bld1516-Sub329  unhandled_excep Unhandled exception handler caught exception.
Traceback (most recent call last):
  File "/home/jenkins/ClusterRunnerBuild/app/util/safe_thread.py", line 18, in run
  File "/usr/local/lib/python3.4/threading.py", line 868, in run
  File "/home/jenkins/ClusterRunnerBuild/app/slave/cluster_slave.py", line 302, in _execute_subjob
  File "/home/jenkins/ClusterRunnerBuild/app/slave/subjob_executor.py", line 100, in execute_subjob
  File "/home/jenkins/ClusterRunnerBuild/app/slave/subjob_executor.py", line 144, in _execute_atom_command
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/git.py", line 245, in execute_command_in_project
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 231, in execute_command_in_project
  File "/home/jenkins/ClusterRunnerBuild/app/project_type/project_type.py", line 316, in _read_file_contents_and_close
MemoryError
[2017-05-22 12:25:01.134] 20917 DEBUG   Bld1516-Sub329  unhandled_excep Executing teardown callback: <bound method ClusterSlave._disconnect_from_master of <app.slave.cluster_slave.ClusterSlave object at 0x7fcdef0f7b00>>
[2017-05-22 12:25:01.173] 20917 INFO    Bld1516-Sub329  cluster_slave   Notifying master that this slave is disconnecting.
[2017-05-22 12:25:01.202] 20917 DEBUG   Bld1516-Sub329  unhandled_excep Executing teardown callback: <bound method ClusterSlave._do_build_teardown_and_reset of <app.slave.cluster_slave.ClusterSlave object at 0x7fcdef0f7b00>>
[2017-05-22 12:25:01.202] 20917 INFO    Bld1516-Sub329  cluster_slave   Executing teardown for build 1516.
[2017-05-22 12:25:01.341] 20917 DEBUG   Bld1516-Sub329  git             Executing command in project: export PROJECT_DIR="/tmp/clusterrunner_build_symlinks/80313cfa-576c-430a-b92c-16597aed1619"; sudo rm -rf /box/var/log/phpunit
[2017-05-22 12:25:01.711] 20917 DEBUG   Bld1516-Sub329  git             Command completed with exit code 0.
[2017-05-22 12:25:01.711] 20917 INFO    Bld1516-Sub329  git             Build teardown completed successfully.
[2017-05-22 12:25:01.712] 20917 INFO    Bld1516-Sub329  git             ProjectType teardown complete.
[2017-05-22 12:25:01.712] 20917 INFO    Bld1516-Sub329  cluster_slave   Build teardown complete for build 1516.
[2017-05-22 12:25:01.713] 20917 DEBUG   Bld1516-Sub329  unhandled_excep Executing teardown callback: <function ServiceSubcommand._write_pid_file.<locals>.remove_pid_file at 0x7fcdee2488c8>
[2017-05-22 12:25:01.739] 20917 DEBUG   Bld1516-Sub329  unhandled_excep Executing teardown callback: functools.partial(<bound method EPollIOLoop.add_callback of <tornado.platform.epoll.EPollIOLoop object at 0x7fcdeec887b8>>, callback=<bound method EPollIOLoop.stop of <tornado.platform.epoll.EPollIOLoop object at 0x7fcdeec887b8>>)
[2017-05-22 12:25:01.813] 20917 NOTICE  SlaveTornadoThr subcommand      Slave server was stopped.

tjlee0909 commented 7 years ago

Ah, we do something silly here where we read the entire console output contents of an atom run into memory.

https://github.com/box/ClusterRunner/blob/master/app/project_type/project_type.py#L216

console_output = self._read_file_contents_and_close(output_file)

I haven't looked too closely yet, but we should probably tail the log contents up to some length instead.
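
A rough sketch of what tailing could look like, assuming the output is a regular file on disk. The name read_file_tail and the max_bytes default are hypothetical, not existing ClusterRunner code; the idea is just to seek near the end of the file and read only the last chunk rather than the whole thing.

```python
import os


def read_file_tail(path, max_bytes=64 * 1024):
    """Return at most the last max_bytes of the file at path, decoded as text.

    Hypothetical helper for illustration; not part of ClusterRunner.
    """
    with open(path, 'rb') as f:
        # Find the file size by seeking to the end.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Seek back max_bytes from the end, but never before the start.
        f.seek(max(size - max_bytes, 0))
        data = f.read()
    # The cut may land in the middle of a multi-byte character, so replace
    # undecodable bytes instead of raising.
    return data.decode('utf-8', errors='replace')
```

With something like this, memory use is bounded by max_bytes no matter how large the output file gets (the failing atom above reported a total output length of 4374603). The trade-off is that the start of the output is dropped, so the full file would still need to be kept in the artifact directory for anyone who needs it.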

josephharrington commented 7 years ago

Good point. We could do something similar to what we do for console output.