bytedance / byteps

A high performance and generic framework for distributed DNN training

Release BytePS docker image support for TF2 #431

Open shaowei-su opened 2 years ago

shaowei-su commented 2 years ago

Is your feature request related to a problem? Please describe.

Based on the available examples, I think BytePS already supports TF 2.0+, but the latest Docker image is still pinned to TF 1.15: https://github.com/bytedance/byteps/blob/master/docker/Dockerfile#L42

Simply upgrading the version number to tensorflow==2.3.0 lets the image build successfully, but [tensorflow2_keras_mnist.py](https://github.com/bytedance/byteps/blob/master/example/tensorflow/tensorflow2_keras_mnist.py) then fails at runtime. Error logs:

[2022-03-09 21:19:02.664335: F byteps/common/core_loops.cc:434] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid argument
Aborted (core dumped)
enable NUMA finetune...
Warning: numactl not found. try `sudo apt-get install numactl`.
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 656, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 281, in <module>
    launch_bps()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 267, in launch_bps
    join_threads(t)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 230, in join_threads
    threads[idx].join()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 40, in join
    raise self.exc
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 31, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 199, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/example/tensorflow/tensorflow2_keras_mnist.py' returned non-zero exit status 134.
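
For anyone reading the last line of the log: the traceback shows bpslaunch starting the worker with `shell=True`, and a POSIX shell reports 128 + N when the command it ran was killed by signal N. The sketch below decodes the status under that convention; `describe_exit_status` is a hypothetical helper written for this comment, not part of BytePS.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Map a shell-style exit status to a signal name when it encodes one."""
    # POSIX shells report 128 + N when a command is killed by signal N.
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit code {status}"


# Status taken from the CalledProcessError in the log above.
print(describe_exit_status(134))
```

On Linux, 134 - 128 = 6, which is SIGABRT. That matches the "Aborted (core dumped)" line, so the non-zero exit is the abort raised by the failed CUDA check in core_loops.cc rather than a separate Python-level error.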

Describe the solution you'd like

Release official TF2-compatible Docker images.

Describe alternatives you've considered

N/A