Is your feature request related to a problem? Please describe.
Based on the available examples, I think BytePS already supports TF 2.0+, but the latest Docker image is still pinned to TF 1.15: https://github.com/bytedance/byteps/blob/master/docker/Dockerfile#L42
[2022-03-09 21:19:02.664335: F byteps/common/core_loops.cc:434] Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid argument
Aborted (core dumped)
enable NUMA finetune...
Warning: numactl not found. try `sudo apt-get install numactl`.
Traceback (most recent call last):
  File "/usr/local/bin/bpslaunch", line 4, in <module>
    __import__('pkg_resources').run_script('byteps==0.2.5', 'bpslaunch')
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 656, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 1453, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 281, in <module>
    launch_bps()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 267, in launch_bps
    join_threads(t)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 230, in join_threads
    threads[idx].join()
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 40, in join
    raise self.exc
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 31, in run
    self.ret = self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/dist-packages/byteps-0.2.5-py3.6-linux-x86_64.egg/EGG-INFO/scripts/bpslaunch", line 199, in worker
    stdout=sys.stdout, stderr=sys.stderr, shell=True)
  File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python3 /usr/local/byteps/example/tensorflow/tensorflow2_keras_mnist.py' returned non-zero exit status 134.
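For reference, exit status 134 in the last line is consistent with the `Aborted (core dumped)` message earlier in the log. The traceback shows the command being run with `shell=True`, and a POSIX shell reports a child killed by signal N as 128 + N; 134 - 128 = 6, which is SIGABRT on Linux. So the Python traceback is only the launcher relaying the abort triggered by the failed CUDA check. A minimal sketch of that decoding, assuming a Linux host (the helper name `describe_exit_status` is hypothetical and not part of BytePS):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper for illustration)."""
    if status > 128:
        # POSIX shells report "killed by signal N" as 128 + N.
        try:
            name = signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"exited with status {status}"
        return f"terminated by {name}"
    return f"exited with status {status}"


# 134 is the status reported by bpslaunch in the traceback above.
print(describe_exit_status(134))
```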
Describe the solution you'd like
Release officially supported, TF2-compatible Docker images.
Describe alternatives you've considered
N/A
Additional context
The error logs above come from a local attempt: after simply changing the pinned version in the Dockerfile to tensorflow==2.3.0, the image builds successfully but fails at runtime when running tensorflow2_keras_mnist.py (https://github.com/bytedance/byteps/blob/master/example/tensorflow/tensorflow2_keras_mnist.py).