ObrienlabsDev / machine-learning

Machine Learning - AI - TensorFlow - Keras - NVIDIA - Google
MIT License

Verify issue running tensorflow/tensorflow:latest-gpu on dual RTX A4500 with NVLink (does not occur on dual RTX 4090 PCIe x8): tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence #16

Open obriensystems opened 3 months ago

obriensystems commented 3 months ago

The Docker image was rebuilt when the tensorflow base image changed - so far the issue occurs only in containers built within the past week.

FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

#RUN pip install -U jupyterlab pandas matplotlib
#EXPOSE 8888
#ENTRYPOINT ["jupyter", "lab","--ip=0.0.0.0","--allow-root","--no-browser"]

[+] Building 137.8s (9/9) FINISHED                                                                           docker:default
 => [internal] load build definition from Dockerfile                                                                   0.0s
 => => transferring dockerfile: 285B                                                                                   0.0s
 => [internal] load .dockerignore                                                                                      0.0s
 => => transferring context: 2B                                                                                        0.0s
 => [internal] load metadata for docker.io/tensorflow/tensorflow:latest-gpu                                            1.2s
 => [auth] tensorflow/tensorflow:pull token for registry-1.docker.io                                                   0.0s
 => [1/3] FROM docker.io/tensorflow/tensorflow:latest-gpu@sha256:4ab9ffddd6ffacc9251ac6439f431eb38d66200d3f52397b5d  135.7s
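
Before training starts, it can help to confirm what the runtime actually sees inside the container. A quick check along these lines (a sketch, run with the image's bundled python) lists the visible GPUs and the CUDA/cuDNN versions the image was built against:

import tensorflow as tf

# GPUs visible to TensorFlow inside the container; when the container is
# started with --gpus all on a dual-GPU host this should list two
# PhysicalDevice entries.
print(tf.config.list_physical_devices("GPU"))

# CUDA/cuDNN versions baked into the tensorflow/tensorflow:latest-gpu build
# (the runtime log below shows cuDNN 8906 being loaded).
build = tf.sysconfig.get_build_info()
print(build.get("cuda_version"), build.get("cudnn_version"))
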
2024-03-10 04:23:07.220266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 17782 MB memory:  -> device: 1, name: NVIDIA RTX A4500, pci bus id: 0000:02:00.0, compute capability: 8.6
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 ━━━━━━━━━━━━━━━━━━━━ 3s 0us/step
Epoch 1/100
2024-03-10 04:24:13.940526: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
2024-03-10 04:24:13.966146: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 300ms/step - accuracy: 0.0478 - loss: 11.3363
2024-03-10 04:24:26.203680: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:26.203757: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:26.206671: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:26.206710: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 75s 417ms/step - accuracy: 0.0485 - loss: 11.0441
Epoch 2/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 317ms/step - accuracy: 0.1615 - loss: 8.3548
2024-03-10 04:24:37.038815: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:37.038866: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:37.039859: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:37.039936: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 316ms/step - accuracy: 0.1601 - loss: 8.1802
Epoch 3/100
24/25 ━━━━━━━━━━━━━━━━━━━━ 0s 310ms/step - accuracy: 0.2835 - loss: 7.4143
2024-03-10 04:24:44.885256: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.885302: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
2024-03-10 04:24:44.896965: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
2024-03-10 04:24:44.897036: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
         [[{{node MultiDeviceIteratorGetNextFromShard}}]]
         [[RemoteCall]]
25/25 ━━━━━━━━━━━━━━━━━━━━ 8s 307ms/step - accuracy: 0.2795 - loss: 7.2638
Epoch 4/100
19/25 ━━━━━━━━━━━━━━━━━━━━ 1s 309ms/step - accuracy: 0.3960 - loss: 6.7741
Traceback (most recent call last):
  File "/src/tflow.py", line 81, in <module>
    parallel_model.fit(x_train, y_train, epochs=100, batch_size=2048)#7168)#7168)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 118, in error_handler
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 323, in fit
    logs = self.train_function(iterator)
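
For context, a minimal sketch of this kind of run is below, assuming a CIFAR-100 Keras model trained under tf.distribute.MirroredStrategy; the model choice (ResNet50), variable names, and tf.data pipeline are illustrative and not the contents of the actual /src/tflow.py.

import tensorflow as tf
from tensorflow import keras

# MirroredStrategy replicates the model across all visible GPUs (two here).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# CIFAR-100, matching the dataset downloaded in the log above.
(x_train, y_train), _ = keras.datasets.cifar100.load_data()
x_train = x_train.astype("float32") / 255.0

with strategy.scope():
    # Illustrative model; weights=None allows the 32x32x3 input shape.
    parallel_model = keras.applications.ResNet50(
        weights=None, input_shape=(32, 32, 3), classes=100)
    parallel_model.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])

# Feeding a repeated tf.data.Dataset with an explicit steps_per_epoch avoids
# the per-epoch iterator running off the end of the data on each replica.
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10_000)
           .batch(2048)
           .repeat())
parallel_model.fit(dataset, epochs=100, steps_per_epoch=len(x_train) // 2048)

The OUT_OF_RANGE rendezvous warnings in the log are commonly just the per-replica input iterator reporting end of sequence at the end of each epoch when numpy arrays are passed directly to fit() under a distribution strategy; the repeated dataset plus steps_per_epoch pattern in the sketch is one way to avoid that log noise.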