IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #50

Closed GF-Huang closed 3 years ago

GF-Huang commented 3 years ago

Machine: 8 vCPU 52 GB RAM + NVIDIA Tesla T4 16 GB

Jupyter-Lab logs:

(base) [root@instance-1 ~]# jupyter-lab --allow-root
[I 14:11:24.779 LabApp] JupyterLab extension loaded from /root/miniconda3/lib/python3.7/site-packages/jupyterlab
[I 14:11:24.779 LabApp] JupyterLab application directory is /root/miniconda3/share/jupyter/lab
[I 14:11:24.781 LabApp] Serving notebooks from local directory: /root/notebooks
[I 14:11:24.781 LabApp] Jupyter Notebook 6.1.4 is running at:
[I 14:11:24.781 LabApp] http://localhost:8888/
[I 14:11:24.781 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 14:11:24.786 LabApp] No web browser found: could not locate runnable browser.
[W 14:11:30.892 LabApp] Could not determine jupyterlab build status without nodejs
[I 14:11:34.581 LabApp] Kernel started: c0cdd4a0-9475-4ca7-9170-474f4ac33a90, name: python3
[I 14:11:37.552 LabApp] Starting buffering for c0cdd4a0-9475-4ca7-9170-474f4ac33a90:38c12e51-d680-4c29-9d5f-13479fe17be2
[I 14:11:38.445 LabApp] Kernel restarted: c0cdd4a0-9475-4ca7-9170-474f4ac33a90
2020-12-11 14:11:39.675971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-12-11 14:11:52.096582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-11 14:11:52.781050: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.781755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1564] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-12-11 14:11:52.781793: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-12-11 14:11:52.784642: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-12-11 14:11:52.787569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-12-11 14:11:52.788044: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-12-11 14:11:52.790863: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-12-11 14:11:52.792067: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-12-11 14:11:52.797548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-11 14:11:52.797784: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.798540: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.799154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1706] Adding visible gpu devices: 0
2020-12-11 14:11:52.799938: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-12-11 14:11:52.808151: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2299995000 Hz
2020-12-11 14:11:52.808896: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5592b7176d80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-11 14:11:52.808933: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-12-11 14:11:52.809563: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.810220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1564] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-12-11 14:11:52.810256: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-12-11 14:11:52.810279: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-12-11 14:11:52.810293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-12-11 14:11:52.810306: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-12-11 14:11:52.810319: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-12-11 14:11:52.810332: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-12-11 14:11:52.810346: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-11 14:11:52.810430: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.811104: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:52.811688: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1706] Adding visible gpu devices: 0
2020-12-11 14:11:52.811722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-12-11 14:11:53.427204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-11 14:11:53.427249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1111]      0 
2020-12-11 14:11:53.427262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] 0:   N 
2020-12-11 14:11:53.427524: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:53.428480: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:53.429272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-11 14:11:53.430287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1250] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13996 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-12-11 14:11:53.432421: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5592c8d29cf0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-12-11 14:11:53.432446: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
[I 14:13:34.942 LabApp] Saving file at /Untitled.ipynb
2020-12-11 15:33:27.426434: W tensorflow/core/common_runtime/bfc_allocator.cc:685] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.92GiB (rounded to 8500515328)
Current allocation summary follows.
2020-12-11 15:33:27.426527: I tensorflow/core/common_runtime/bfc_allocator.cc:1199] BFCAllocator dump for GPU_0_bfc
2020-12-11 15:33:27.426540: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (256):  Total Chunks: 3, Chunks in use: 3. 768B allocated for chunks. 768B in use in bin. 16B client-requested in use in bin.
2020-12-11 15:33:27.426549: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (512):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426559: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (1024):         Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2020-12-11 15:33:27.426568: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (2048):         Total Chunks: 1, Chunks in use: 0. 3.2KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426577: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (4096):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426597: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (8192):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426605: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (16384):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426616: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (32768):        Total Chunks: 10, Chunks in use: 10. 637.5KiB allocated for chunks. 637.5KiB in use in bin. 636.6KiB client-requested in use in bin.
2020-12-11 15:33:27.426625: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (65536):        Total Chunks: 1, Chunks in use: 1. 124.0KiB allocated for chunks. 124.0KiB in use in bin. 63.7KiB client-requested in use in bin.
2020-12-11 15:33:27.426698: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (131072):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426707: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (262144):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426715: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (524288):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426724: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (1048576):      Total Chunks: 1, Chunks in use: 0. 1.56MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426733: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (2097152):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426741: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (4194304):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426750: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (8388608):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426758: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426767: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426775: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426787: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (134217728):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-12-11 15:33:27.426797: I tensorflow/core/common_runtime/bfc_allocator.cc:1206] Bin (268435456):    Total Chunks: 11, Chunks in use: 8. 13.67GiB allocated for chunks. 7.92GiB in use in bin. 7.92GiB client-requested in use in bin.
2020-12-11 15:33:27.426806: I tensorflow/core/common_runtime/bfc_allocator.cc:1222] Bin for 7.92GiB was 256.00MiB, Chunk State: 
2020-12-11 15:33:27.426829: I tensorflow/core/common_runtime/bfc_allocator.cc:1228]   Size: 821.09MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 63.8KiB | Requested Size: 63.7KiB | in_use: 1 | bin_num: -1
2020-12-11 15:33:27.426843: I tensorflow/core/common_runtime/bfc_allocator.cc:1228]   Size: 1.98GiB | Requested Size: 1013.28MiB | in_use: 0 | bin_num: 20, prev:   Size: 1013.28MiB | Requested Size: 1013.28MiB | in_use: 1 | bin_num: -1, next:   Size: 1013.34MiB | Requested Size: 1013.28MiB | in_use: 1 | bin_num: -1
2020-12-11 15:33:27.426855: I tensorflow/core/common_runtime/bfc_allocator.cc:1228]   Size: 2.97GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 1013.28MiB | Requested Size: 1013.28MiB | in_use: 1 | bin_num: -1, next:   Size: 1013.28MiB | Requested Size: 1013.28MiB | in_use: 1 | bin_num: -1
2020-12-11 15:33:27.426863: I tensorflow/core/common_runtime/bfc_allocator.cc:1235] Next region of size 14675944448
2020-12-11 15:33:27.426874: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa218000000 of size 1280 next 1
2020-12-11 15:33:27.426881: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa218000500 of size 1062499328 next 44
2020-12-11 15:33:27.426906: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] Free  at 7fa257547900 of size 3187498240 next 4
2020-12-11 15:33:27.426913: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa31551d600 of size 1062499328 next 37
2020-12-11 15:33:27.426920: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa354a64a00 of size 1062499328 next 38
2020-12-11 15:33:27.426927: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa393fabe00 of size 1062499328 next 39
2020-12-11 15:33:27.426934: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa3d34f3200 of size 1062499328 next 6
2020-12-11 15:33:27.426940: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa412a3a600 of size 1062499328 next 40
2020-12-11 15:33:27.426947: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa451f81a00 of size 1062499328 next 41
2020-12-11 15:33:27.426954: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] Free  at 7fa4914c8e00 of size 2124998912 next 8
2020-12-11 15:33:27.426960: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa50ff57700 of size 1062564608 next 10
2020-12-11 15:33:27.426967: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] Free  at 7fa54f4aea00 of size 3328 next 26
2020-12-11 15:33:27.426974: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4af700 of size 256 next 27
2020-12-11 15:33:27.426981: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4af800 of size 65280 next 28
2020-12-11 15:33:27.426987: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4bf700 of size 256 next 29
2020-12-11 15:33:27.426994: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4bf800 of size 256 next 30
2020-12-11 15:33:27.427000: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4bf900 of size 65280 next 31
2020-12-11 15:33:27.427006: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f4cf800 of size 126976 next 14
2020-12-11 15:33:27.427013: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] Free  at 7fa54f4ee800 of size 1631232 next 43
2020-12-11 15:33:27.427019: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f67cc00 of size 65280 next 42
2020-12-11 15:33:27.427026: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f68cb00 of size 65280 next 45
2020-12-11 15:33:27.427032: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f69ca00 of size 65280 next 46
2020-12-11 15:33:27.427038: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f6ac900 of size 65280 next 47
2020-12-11 15:33:27.427044: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f6bc800 of size 65280 next 48
2020-12-11 15:33:27.427051: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f6cc700 of size 65280 next 49
2020-12-11 15:33:27.427057: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f6dc600 of size 65280 next 50
2020-12-11 15:33:27.427659: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] InUse at 7fa54f6ec500 of size 65280 next 51
2020-12-11 15:33:27.427686: I tensorflow/core/common_runtime/bfc_allocator.cc:1255] Free  at 7fa54f6fc400 of size 860971008 next 18446744073709551615
2020-12-11 15:33:27.427693: I tensorflow/core/common_runtime/bfc_allocator.cc:1260]      Summary of in-use Chunks by size: 
2020-12-11 15:33:27.427703: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 3 Chunks of size 256 totalling 768B
2020-12-11 15:33:27.427713: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 1 Chunks of size 1280 totalling 1.2KiB
2020-12-11 15:33:27.427723: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 10 Chunks of size 65280 totalling 637.5KiB
2020-12-11 15:33:27.427730: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 1 Chunks of size 126976 totalling 124.0KiB
2020-12-11 15:33:27.427738: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 7 Chunks of size 1062499328 totalling 6.93GiB
2020-12-11 15:33:27.427746: I tensorflow/core/common_runtime/bfc_allocator.cc:1263] 1 Chunks of size 1062564608 totalling 1013.34MiB
2020-12-11 15:33:27.427752: I tensorflow/core/common_runtime/bfc_allocator.cc:1267] Sum Total of in-use chunks: 7.92GiB
2020-12-11 15:33:27.427759: I tensorflow/core/common_runtime/bfc_allocator.cc:1269] total_region_allocated_bytes_: 14675944448 memory_limit_: 14675944448 available bytes: 0 curr_region_allocation_bytes_: 29351888896
2020-12-11 15:33:27.427774: I tensorflow/core/common_runtime/bfc_allocator.cc:1275] Stats: 
Limit:                         14675944448
InUse:                          8500841728
MaxInUse:                      13813929216
NumAllocs:                             120
MaxAllocSize:                   4249997312
BytesInactive:                           0
BytesActive:                    8500841728
PeakBytesActive:               12750059192
TotalBytesReclaimed:           37188580912
CurBytesReclaimed:             25501089536
NumSingleReclaims:                      11
NumFullReclaims:                         1

2020-12-11 15:33:27.427795: W tensorflow/core/common_runtime/bfc_allocator.cc:690] ********____________________*********************************************_____________*********_____
2020-12-11 15:33:27.428397: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-11 15:33:27.581919: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-12-11 15:33:27.581993: W tensorflow/core/framework/op_kernel.cc:1774] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2020-12-11 15:33:27.582021: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-12-11 15:33:27.582068: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
GF-Huang commented 3 years ago

Does this mean the RAM is still not enough?

smatzek commented 3 years ago

The short answer is that, at some point in its processing, the neural net you are trying to train requires more than your T4's 16GB of GPU memory.

The long answer is:

To free up GPU memory, Large Model Support swaps out "inactive" tensors: tensors that are not needed for the current operation and are not otherwise tagged as active by the current TensorFlow execution context.

I will use snippets from the log file above to describe what is happening in your specific case. TensorFlow is requesting 7.92GiB of memory for something, likely a tensor:

2020-12-11 15:33:27.426434: W tensorflow/core/common_runtime/bfc_allocator.cc:685] Allocator (GPU_0_bfc) ran out of memory trying to allocate 7.92GiB (rounded to 8500515328)
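As a quick sanity check on that number (plain arithmetic, not part of TFLMS or this repository), the rounded byte count in the warning is exactly the 7.92GiB being reported:

rounded_bytes = 8500515328        # from the bfc_allocator warning quoted above
print(rounded_bytes / 2**30)      # ~7.92 -> the "7.92GiB" request, expressed in GiB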

At this point your GPU has ~8.5GiB in use, and those memory chunks are marked as "active": required to reside on the GPU and ineligible to be swapped to system memory:

InUse:                          8500841728
...
BytesActive:                    8500841728

At this point LMS has already swapped out about 25GB to your system memory:

CurBytesReclaimed:             25501089536

but unfortunately there are no inactive tensor bytes left to swap out to make room for the 7.92GiB request:

BytesInactive:                           0
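
For reference, converting the raw counters from the Stats dump above into human-readable units (plain arithmetic on the values in this log, nothing TFLMS-specific) gives:

GiB = 2**30
print(8500841728 / GiB)     # InUse / BytesActive -> ~7.92 GiB pinned by active tensors
print(25501089536 / GiB)    # CurBytesReclaimed   -> ~23.7 GiB (about 25 GB) already swapped to host RAM
print(14675944448 / GiB)    # Limit               -> ~13.67 GiB of usable GPU memory on the T4
# BytesInactive is 0, so there is nothing left that LMS is allowed to swap out.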

GF-Huang commented 3 years ago

So my model cannot be trained even though I have a large amount of RAM (52 GB)?

jayfurmanek commented 3 years ago

Well, 52GB is actually relatively small for this type of work. We typically recommend 64-128GB minimum in systems running TensorFlow. When using LMS you can consume much more, as shown above.

smatzek commented 3 years ago

The limitation here is not the 52GB of system memory, but rather the 16GB of GPU memory. LMS will move inactive tensors to system memory, that is, tensors that are not required for the operation that is about to run. Ultimately, what is active vs. inactive, and thus eligible to be swapped out, is determined by operation context scope in TensorFlow.

In this case, the base amount of memory required for whatever operation is about to run is greater than the GPU's maximum.

To say it another way, for a given operation on the GPU, there must be enough memory to allow both the inputs and outputs of the operation to reside on the GPU.

In this case, the operation requires more memory than the GPU has.

At the point in time this failed, LMS had already swapped 25GB to system memory to free up space for the operations that preceded this one, but it still comes down to needing enough GPU memory to hold both the inputs and outputs of the current operation.
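
To make that concrete with the numbers from this log (a rough back-of-the-envelope estimate, since the actual model is not shown in this issue), the active tensors plus the requested allocation exceed what the allocator can ever provide:

GiB = 2**30
bytes_active = 8500841728      # BytesActive from the Stats dump: pinned inputs of the current op
requested    = 8500515328      # the failed 7.92GiB allocation, presumably the op's output
gpu_limit    = 14675944448     # the BFC allocator's limit on the 16GB T4
print((bytes_active + requested) / GiB)   # ~15.8 GiB needed simultaneously
print(gpu_limit / GiB)                    # ~13.7 GiB available, so no amount of swapping helps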

GF-Huang commented 3 years ago

Does that mean that even if I increase the machine's RAM (NOT the GPU memory), it won't help?

jayfurmanek commented 3 years ago

Ah, correct, that's likely the problem here. The GPU will always need enough memory to complete a single operation and generate its resultant tensor. LMS doesn't break down operations to ensure they fit; it just swaps resultant tensors to main memory so there is enough space to work with bigger ones.
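
As a hypothetical illustration only (the model in this issue is not shown; the failing kernel in the log is a cuDNN RNN op), the usual way to shrink a single op's footprint in Keras is to reduce the batch size or sequence length the op sees, since its activation memory scales with both:

import tensorflow as tf

# Toy stand-in model; the layer sizes and input shape are made up for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(512, input_shape=(1000, 64)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Halving batch_size roughly halves the activation tensor each RNN op must produce,
# which is what has to fit on the GPU alongside the op's inputs.
# model.fit(x_train, y_train, batch_size=32)   # e.g. instead of 64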

GF-Huang commented 3 years ago

Got it. Thanks all.