Open michaelbornholdt opened 3 years ago
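Profiling with the following config fails with a CUDA out-of-memory error as soon as the TensorFlow session is created: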
"profile": {
"feature_layer": "Compound",
"checkpoint": "checkpoint_0010.hdf5",
"batch_size": 128
}
}
deepprofiler/__main__.py:180: DtypeWarning: Columns (12) have mixed types.Specify dtype option on import or set low_memory=False.
dset = deepprofiler.dataset.image_dataset.read_dataset(context.obj["config"], mode='profile')
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:375: UserWarning: The `lr` argument is deprecated, use `l$
"The `lr` argument is deprecated, use `learning_rate` instead.")
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and$
category=CustomMaskWarning)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "deepprofiler/__main__.py", line 197, in <module>
cli(obj={})
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "deepprofiler/__main__.py", line 181, in profile
deepprofiler.learning.profiling.profile(context.obj["config"], dset)
File "/DeepProfiler/deepprofiler/learning/profiling.py", line 105, in profile
profile.configure()
File "/DeepProfiler/deepprofiler/learning/profiling.py", line 35, in configure
self.profile_crop_generator.start(K.get_session())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 742, in get_session
session = _get_session(op_input_list)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 714, in _get_session
config=get_default_session_config())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1596, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 711, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
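One thing that may be worth trying for the failure above, as a minimal sketch rather than a confirmed fix: TensorFlow reads the `TF_FORCE_GPU_ALLOW_GROWTH` and `TF_GPU_ALLOCATOR` environment variables (the second one is what the allocator warning in the log further down suggests), so both can be set before TensorFlow is imported. The helper names `gpu_memory_env` and `apply_gpu_memory_env` are hypothetical and not part of DeepProfiler. If another process is already holding the GPU memory, this will probably not help.

```python
import os


def gpu_memory_env(allow_growth=True, async_allocator=False):
    """Build the environment variables that adjust TensorFlow GPU memory use.

    Hypothetical helper, not part of DeepProfiler.
    """
    env = {}
    if allow_growth:
        # Ask TensorFlow to grow GPU memory on demand instead of
        # reserving almost all of it when the session is created.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    if async_allocator:
        # Allocator named in the TensorFlow warning about fragmentation.
        env["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
    return env


def apply_gpu_memory_env(allow_growth=True, async_allocator=False):
    """Write the variables into os.environ; call before importing tensorflow."""
    env = gpu_memory_env(allow_growth, async_allocator)
    os.environ.update(env)
    return env
```

Calling `apply_gpu_memory_env()` would have to happen before the first `import tensorflow` in the process; exporting the same variables in the job's shell environment should have the same effect.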
Profiling with "batch_size" set to 32 and to 64 (otherwise the same "profile" settings: "feature_layer": "Compound", "checkpoint": "checkpoint_0010.hdf5") gives this log instead:
Matplotlib created a temporary config/cache directory at /var/lib/condor/execute/slot1/dir_52011/matplotlib-4q3kc0vd because the default path (/.conf$
2021-08-17 20:03:10.420321: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.743367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-17 20:03:16.768252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.768291: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.771330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-17 20:03:16.771378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-17 20:03:16.772531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-17 20:03:16.772749: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-17 20:03:16.773586: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-17 20:03:16.774328: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-17 20:03:16.774506: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-17 20:03:16.775931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.776471: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network $
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-17 20:03:16.785075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.786631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.786737: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:17.342112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-17 20:03:17.342162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-08-17 20:03:17.342172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-08-17 20:03:17.344450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/devic$
2021-08-17 20:03:17.843576: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3200140000 Hz
2021-08-17 20:03:18.134139: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 174.69M (183173120 bytes) from device: CUDA_ERRO$
2021-08-17 20:03:36.615088: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 71.56Mi$
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-08-17 20:03:36.615281: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for GPU_0_bfc
2021-08-17 20:03:36.615311: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256): Total Chunks: 231, Chunks in use: 231. 57.8KiB alloca$
2021-08-17 20:03:36.615323: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512): Total Chunks: 77, Chunks in use: 76. 47.8KiB allocate$
2021-08-17 20:03:36.615333: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024): Total Chunks: 39, Chunks in use: 38. 44.5KiB allocate$
2021-08-17 20:03:36.615343: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048): Total Chunks: 73, Chunks in use: 72. 183.8KiB allocat$
2021-08-17 20:03:36.615389: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096): Total Chunks: 53, Chunks in use: 50. 252.2KiB allocat$
2021-08-17 20:03:36.615399: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192): Total Chunks: 27, Chunks in use: 20. 298.2KiB allocat$
2021-08-17 20:03:36.615441: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384): Total Chunks: 12, Chunks in use: 8. 241.8KiB $
2021-08-17 20:03:36.615453: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768): Total Chunks: 26, Chunks in use: 22. 1.00MiB $
2021-08-17 20:03:36.615485: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536): Total Chunks: 28, Chunks in use: 26. 2.24MiB $
2021-08-17 20:03:36.615495: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072): Total Chunks: 29, Chunks in use: 28. 5.46MiB $
2021-08-17 20:03:36.615504: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144): Total Chunks: 15, Chunks in use: 12. 4.91MiB $
2021-08-17 20:03:36.615513: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288): Total Chunks: 19, Chunks in use: 14. 15.88MiB$
2021-08-17 20:03:36.615522: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576): Total Chunks: 6, Chunks in use: 4. 8.77MiB al$
2021-08-17 20:03:36.615532: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152): Total Chunks: 5, Chunks in use: 2. 13.76MiB a$
2021-08-17 20:03:36.615541: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615550: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615558: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216): Total Chunks: 1, Chunks in use: 0. 22.25MiB a$
2021-08-17 20:03:36.615567: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615605: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864): Total Chunks: 1, Chunks in use: 1. 81.84MiB a$
2021-08-17 20:03:36.615616: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615652: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615665: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 71.56MiB was 64.00MiB, Chunk State:
2021-08-17 20:03:36.615702: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 164855808
2021-08-17 20:03:36.615722: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000000 of size 1280 by op ScratchBuffer action_cou$
2021-08-17 20:03:36.615752: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000500 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615761: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000600 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615770: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000700 of size 2048 by op Com