broadinstitute / neural-profiling

Maximum batch sizes #6

Open michaelbornholdt opened 3 years ago

michaelbornholdt commented 3 years ago

    "model": {
        "name": "efficientnet",
        "crop_generator": "sampled_crop_generator",
        "metrics": ["accuracy", "top_k"],
        "epochs": 2,
        "initialization": "ImageNet",
        "params": {
            "learning_rate": 0.005,
            "batch_size": 64,
            "conv_blocks": 0,
            "feature_dim": 256,
            "pooling": "avg"
        },
        "lr_schedule": "cosine"
    },
    "sampling": {
        "factor": 1,
        "workers": 4,
        "cache_size": 10000
    },
    "validation": {
        "frequency": 1,
        "top_k": 5,
        "batch_size": 40,
        "frame": "val",
        "sample_first_crops": true

michaelbornholdt commented 3 years ago

Profiling with a batch size of 128 fails with an out-of-memory error:

    "profile": {
      "feature_layer": "Compound",
      "checkpoint": "checkpoint_0010.hdf5",
      "batch_size": 128
    }
}
deepprofiler/__main__.py:180: DtypeWarning: Columns (12) have mixed types.Specify dtype option on import or set low_memory=False.
  dset = deepprofiler.dataset.image_dataset.read_dataset(context.obj["config"], mode='profile')
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:375: UserWarning: The `lr` argument is deprecated, use `l$
  "The `lr` argument is deprecated, use `learning_rate` instead.")
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/utils/generic_utils.py:497: CustomMaskWarning: Custom mask layers require a config and$
  category=CustomMaskWarning)
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "deepprofiler/__main__.py", line 197, in <module>
    cli(obj={})
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "deepprofiler/__main__.py", line 181, in profile
    deepprofiler.learning.profiling.profile(context.obj["config"], dset)
  File "/DeepProfiler/deepprofiler/learning/profiling.py", line 105, in profile
    profile.configure()
  File "/DeepProfiler/deepprofiler/learning/profiling.py", line 35, in configure
    self.profile_crop_generator.start(K.get_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 742, in get_session
    session = _get_session(op_input_list)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/backend.py", line 714, in _get_session
    config=get_default_session_config())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1596, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: out of memory
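
Note that the traceback above fails inside `K.get_session()`, while the CUDA runtime is being initialized and before any batch is processed. That may mean the GPU was already occupied by another process, rather than that a batch size of 128 is too large in itself. A minimal sketch for checking this before launching a run, assuming `nvidia-smi` is available on the node; the helper names `parse_gpu_memory`, `query_gpu_memory` and `gpu_is_mostly_free`, and the 10% threshold, are made up for illustration and are not part of DeepProfiler:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse the CSV output of the nvidia-smi query used below.

    Returns one (used_mib, total_mib) tuple per GPU.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (in MiB) of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_gpu_memory(result.stdout)


def gpu_is_mostly_free(used_mib, total_mib, max_used_fraction=0.1):
    """Hypothetical rule of thumb: at most 10% of the memory is already in use."""
    return used_mib <= max_used_fraction * total_mib
```

Calling `query_gpu_memory()` right before starting the profiling job, and checking each tuple with `gpu_is_mostly_free`, would show whether the card is already full at that point.
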
michaelbornholdt commented 3 years ago

Profiling again with "batch_size" set to 32, and then to 64:

    "profile": {
      "feature_layer": "Compound",
      "checkpoint": "checkpoint_0010.hdf5",
      "batch_size": 32
    }
}

Matplotlib created a temporary config/cache directory at /var/lib/condor/execute/slot1/dir_52011/matplotlib-4q3kc0vd because the default path (/.conf$
2021-08-17 20:03:10.420321: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.743367: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-08-17 20:03:16.768252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.768291: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:16.771330: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-17 20:03:16.771378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-08-17 20:03:16.772531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-08-17 20:03:16.772749: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-08-17 20:03:16.773586: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-08-17 20:03:16.774328: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-08-17 20:03:16.774506: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-17 20:03:16.775931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.776471: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network $
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-17 20:03:16.785075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-08-17 20:03:16.786631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-08-17 20:03:16.786737: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-17 20:03:17.342112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-17 20:03:17.342162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
2021-08-17 20:03:17.342172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
2021-08-17 20:03:17.344450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/devic$
2021-08-17 20:03:17.843576: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3200140000 Hz
2021-08-17 20:03:18.134139: I tensorflow/stream_executor/cuda/cuda_driver.cc:789] failed to allocate 174.69M (183173120 bytes) from device: CUDA_ERRO$
2021-08-17 20:03:36.615088: W tensorflow/core/common_runtime/bfc_allocator.cc:456] Allocator (GPU_0_bfc) ran out of memory trying to allocate 71.56Mi$
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-08-17 20:03:36.615281: I tensorflow/core/common_runtime/bfc_allocator.cc:991] BFCAllocator dump for GPU_0_bfc
2021-08-17 20:03:36.615311: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (256):   Total Chunks: 231, Chunks in use: 231. 57.8KiB alloca$
2021-08-17 20:03:36.615323: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (512):   Total Chunks: 77, Chunks in use: 76. 47.8KiB allocate$
2021-08-17 20:03:36.615333: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1024):  Total Chunks: 39, Chunks in use: 38. 44.5KiB allocate$
2021-08-17 20:03:36.615343: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2048):  Total Chunks: 73, Chunks in use: 72. 183.8KiB allocat$
2021-08-17 20:03:36.615389: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4096):  Total Chunks: 53, Chunks in use: 50. 252.2KiB allocat$
2021-08-17 20:03:36.615399: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8192):  Total Chunks: 27, Chunks in use: 20. 298.2KiB allocat$
2021-08-17 20:03:36.615441: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16384):         Total Chunks: 12, Chunks in use: 8. 241.8KiB $
2021-08-17 20:03:36.615453: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (32768):         Total Chunks: 26, Chunks in use: 22. 1.00MiB $
2021-08-17 20:03:36.615485: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (65536):         Total Chunks: 28, Chunks in use: 26. 2.24MiB $
2021-08-17 20:03:36.615495: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (131072):        Total Chunks: 29, Chunks in use: 28. 5.46MiB $
2021-08-17 20:03:36.615504: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (262144):        Total Chunks: 15, Chunks in use: 12. 4.91MiB $
2021-08-17 20:03:36.615513: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (524288):        Total Chunks: 19, Chunks in use: 14. 15.88MiB$
2021-08-17 20:03:36.615522: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (1048576):   Total Chunks: 6, Chunks in use: 4. 8.77MiB al$
2021-08-17 20:03:36.615532: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (2097152):   Total Chunks: 5, Chunks in use: 2. 13.76MiB a$
2021-08-17 20:03:36.615541: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615550: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (8388608):   Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615558: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (16777216):  Total Chunks: 1, Chunks in use: 0. 22.25MiB a$
2021-08-17 20:03:36.615567: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615605: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (67108864):  Total Chunks: 1, Chunks in use: 1. 81.84MiB a$
2021-08-17 20:03:36.615616: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (134217728):     Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615652: I tensorflow/core/common_runtime/bfc_allocator.cc:998] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocat$
2021-08-17 20:03:36.615665: I tensorflow/core/common_runtime/bfc_allocator.cc:1014] Bin for 71.56MiB was 64.00MiB, Chunk State:
2021-08-17 20:03:36.615702: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Next region of size 164855808
2021-08-17 20:03:36.615722: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000000 of size 1280 by op ScratchBuffer action_cou$
2021-08-17 20:03:36.615752: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000500 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615761: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000600 of size 256 by op Compound/kernel/Initializ$
2021-08-17 20:03:36.615770: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] InUse at 7f4f7a000700 of size 2048 by op Com