biigle / maia

BIIGLE module for the Machine Learning Assisted Image Annotation method
GNU General Public License v3.0

Fix novelty detection memory adjustment #64

Closed: mzur closed this issue 1 year ago

mzur commented 3 years ago

The method that adjusts the novelty detection algorithm to the available memory is flawed. It failed with memory exhaustion on a 3083 × 922 px image, which shouldn't happen. Review the method and fix it.

Maybe memory used during training isn't freed?

References #15
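
One way to reason about the adjustment is as a per-chunk memory budget. The following is a minimal sketch, not the code in `AutoencoderSaliencyDetector.py`: the helper `max_patches_per_chunk` and its `live_copies` and `safety` parameters are hypothetical. It assumes each patch is a flattened row of float32 values and that several tensors of the full chunk size (for example the extracted patches and their reconstruction) may be alive at the same time, so the budget is divided by more than one copy. Only the name `available_bytes` and the row width 18723 (from `shape[75150,18723]` in the log below) come from this issue.

```python
def max_patches_per_chunk(available_bytes, values_per_patch,
                          bytes_per_value=4, live_copies=3, safety=0.8):
    """Hypothetical helper: how many flattened patches fit into one chunk.

    available_bytes  -- memory the detector may use (name taken from the traceback)
    values_per_patch -- width of one flattened patch row
    bytes_per_value  -- 4 for float32
    live_copies      -- ASSUMPTION: number of chunk-sized tensors alive at once
    safety           -- ASSUMPTION: fraction of the budget to use, leaving
                        headroom for fragmentation and memory still held
                        from training
    """
    if available_bytes <= 0 or values_per_patch <= 0:
        raise ValueError("available_bytes and values_per_patch must be positive")
    if bytes_per_value <= 0 or live_copies <= 0:
        raise ValueError("bytes_per_value and live_copies must be positive")
    if not 0 < safety <= 1:
        raise ValueError("safety must be in (0, 1]")

    budget = int(available_bytes * safety)
    bytes_per_patch = values_per_patch * bytes_per_value * live_copies
    return budget // bytes_per_patch


# Illustrative numbers only: a 6 GB budget and the row width from the log.
patches = max_patches_per_chunk(6_000_000_000, 18723, live_copies=2)
print("patches per chunk:", patches)
```

If the budget passed in is the total device memory rather than what is still free after training, a calculation like this would overestimate the chunk size, which would fit the suspicion above.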

Error:

Error while executing python command '/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py /var/www/storage/maia_jobs/maia-459-novelty-detection/input.json':
WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:8: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:15: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:20: The name tf.losses.mean_squared_error is deprecated. Please use tf.compat.v1.losses.mean_squared_error instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/losses/losses_impl.py:121: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:23: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:25: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-03-04 22:40:16.464392: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-04 22:40:16.471287: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2349995000 Hz
2021-03-04 22:40:16.472406: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d3c6c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-04 22:40:16.472422: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-04 22:40:16.474157: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-04 22:40:17.068599: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.069437: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d775d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-04 22:40:17.069510: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2021-03-04 22:40:17.069898: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.071097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:07.0
2021-03-04 22:40:17.071445: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-04 22:40:17.073071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-03-04 22:40:17.074538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-03-04 22:40:17.074901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-03-04 22:40:17.076780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-03-04 22:40:17.078318: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-03-04 22:40:17.082471: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-04 22:40:17.082676: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.083376: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.083958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-03-04 22:40:17.084001: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-03-04 22:40:17.085117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-04 22:40:17.085133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2021-03-04 22:40:17.085139: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
2021-03-04 22:40:17.085262: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.085905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-03-04 22:40:17.086497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14257 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py:76: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py:23: The name tf.read_file is deprecated. Please use tf.io.read_file instead.

WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py:24: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
WARNING:tensorflow:From /var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py:36: calling extract_image_patches (from tensorflow.python.ops.array_ops) with ksizes is deprecated and will be removed in a future version.
Instructions for updating:
ksizes is deprecated, use sizes instead
Cluster 1 of 1
  Training
2021-03-04 22:40:21.250768: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Epoch: 0001 (33.31507)
Epoch: 0002 (3.83539)
Epoch: 0003 (3.84896)
Epoch: 0004 (3.39507)
Epoch: 0005 (2.23081)
Epoch: 0006 (9.98762)
Epoch: 0007 (2.68571)
Epoch: 0008 (2.33425)
Epoch: 0009 (2.14817)
Epoch: 0010 (2.03167)
Epoch: 0011 (2.00954)
Epoch: 0012 (1.76585)
Epoch: 0013 (1.73161)
Epoch: 0014 (1.59374)
Epoch: 0015 (1.53553)
Epoch: 0016 (1.47660)
Epoch: 0017 (1.48830)
Epoch: 0018 (1.43478)
Epoch: 0019 (1.49796)
Epoch: 0020 (1.48689)
Epoch: 0021 (1.48630)
Epoch: 0022 (1.46198)
Epoch: 0023 (1.42221)
Epoch: 0024 (1.42739)
Epoch: 0025 (1.33526)
Epoch: 0026 (1.36771)
Epoch: 0027 (1.34916)
Epoch: 0028 (1.36724)
Epoch: 0029 (1.24381)
Epoch: 0030 (1.30622)
Epoch: 0031 (1.29162)
Epoch: 0032 (1.20939)
Epoch: 0033 (1.64384)
Epoch: 0034 (1.22873)
Epoch: 0035 (1.22062)
Epoch: 0036 (1.44807)
Epoch: 0037 (1.52187)
Epoch: 0038 (1.21204)
Epoch: 0039 (1.18472)
Epoch: 0040 (1.14701)
Epoch: 0041 (1.21717)
Epoch: 0042 (1.15440)
Epoch: 0043 (1.23083)
Epoch: 0044 (1.20427)
Epoch: 0045 (1.62167)
Epoch: 0046 (1.10564)
Epoch: 0047 (1.28520)
Epoch: 0048 (1.29279)
Epoch: 0049 (1.18103)
Epoch: 0050 (1.27689)
Epoch: 0051 (1.23081)
Epoch: 0052 (1.24011)
Epoch: 0053 (1.20277)
Epoch: 0054 (1.28467)
Epoch: 0055 (1.26418)
Epoch: 0056 (1.45328)
Epoch: 0057 (1.18034)
Epoch: 0058 (1.37089)
Epoch: 0059 (1.37368)
Epoch: 0060 (1.17760)
Epoch: 0061 (1.16752)
Epoch: 0062 (1.31816)
Epoch: 0063 (1.20889)
Epoch: 0064 (1.19829)
Epoch: 0065 (1.15326)
Epoch: 0066 (1.44980)
Epoch: 0067 (1.27623)
Epoch: 0068 (1.10985)
Epoch: 0069 (1.21095)
Epoch: 0070 (1.39489)
Epoch: 0071 (1.34359)
Epoch: 0072 (1.26425)
Epoch: 0073 (1.15480)
Epoch: 0074 (1.26455)
Epoch: 0075 (1.23717)
Epoch: 0076 (1.46483)
Epoch: 0077 (1.19069)
Epoch: 0078 (1.16280)
Epoch: 0079 (1.26649)
Epoch: 0080 (1.16117)
Epoch: 0081 (1.55802)
Epoch: 0082 (1.42901)
Epoch: 0083 (1.27686)
Epoch: 0084 (1.16637)
Epoch: 0085 (1.21426)
Epoch: 0086 (1.22146)
Epoch: 0087 (1.33674)
Epoch: 0088 (1.16146)
Epoch: 0089 (1.25079)
Epoch: 0090 (1.44102)
Epoch: 0091 (1.26170)
Epoch: 0092 (1.20127)
Epoch: 0093 (1.28884)
Epoch: 0094 (1.35344)
Epoch: 0095 (1.31947)
Epoch: 0096 (1.31142)
Epoch: 0097 (1.31368)
Epoch: 0098 (1.20059)
Epoch: 0099 (1.41270)
Epoch: 0100 (1.23024)
  Image 1 of 1 (#1403576)
2021-03-04 22:44:03.000880: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 5.24GiB (rounded to 5628133888).  Current allocation summary follows.
2021-03-04 22:44:03.001008: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256):   Total Chunks: 20, Chunks in use: 20. 5.0KiB allocated for chunks. 5.0KiB in use in bin. 68B client-requested in use in bin.
2021-03-04 22:44:03.001025: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001040: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024):  Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2021-03-04 22:44:03.001054: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001070: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096):  Total Chunks: 7, Chunks in use: 7. 52.5KiB allocated for chunks. 52.5KiB in use in bin. 51.2KiB client-requested in use in bin.
2021-03-04 22:44:03.001083: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001097: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001112: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001130: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536):     Total Chunks: 7, Chunks in use: 7. 512.8KiB allocated for chunks. 512.8KiB in use in bin. 512.0KiB client-requested in use in bin.
2021-03-04 22:44:03.001146: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001162: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144):    Total Chunks: 1, Chunks in use: 0. 452.5KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001177: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001207: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001222: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001236: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001250: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001265: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001279: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001293: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-03-04 22:44:03.001309: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728):     Total Chunks: 10, Chunks in use: 10. 1.50GiB allocated for chunks. 1.50GiB in use in bin. 1.31GiB client-requested in use in bin.
2021-03-04 22:44:03.001325: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456):     Total Chunks: 6, Chunks in use: 4. 10.50GiB allocated for chunks. 6.27GiB in use in bin. 6.03GiB client-requested in use in bin.
2021-03-04 22:44:03.001340: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 5.24GiB was 256.00MiB, Chunk State: 
2021-03-04 22:44:03.001364: I tensorflow/core/common_runtime/bfc_allocator.cc:891]   Size: 1.48GiB | Requested Size: 32.53MiB | in_use: 0 | bin_num: 20, prev:   Size: 536.65MiB | Requested Size: 536.65MiB | in_use: 1 | bin_num: -1
2021-03-04 22:44:03.001382: I tensorflow/core/common_runtime/bfc_allocator.cc:891]   Size: 2.76GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 5.24GiB | Requested Size: 5.24GiB | in_use: 1 | bin_num: -1
2021-03-04 22:44:03.001393: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 8589934592
2021-03-04 22:44:03.001408: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f2e90000000 next 51 of size 5628133888
2021-03-04 22:44:03.001419: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f2fdf768200 next 18446744073709551615 of size 2961800704
2021-03-04 22:44:03.001429: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 2147483648
2021-03-04 22:44:03.001440: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f309c000000 next 49 of size 562723328
2021-03-04 22:44:03.001451: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f30bd8a7a00 next 18446744073709551615 of size 1584760320
2021-03-04 22:44:03.001462: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 1073741824
2021-03-04 22:44:03.001473: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3120000000 next 18 of size 140197888
2021-03-04 22:44:03.001483: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f31285b4000 next 19 of size 140197888
2021-03-04 22:44:03.001494: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3130b68000 next 20 of size 140197888
2021-03-04 22:44:03.001505: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f313911c000 next 21 of size 140197888
2021-03-04 22:44:03.001520: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f31416d0000 next 33 of size 140197888
2021-03-04 22:44:03.001531: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3149c84000 next 35 of size 140197888
2021-03-04 22:44:03.001543: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3152238000 next 18446744073709551615 of size 232554496
2021-03-04 22:44:03.001553: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 536870912
2021-03-04 22:44:03.001564: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3160000000 next 12 of size 140197888
2021-03-04 22:44:03.001575: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f31685b4000 next 14 of size 140197888
2021-03-04 22:44:03.001586: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3170b68000 next 18446744073709551615 of size 256475136
2021-03-04 22:44:03.001596: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 268435456
2021-03-04 22:44:03.001607: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3180000000 next 18446744073709551615 of size 268435456
2021-03-04 22:44:03.001618: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 268435456
2021-03-04 22:44:03.001628: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3190000000 next 18446744073709551615 of size 268435456
2021-03-04 22:44:03.001639: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 1048576
2021-03-04 22:44:03.001650: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235400000 next 1 of size 256
2021-03-04 22:44:03.001661: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235400100 next 2 of size 256
2021-03-04 22:44:03.001671: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235400200 next 3 of size 256
2021-03-04 22:44:03.001682: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235400300 next 4 of size 256
2021-03-04 22:44:03.001693: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235400400 next 5 of size 7680
2021-03-04 22:44:03.001704: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235402200 next 7 of size 75008
2021-03-04 22:44:03.001715: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235414700 next 9 of size 7680
2021-03-04 22:44:03.001726: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235416500 next 10 of size 75008
2021-03-04 22:44:03.001737: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235428a00 next 13 of size 1280
2021-03-04 22:44:03.001747: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235428f00 next 15 of size 256
2021-03-04 22:44:03.001758: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235429000 next 16 of size 7680
2021-03-04 22:44:03.001769: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323542ae00 next 22 of size 7680
2021-03-04 22:44:03.001779: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323542cc00 next 23 of size 7680
2021-03-04 22:44:03.001790: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323542ea00 next 24 of size 75008
2021-03-04 22:44:03.001801: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235440f00 next 25 of size 256
2021-03-04 22:44:03.001811: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235441000 next 26 of size 75008
2021-03-04 22:44:03.001822: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235453500 next 27 of size 75008
2021-03-04 22:44:03.001833: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235465a00 next 28 of size 256
2021-03-04 22:44:03.001843: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235465b00 next 29 of size 256
2021-03-04 22:44:03.001854: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235465c00 next 30 of size 256
2021-03-04 22:44:03.001869: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235465d00 next 31 of size 256
2021-03-04 22:44:03.001880: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235465e00 next 32 of size 7680
2021-03-04 22:44:03.001890: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f3235467c00 next 34 of size 75008
2021-03-04 22:44:03.001901: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323547a100 next 36 of size 7680
2021-03-04 22:44:03.001912: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323547bf00 next 37 of size 75008
2021-03-04 22:44:03.001922: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e400 next 39 of size 256
2021-03-04 22:44:03.001933: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e500 next 40 of size 256
2021-03-04 22:44:03.001944: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e600 next 41 of size 256
2021-03-04 22:44:03.001954: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e700 next 42 of size 256
2021-03-04 22:44:03.001965: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e800 next 43 of size 256
2021-03-04 22:44:03.001975: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548e900 next 44 of size 256
2021-03-04 22:44:03.001986: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548ea00 next 45 of size 256
2021-03-04 22:44:03.001996: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548eb00 next 46 of size 256
2021-03-04 22:44:03.002007: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548ec00 next 47 of size 256
2021-03-04 22:44:03.002017: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x7f323548ed00 next 48 of size 256
2021-03-04 22:44:03.002028: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x7f323548ee00 next 18446744073709551615 of size 463360
2021-03-04 22:44:03.002038: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size: 
2021-03-04 22:44:03.002052: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 20 Chunks of size 256 totalling 5.0KiB
2021-03-04 22:44:03.002064: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 1280 totalling 1.2KiB
2021-03-04 22:44:03.002076: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 7680 totalling 52.5KiB
2021-03-04 22:44:03.002088: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 7 Chunks of size 75008 totalling 512.8KiB
2021-03-04 22:44:03.002099: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 8 Chunks of size 140197888 totalling 1.04GiB
2021-03-04 22:44:03.002111: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 232554496 totalling 221.78MiB
2021-03-04 22:44:03.002123: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 256475136 totalling 244.59MiB
2021-03-04 22:44:03.002135: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 2 Chunks of size 268435456 totalling 512.00MiB
2021-03-04 22:44:03.002149: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 562723328 totalling 536.65MiB
2021-03-04 22:44:03.002162: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 5628133888 totalling 5.24GiB
2021-03-04 22:44:03.002176: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 7.77GiB
2021-03-04 22:44:03.002189: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 12885950464 memory_limit_: 14949928141 available bytes: 2063977677 curr_region_allocation_bytes_: 8589934592
2021-03-04 22:44:03.002207: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 14949928141
InUse:                  8338926080
MaxInUse:               8338926080
NumAllocs:                  163856
MaxAllocSize:           5628133888

2021-03-04 22:44:03.002228: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ********************************************______________________******___________*****************
2021-03-04 22:44:03.002271: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[75150,18723] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[75150,18723] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node MatMul_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[75150,18723] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node MatMul_3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Reshape_1/_109]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 130, in <module>
    runner.run()
  File "/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 78, in run
    threshold = self.process_cluster(detector, cluster)
  File "/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 92, in process_cluster
    saliency_map = detector.apply(image.path, available_bytes=self.available_bytes)
  File "/var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py", line 93, in apply
    self.chunk: chunk
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[75150,18723] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node MatMul_3 (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[75150,18723] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node MatMul_3 (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Reshape_1/_109]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'MatMul_3':
  File "var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 130, in <module>
    runner.run()
  File "var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 71, in run
    detector = AutoencoderSaliencyDetector(self.patch_size, stride=self.detector_stride, hidden=self.latent_size)
  File "var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py", line 42, in __init__
    self.element_wise_cost = self.autoencoder.element_wise_cost(self.reshape)
  File "var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py", line 84, in element_wise_cost
    encode, recon = self._plug(input_tensor)
  File "var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/Autoencoder.py", line 69, in _plug
    tf.add(tf.matmul(h, self.weights['recon'][layer]['w']),
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
    name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
mzur commented 1 year ago

Maybe this can be fixed as part of #96.
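
For reference, the figures in the allocator log above can be cross-checked with a few lines of arithmetic. This sketch only copies numbers from the log; the variable names are made up for illustration, and the rounding to a multiple of 256 bytes is an assumption that happens to reproduce the logged request size.

```python
# All constants are copied from the allocator log in this issue.
ROWS, COLS = 75150, 18723          # shape[75150,18723], type float
BYTES_PER_FLOAT32 = 4
LIMIT = 14949928141                # "Limit"
IN_USE = 8338926080                # "InUse"
FREE_CHUNKS = [2961800704, 1584760320, 463360]   # the three "Free at ..." lines
UNCLAIMED = 2063977677             # "available bytes" not yet claimed as a region


def round_up(n, multiple=256):
    """Round n up to the next multiple (ASSUMED allocation granularity)."""
    return -(-n // multiple) * multiple


requested = round_up(ROWS * COLS * BYTES_PER_FLOAT32)
total_free = LIMIT - IN_USE
largest_free = max(FREE_CHUNKS)

print("requested bytes:   ", requested)
print("total free bytes:  ", total_free)
print("largest free chunk:", largest_free)
print("fits in total free memory:", requested <= total_free)
print("fits in one free chunk or new region:",
      requested <= max(largest_free, UNCLAIMED))
```

By these numbers the request is smaller than the total free memory but larger than any single free chunk or any region that could still be claimed. Together with the 5.24 GiB chunk that is already in use, this suggests that two chunk-sized tensors were needed at once or that the memory was fragmented, rather than that the device was simply too small.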