Your GCP script - Githubissues

IEWbgfnYDwHRoRRSKtkdyMDUzgdwuBYgDKtDJWd commented 4 years ago

First of all, many thanks and appreciation for your repo. Your script for GCP setup worked like a charm.

The only issue is that when I try to train a new model using a custom dataset, it errors out about 20 minutes after the first tick. The initial fake samples also appear to be human faces (my dataset isn't faces). I'm unsure whether this is normal or whether I am doing something wrong.

dvschultz commented 4 years ago

Can you post the error you're getting?

Seeing faces in your first fakes is correct. This is using transfer learning (you can look it up on my YouTube page and learn more about the technique there). Those faces get erased completely after a handful of ticks.

ahmedshingaly commented 4 years ago

Thank you very much for the great tutorial. My GPU is running out of memory and failing at the beginning. I would appreciate it if you could take a look at the error below.

Here is my error:
`2020-05-20 11:18:44.904340: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB (rounded to 37748736). Current allocation summary follows.
2020-05-20 11:18:44.914558: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****x*x
2020-05-20 11:18:44.919358: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[3,3,512,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
    return fn(args)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,3,3,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_training.py", line 192, in main()
  File "run_training.py", line 187, in main
    run(vars(args))
  File "run_training.py", line 120, in run
    dnnlib.submit_run(kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\internal\local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 280, in run_wrapper
    run_func_obj(*submit_config.run_func_kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\training_loop.py", line 299, in training_loop
    tflib.run(G_train_op, feed_dict)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\tfutil.py", line 31, in run
    return tf.get_default_session().run(args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
    run_metadata_ptr)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
    run_metadata)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,3,3,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square (defined at C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py:104) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Errors may have originated from an input operation. Input Source operations connected to node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square: GPU0/G_loss/G/G_synthesis/128x128/Conv1/mul_3 (defined at C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py:100)

Original stack trace for 'GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square':
  File "run_training.py", line 192, in main()
  File "run_training.py", line 187, in main
    run(vars(args))
  File "run_training.py", line 120, in run
    dnnlib.submit_run(kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 343, in submit_run
    return farm.submit(submit_config, host_run_dir)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\internal\local.py", line 22, in submit
    return run_wrapper(submit_config)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 280, in run_wrapper
    run_func_obj(submit_config.run_func_kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\training_loop.py", line 220, in training_loop
    G_loss, G_reg = dnnlib.util.call_func_by_name(G=G_gpu, D=D_gpu, opt=G_opt, training_set=training_set, minibatch_size=minibatch_gpu_in, G_loss_args)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\util.py", line 256, in call_func_by_name
    return func_obj(*args, kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\loss.py", line 152, in G_logistic_ns_pathreg
    fake_images_out, fake_dlatents_out = G.get_output_for(latents, labels, is_training=True, return_dlatents=True)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\network.py", line 221, in get_output_for
    out_expr = self._build_func(final_inputs, build_kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 238, in G_main
    images_out = components.synthesis.get_output_for(dlatents, is_training=is_training, force_clean_graph=is_template_graph, kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\network.py", line 221, in get_output_for
    out_expr = self._build_func(final_inputs, build_kwargs)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 498, in G_synthesis_stylegan2
    x = block(x, res)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 470, in block
    x = layer(x, layer_idx=res2-4, fmaps=nf(res-1), kernel=3)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 455, in layer
    x = modulated_conv2d_layer(x, dlatents_in[:, layer_idx], fmaps=fmaps, kernel=kernel, up=up, resample_kernel=resample_kernel, fused_modconv=fused_modconv)
  File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 104, in modulated_conv2d_layer
    d = tf.rsqrt(tf.reduce_sum(tf.square(ww), axis=[1,2,3]) + 1e-8) # [BO] Scaling factor.
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 10698, in square
    "Square", x=x, name=name)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
    op_def=op_def)
  File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in init
    self._traceback = tf_stack.extract_stack()`

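The sizes in the log can be checked by hand. If "type float" means 4-byte float32 (an assumption, though it matches the numbers), the shape [3,3,512,4,512] from the first warning comes to exactly the 37748736 bytes (36.00MiB) the allocator reports. A minimal sketch of that arithmetic; the helper name `tensor_bytes` is made up for illustration:

```python
from functools import reduce
from operator import mul

def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    return reduce(mul, shape, 1) * bytes_per_element

# Shapes copied from the OOM messages in the log.
first_warning = tensor_bytes([3, 3, 512, 4, 512])
final_error = tensor_bytes([4, 3, 3, 128, 128])

print(first_warning, first_warning / 2**20)  # 37748736 36.0
print(final_error, final_error / 2**20)      # 2359296 2.25
```

The allocation that finally fails is only about 2.25 MiB, which suggests the GPU memory was already almost fully in use by then, rather than any single tensor being too large on its own.
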
dvschultz commented 4 years ago

Can you tell me what GPU you're using, the CUDA version you're running, and what NVIDIA driver is running?
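
One way to collect the GPU model and driver version in a single step is to query `nvidia-smi`, which is installed with the NVIDIA driver. This is only a rough sketch, assuming `nvidia-smi` is on the PATH; the helper names are made up for illustration. This query does not report the CUDA toolkit version, which can be checked separately with `nvcc --version`.

```python
import shutil
import subprocess

# Fields understood by `nvidia-smi --query-gpu`.
QUERY = "name,driver_version,memory.total"

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver, "memory": memory})
    return gpus

def query_gpus():
    """Return GPU info from nvidia-smi, or an empty list if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```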

ahmedshingaly commented 4 years ago

I use a workstation with one RTX 2080 Ti (11 GB) and 128 GB of RAM. The CUDA version is 10.0. Even if I reduce the batch size, it still fails for lack of space. My dataset is 500 pictures at 1024x1024. I tried a bigger dataset and a smaller dataset, and tried both PNG and JPG. I tried all available configs (a, b, c, d, e, f). I tried running it on another computer with a GTX 1060, and CUDA fails there too. I also tried a 512x512 image dataset and a 256x256 image dataset.

< ERROR CUDA RUN OUT OF MEMORY > Please note that I am trying to build my own models on building shapes, not human faces. Thank you in advance.

dvschultz / stylegan2

Your GCP script #2