continue training - Githubissues

wgfstar commented 3 years ago

Excuse me, how do I continue training from a saved model (for example, training has reached model 47)? What should I do?

TachibanaYoshino commented 3 years ago

Keep the name of the checkpoint model folder consistent with the corresponding training hyperparameters, then run the training script again. Training will resume from the saved checkpoint.
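
To make the advice above concrete, here is a minimal sketch, not the repository's actual code. It assumes the checkpoint folder name is simply the run settings joined with underscores (inferred only from the folder name that appears in the log later in this thread) and that checkpoint files are named `AnimeGANv2.model-N`. The helpers `checkpoint_dir_name` and `latest_step` are hypothetical names introduced for illustration.

```python
import re
from pathlib import Path


def checkpoint_dir_name(parts):
    """Join run settings into a folder name (assumed naming scheme)."""
    return "_".join(str(p) for p in parts)


def latest_step(ckpt_dir, prefix="AnimeGANv2.model"):
    """Return the highest N among files named '<prefix>-N...' in ckpt_dir.

    Returns None if the folder is missing or holds no matching files,
    which would mean training starts from scratch.
    """
    folder = Path(ckpt_dir)
    if not folder.is_dir():
        return None
    pattern = re.compile(re.escape(prefix) + r"-(\d+)")
    steps = []
    for entry in folder.iterdir():
        match = pattern.match(entry.name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# The settings below are copied from the folder name in the log, not from main.py.
name = checkpoint_dir_name(
    ["AnimeGANv2", "Chinese", "lsgan", 300, 300, 1, 3, 10, 1, "lite"]
)
print(name)
```

If any of the settings changes between runs, the generated folder name changes too, so the script would look in a different folder and would not find model 47.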

wgfstar commented 3 years ago

I have run main.py:

Instructions for updating: Use for ... in dataset: to iterate over a dataset. If using tf.estimator, return the Dataset object directly from your input function. As a last resort, you can use tf.compat.v1.data.make_one_shot_iterator(dataset).
[] Reading checkpoints...
[] Success to read checkpoint\AnimeGANv2_Chinese_lsgan_300_300_1_3_10_1_lite\AnimeGANv2.model-47
[*] Load SUCCESS
2021-02-20 16:52:53.231949: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2021-02-20 16:52:53.580937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[10,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node discriminator_3/conv_s1_1/Conv2D (defined at D:\anaconda\envs\tensorflow-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[add_16/_763]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[10,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node discriminator_3/conv_s1_1/Conv2D (defined at D:\anaconda\envs\tensorflow-gpu\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.

I can't continue training!

TachibanaYoshino commented 3 years ago

This is a graphics card memory overflow: the GPU does not have enough memory. Either the batch size is set too large, or the number of convolution kernels is set too high.
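
For a sense of scale, here is a small back-of-the-envelope sketch. It only assumes that "type float" in the error means 32-bit floats (4 bytes per element); `tensor_bytes` is a hypothetical helper, not part of TensorFlow or AnimeGANv2. It shows how the size of the single tensor named in the OOM message shrinks as the first (batch) dimension is reduced.

```python
from math import prod  # Python 3.8+

BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is float32


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory needed to hold one dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


# Shape from the error message: [batch, height, width, channels].
oom_shape = (10, 128, 128, 128)
print(tensor_bytes(oom_shape) / 2**20, "MiB")  # 80.0 MiB

# The same tensor at smaller batch sizes.
for batch in (4, 1):
    shape = (batch,) + oom_shape[1:]
    print(batch, tensor_bytes(shape) / 2**20, "MiB")
```

This is only one activation tensor. Training keeps many such tensors alive at once (for both networks and their gradients), so the total footprint is far larger than this one allocation, and the allocation that fails is just the one that no longer fits.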

TachibanaYoshino commented 3 years ago

I have just rewritten the lite model structure of the generator network, and it has achieved great results in my experiments.

wgfstar commented 3 years ago

I have reduced the batch size, but the error still occurs. Could you show me the rewritten network? I would appreciate it.

TachibanaYoshino commented 3 years ago

I have updated and committed generator_lite.py; you can check it directly. I also recommend that you watch this repository, so that you receive email notifications when other updates are committed. The error occurs because there is not enough GPU memory. If your GPU is also serving other applications, the remaining memory may not be enough for AnimeGANv2. In that case there is no need to keep adjusting the batch size or reducing the number of convolution kernels; switch to a GPU with more memory instead.
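
One way to check whether other applications are already occupying the GPU is to ask `nvidia-smi` before starting training. The sketch below is an illustration under assumptions: it requires an NVIDIA driver with `nvidia-smi` on the PATH, and the helper names `parse_memory_csv` and `query_gpu_memory` are hypothetical. The sample numbers in the comment are made up, not taken from this thread.

```python
import subprocess


def parse_memory_csv(text):
    """Parse lines of 'used, total' (MiB) into (used, total, free) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of each GPU, in MiB."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


# Example with made-up output for one GPU: 3500 MiB used out of 8192 MiB.
print(parse_memory_csv("3500, 8192\n"))

# On a machine with an NVIDIA GPU, call query_gpu_memory() instead.
```

If the free figure is already small before main.py starts, closing the other GPU applications may be worth trying before changing any training settings.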

wgfstar commented 3 years ago

OK, I understand. Thanks.

TachibanaYoshino / AnimeGANv2

continue training #22