deepfakes / faceswap-playground

User dedicated repo for the faceswap project
Unable to train (critical error) #270

Closed szwank closed 5 years ago

szwank commented 5 years ago

The training process crashes.

Actual behavior

python faceswap.py train -A data/Maciek_broda -B data/Harrison_Ford -m models/ -p --trainer villain --batch-size 8 --save-interval 20
03/26/2019 11:21:22 INFO Log level set to: INFO
Using TensorFlow backend.
03/26/2019 11:21:23 INFO Model A Directory: /home/szwank/Desktop/faceswap/data/Maciek_broda
03/26/2019 11:21:23 INFO Model B Directory: /home/szwank/Desktop/faceswap/data/Harrison_Ford
03/26/2019 11:21:23 INFO Training data directory: /home/szwank/Desktop/faceswap/models
03/26/2019 11:21:23 INFO =====================================================================
03/26/2019 11:21:23 INFO - Using live preview -
03/26/2019 11:21:23 INFO - Press 'ENTER' on the preview window to save and quit -
03/26/2019 11:21:23 INFO - Press 'S' on the preview window to save model weights immediately -
03/26/2019 11:21:23 INFO =====================================================================
03/26/2019 11:21:25 INFO Loading data, this may take a while...
03/26/2019 11:21:25 INFO Loading Model from Villain plugin...
03/26/2019 11:23:24 INFO Loading config: '/home/szwank/Desktop/faceswap/config/train.ini'
03/26/2019 11:23:25 WARNING No existing state file found. Generating.
03/26/2019 11:23:52 INFO Creating new 'villain' model in folder: '/home/szwank/Desktop/faceswap/models'
03/26/2019 11:24:18 INFO Loading Trainer from Original plugin...
03/26/2019 11:24:32 INFO Enabled TensorBoard Logging
03/26/2019 11:33:32 CRITICAL Error caught! Exiting...
03/26/2019 11:33:32 ERROR Caught exception in thread: 'training_0'
03/26/2019 11:33:35 ERROR Got Exception on main handler:
Traceback (most recent call last):
  File "/home/szwank/Desktop/faceswap/lib/cli.py", line 107, in execute_script
    process.process()
  File "/home/szwank/Desktop/faceswap/scripts/train.py", line 101, in process
    self.end_thread(thread, err)
  File "/home/szwank/Desktop/faceswap/scripts/train.py", line 126, in end_thread
    thread.join()
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 443, in join
    raise thread.err[1].with_traceback(thread.err[2])
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 381, in run
    self._target(*self._args, *self._kwargs)
  File "/home/szwank/Desktop/faceswap/scripts/train.py", line 152, in training
    raise err
  File "/home/szwank/Desktop/faceswap/scripts/train.py", line 142, in training
    self.run_training_cycle(model, trainer)
  File "/home/szwank/Desktop/faceswap/scripts/train.py", line 214, in run_training_cycle
    trainer.train_one_step(viewer, timelapse)
  File "/home/szwank/Desktop/faceswap/plugins/train/trainer/_base.py", line 139, in train_one_step
    loss[side] = batcher.train_one_batch(do_preview)
  File "/home/szwank/Desktop/faceswap/plugins/train/trainer/_base.py", line 214, in train_one_batch
    loss = self.model.predictors[self.side].train_on_batch(batch)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[{{node training_1/Adam/mul_311}} = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Adam/beta_1/read, training_1/Adam/Variable_62/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[{{node loss_1/mul/_1601}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7580_loss_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

03/26/2019 11:33:35 CRITICAL An unexpected crash has occurred. Crash report written to /home/szwank/Desktop/faceswap/crash_report.2019.03.26.113332875078.log. Please verify you are running the latest version of faceswap before reporting
^CTraceback (most recent call last):
  File "faceswap.py", line 36, in <module>
    ARGUMENTS.func(ARGUMENTS)
  File "/home/szwank/Desktop/faceswap/lib/cli.py", line 120, in execute_script
    safe_shutdown()
  File "/home/szwank/Desktop/faceswap/lib/utils.py", line 209, in safe_shutdown
    terminate_processes()
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 488, in terminate_processes
    process.join()
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 221, in join
    if self._result_tokens.get() is None:
  File "<string>", line 2, in get
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/logging/handlers.py", line 1476, in _monitor
    record = self.dequeue(True)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/logging/handlers.py", line 1425, in dequeue
    return self.queue.get(block)
  File "<string>", line 2, in get
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/managers.py", line 757, in _callmethod
    kind, result = conn.recv()
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Exception ignored in: <generator object TrainingDataGenerator.minibatch at 0x7f7810b276d0>
Traceback (most recent call last):
  File "/home/szwank/Desktop/faceswap/lib/training_data.py", line 135, in minibatch
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 43, in exit
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 35, in free
  File "/home/szwank/Desktop/faceswap/lib/multithreading.py", line 173, in free
  File "<string>", line 2, in put
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/managers.py", line 753, in _callmethod
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/managers.py", line 740, in _connect
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 487, in Client
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/multiprocessing/connection.py", line 614, in SocketClient
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x7f7810b394e0>>
Traceback (most recent call last):
  File "/home/szwank/.conda/envs/deepfake/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 738, in __del__
TypeError: 'NoneType' object is not callable

Other relevant information

Operating system = Ubuntu 18.04.02
Graphics card = GTX 1060 6GB
RAM = 8GB

------PIP packages------
Package Version
absl-py 0.7.0
astor 0.7.1
certifi 2019.3.9
Click 7.0
cloudpickle 0.8.0
cycler 0.10.0
cytoolz 0.9.0.1
dask 1.1.4
decorator 4.3.2
dlib 19.17.0
face-recognition 1.2.3
face-recognition-models 0.3.0
ffmpy 0.2.2
gast 0.2.2
google-images-download 2.5.0
grpcio 1.16.1
h5py 2.9.0
imageio 2.5.0
Keras 2.2.4
Keras-Applications 1.0.7
Keras-Preprocessing 1.0.9
kiwisolver 1.0.1
Markdown 3.0.1
matplotlib 2.2.2
mkl-fft 1.0.10
mkl-random 1.0.2
mock 2.0.0
networkx 2.2
numpy 1.15.4
nvidia-ml-py3 7.352.0
olefile 0.46
opencv-python 4.0.0.21
pathlib 1.0.1
pbr 5.1.3
Pillow 5.4.1
pip 19.0.3
protobuf 3.6.1
psutil 5.6.1
pyparsing 2.3.1
python-dateutil 2.8.0
pytz 2018.9
PyWavelets 1.0.2
PyYAML 3.13
scikit-image 0.14.2
scikit-learn 0.20.3
scipy 1.2.1
selenium 3.141.0
setuptools 40.8.0
six 1.12.0
tensorboard 1.12.2
tensorflow-estimator 1.13.0
tensorflow-gpu 1.12.0
termcolor 1.1.0
toolz 0.9.0
tornado 6.0.1
tqdm 4.31.1
urllib3 1.24.1
Werkzeug 0.14.1
wheel 0.33.1
-------------Conda packages-----------

# packages in environment at /home/szwank/.conda/envs/deepfake:
#
# Name Version Build Channel
_tflow_select 2.1.0 gpu
absl-py 0.7.0 py36_0
astor 0.7.1 py36_0
blas 1.0 mkl
bzip2 1.0.6 h14c3975_5
c-ares 1.15.0 h7b6447c_1
ca-certificates 2019.1.23 0
certifi 2019.3.9 py36_0
Click 7.0
cloudpickle 0.8.0 py36_0
cmake 3.12.2 h52cb24c_0
cycler 0.10.0 py36_0
cytoolz 0.9.0.1 py36h14c3975_1
dask-core 1.1.4 py_0
dbus 1.13.6 h746ee38_0
decorator 4.3.2 py36_0
dlib 19.17.99
dlib 19.17.0
expat 2.2.6 he6710b0_0
face-recognition 1.2.3
face-recognition-models 0.3.0
ffmpeg 4.1 h6dce934_1002 conda-forge
ffmpy 0.2.2
fontconfig 2.13.0 h9420a91_0
freetype 2.9.1 h8a8886c_1
gast 0.2.2 py36_0
glib 2.56.2 hd408876_0
gmp 6.1.2 hf484d3e_1000 conda-forge
gnutls 3.6.5 hd3a4fd2_1002 conda-forge
google-images-download 2.5.0
grpcio 1.16.1 py36hf8bcb03_1
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
h5py 2.9.0 py36h7918eee_0
hdf5 1.10.4 hb1b8bf9_0
icu 58.2 h9c2bf20_1
imageio 2.5.0 py36_0
intel-openmp 2019.1 144
jpeg 9b h024ee3a_2
keras-applications 1.0.7 py_0
keras-base 2.2.4 py36_0
keras-preprocessing 1.0.9 py_0
kiwisolver 1.0.1 py36hf484d3e_0
krb5 1.16.1 h173b8e3_7
lame 3.100 h14c3975_1001 conda-forge
libcurl 7.64.0 h20c2e04_2
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libiconv 1.15 h14c3975_1004 conda-forge
libpng 1.6.36 hbc83047_0
libprotobuf 3.6.1 hd408876_0
libssh2 1.8.0 h1ba5d50_4
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h2733197_2
libuuid 1.0.3 h1bed415_2
libxcb 1.13 h1bed415_1
libxml2 2.9.9 he19cac6_0
markdown 3.0.1 py36_0
matplotlib 2.2.2 py36hb69df0a_2
mkl 2019.1 144
mkl_fft 1.0.10 py36ha843d7b_0
mkl_random 1.0.2 py36hd81dba3_0
mock 2.0.0
ncurses 6.1 he6710b0_1
nettle 3.4.1 h1bed415_1002 conda-forge
networkx 2.2 py36_1
numpy 1.15.4 py36h7e9f1db_0
numpy-base 1.15.4 py36hde5b4d6_0
nvidia-ml-py3 7.352.0
olefile 0.46 py36_0
opencv-python 4.0.0.21
openh264 1.8.0 hdbcaa40_1000 conda-forge
openssl 1.1.1b h7b6447c_1
pathlib 1.0.1 py36_1
pbr 5.1.3
pcre 8.43 he6710b0_0
pillow 5.4.1 py36h34e0f95_0
pip 19.0.3 py36_0
protobuf 3.6.1 py36he6710b0_0
psutil 5.6.1 py36h7b6447c_0
pyparsing 2.3.1 py36_0
pyqt 5.9.2 py36h05f1152_2
python 3.6.8 h0371630_0
python-dateutil 2.8.0 py36_0
pytz 2018.9 py36_0
pywavelets 1.0.2 py36hdd07704_0
pyyaml 3.13 py36h14c3975_0
qt 5.9.7 h5867ecd_1
readline 7.0 h7b6447c_5
rhash 1.3.8 h1ba5d50_0
scikit-image 0.14.2 py36he6710b0_0
scikit-learn 0.20.3 py36hd81dba3_0
scipy 1.2.1 py36h7c811a0_0
selenium 3.141.0
setuptools 40.8.0 py36_0
sip 4.19.8 py36hf484d3e_0
six 1.12.0 py36_0
sqlite 3.27.2 h7b6447c_0
tensorboard 1.12.2 py36he6710b0_0
tensorflow-estimator 1.13.0
tensorflow-gpu 1.12.0
termcolor 1.1.0 py36_1
tk 8.6.8 hbc83047_0
toolz 0.9.0 py36_0
tornado 6.0.1 py36h7b6447c_0
tqdm 4.31.1 py_0
urllib3 1.24.1
werkzeug 0.14.1 py36_0
wheel 0.33.1 py36_0
x264 1!152.20180717 h14c3975_1001 conda-forge
xz 5.2.4 h14c3975_4
yaml 0.1.7 had09818_2
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0

Other information

I have checked the sizes of the photos; they are all equal. A similar issue occurs when I try a different model.

torzdf commented 5 years ago

OOM = Out of memory. Reduce the batch size or use a different model.
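For a sense of scale, the tensor named in the error message is small. A quick back-of-the-envelope calculation, using only the shape and type printed in the log (`shape[3,3,128,128]`, `type float`, taken here to mean 32-bit floats), suggests the GPU was already nearly full before this allocation was attempted:

```python
from math import prod

# Shape and element type are copied from the OOM message in the log above.
shape = (3, 3, 128, 128)
bytes_per_float32 = 4  # assumption: "type float" / DT_FLOAT is 32-bit

size_bytes = prod(shape) * bytes_per_float32
size_kib = size_bytes / 1024

print(f"{size_bytes} bytes = {size_kib:.0f} KiB")  # 589824 bytes = 576 KiB
```

A request of roughly half a megabyte failing on a 6GB card is consistent with the rest of the memory already being held by weights, activations and optimizer state, which is why a smaller batch or a lighter model is the usual remedy.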

Kirin-kun commented 5 years ago

The villain model won't work with a GTX 1060 6GB. There's not enough memory for it, even with a batch size of two. Trust me, I tried.
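The trial-and-error described here (lower the batch size until training fits, or give up) can be sketched as a small loop. This is only an illustration, not part of faceswap: `largest_batch_that_fits` and `fake_step` are hypothetical names, and `MemoryError` stands in for the `ResourceExhaustedError` that TensorFlow raises in the log above.

```python
def largest_batch_that_fits(try_step, start, minimum=1):
    """Halve the batch size until one training step succeeds.

    try_step is a hypothetical callable that runs a single training step
    and raises MemoryError when it runs out of memory. With TensorFlow you
    would catch tf.errors.ResourceExhaustedError instead.
    Returns the first batch size that works, or None if none does.
    """
    batch = start
    while batch >= minimum:
        try:
            try_step(batch)
            return batch
        except MemoryError:
            batch //= 2  # try again with half the batch
    return None


def fake_step(batch, limit):
    """Stand-in for a real training step: fails above a made-up limit."""
    if batch > limit:
        raise MemoryError(f"batch {batch} does not fit")


# A model that fits with up to 3 samples per batch: 8 and 4 fail, 2 works.
fits = largest_batch_that_fits(lambda b: fake_step(b, limit=3), start=8)

# A model that never fits, as reported here for villain on a 6GB card.
never = largest_batch_that_fits(lambda b: fake_step(b, limit=0), start=8)
```

In practice a process that has hit an OOM on the GPU may need to be restarted before retrying, so a real version of this loop would more likely relaunch the training command with a smaller `--batch-size` each time.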

With the "memory saving gradients" option, it will train. But I guess it will take a lot longer.
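The reason such an option tends to be slower is that it trades computation for memory: instead of keeping every intermediate activation for the backward pass, only some are kept and the rest are recomputed when needed. The toy sketch below shows that general idea (often called gradient checkpointing) on a chain of scalar layers. It is an assumption-laden illustration, not faceswap's implementation; all function names are made up for the example.

```python
import math


def layer(x, w):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative of the toy layer with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_full(x0, weights):
    """Backprop that stores every activation. Returns (gradient, values stored)."""
    acts = [x0]
    for w in weights:
        acts.append(layer(acts[-1], w))
    g = 1.0
    for i in reversed(range(len(weights))):
        g *= layer_grad(acts[i], weights[i])
    return g, len(acts)


def grad_checkpointed(x0, weights, every):
    """Backprop that stores only every `every`-th activation and recomputes
    the rest one segment at a time. Returns (gradient, peak values stored)."""
    n = len(weights)
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(weights):
        x = layer(x, w)
        if (i + 1) % every == 0 and (i + 1) < n:
            checkpoints[i + 1] = x

    g = 1.0
    peak = len(checkpoints)
    end = n
    for start in sorted(checkpoints, reverse=True):
        # Recompute the inputs of layers start..end-1 from the checkpoint.
        seg = [checkpoints[start]]
        for i in range(start, end - 1):
            seg.append(layer(seg[-1], weights[i]))
        peak = max(peak, len(checkpoints) + len(seg))
        # Backpropagate through this segment, last layer first.
        for i in reversed(range(start, end)):
            g *= layer_grad(seg[i - start], weights[i])
        end = start
    return g, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_full(0.3, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.3, weights, every=4)
```

Both functions return the same gradient, but the checkpointed version holds fewer values at once and pays for it with an extra partial forward pass per segment.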