deepfakes / faceswap

Deepfakes Software For All
https://www.faceswap.dev
GNU General Public License v3.0

training issue #310

Closed · kellurian closed this issue 6 years ago

kellurian commented 6 years ago

Expected behavior

Since the last several merges, I have been having issues with training after a certain point. I am using the multi-GPU plugin and am training about 13,000 images, with about 3000 subject images, which is not a large set for me. There is no problem with beginning the training, and it usually proceeds for about 8-24 hours until the loss gets to about 0.14 or so; then it jumps to about 0.454, and the preview windows for both autoencoders go completely monocolor red and do not improve after that point. I can start training over, but it generally happens at about the same timeframe. I have combed through the data sets looking for bad .png files and haven't found any. I could put this in the playground, but I thought it was relevant because I don't get an error, and it just started happening in the last 7-10 days with some of the new merges, though I am not sure which one in particular. I haven't really changed anything I was previously doing, except making sure to keep updated with master.

Steps to reproduce: any training session where the loss gets below about 0.015.


[screenshot of the training session attached]
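
For anyone wanting to catch the exact moment this happens, here is a minimal sketch of a Keras callback (hypothetical, not part of faceswap) that halts training when the loss suddenly jumps, so the last good weights are not overwritten by a collapsed model:

```python
from keras.callbacks import Callback

class LossSpikeGuard(Callback):
    """Hypothetical helper (not part of faceswap): stop training when the
    loss suddenly jumps, e.g. from ~0.14 to ~0.45, so the last good
    weights are not overwritten by a collapsed model."""

    def __init__(self, spike_ratio=2.0):
        super(LossSpikeGuard, self).__init__()
        self.spike_ratio = spike_ratio
        self.previous_loss = None

    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is None:
            return
        if self.previous_loss is not None and loss > self.previous_loss * self.spike_ratio:
            print('Loss jumped from %.4f to %.4f -- halting training.'
                  % (self.previous_loss, loss))
            self.model.stop_training = True
        self.previous_loss = loss
```

Passing an instance via `model.fit(..., callbacks=[LossSpikeGuard()])` would stop the run at the first spike instead of letting the red previews continue for hours.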

bryanlyon commented 6 years ago

Hmmm, that is odd. That isn't something I've seen, and I can't think of any reason your situation would be fundamentally different from mine.

Can you give me more info? I see you're using the original model with a BS of 256. Is this always what you're running when you see this problem? Have you gotten anywhere with the model once it happens, or is it just done for good at that point? Are you running the 4k monitor while the problem happens? Have you tried running at a lower resolution during training? Are the cards connected with an SLI bridge? What kind (flex, hard, or high bandwidth)?

My thoughts are that this is some sort of overflow or RAM corruption. I'd suggest trying smaller batch sizes and a lower-resolution framebuffer. There are also some potential corruption issues I've run into when using SLI and CUDA together, so I suggest disabling SLI when training (possibly even removing any bridges).

If you're willing to share your models and data, I'd love to check it to see what is happening.

iperov commented 6 years ago

> preview windows for both the autoencoders go completely monocolor-red

Model overfitted?

bryanlyon commented 6 years ago

No, overfitting will usually just make the model poor at conversion, not turn everything straight red. That looks like a different problem.

iperov commented 6 years ago

I got a red preview when I experimented with the number of conv layers.

kellurian commented 6 years ago

Yeah, I am running 4k with SLI and CUDA enabled, the bridge is a hard bridge, and I usually keep it enabled. The weird thing is that it just started happening. I have let it go a few hours afterward (usually because I left it in place and didn't notice), and it doesn't seem to want to improve; it just stays the same. It doesn't seem to try to refit the model, it just stays a red blank for both the A and B autoencoders. I used to let it run for days and have gotten it down to 0.009 without problems, but just this last week or two it seems to do this. I am not sure exactly what day it started because I am not training every day, but I have seen it on models from the last one or two weeks.

Could it be something about the input (model A) data? We've been changing so many things (eye alignment, etc.) that I wonder if that could be the case. I really just wanted to know if anyone else was seeing this as well or if it was just me. I will look at getting some model data to post to you if I can. The IAE model doesn't seem to do this, but I don't like the results, so I switched back to the original; then again, I may not have trained it down to the same level.

kellurian commented 6 years ago

I decreased the batch size back down to 128 and haven't had the problem all day, so maybe that is the issue. I will keep training and update. Maybe it's a memory or overheating issue with the VRAM in the cards or something, though they are water cooled.
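
If overheating or VRAM pressure is the suspicion, one way to rule it out is to log temperature and memory use alongside the training run. A minimal sketch using `nvidia-smi` (the query field names are assumptions; check `nvidia-smi --help-query-gpu` for your driver):

```python
import subprocess
import time

# Hypothetical monitoring loop, run in a separate terminal while training.
QUERY = ['nvidia-smi',
         '--query-gpu=index,temperature.gpu,memory.used,memory.total',
         '--format=csv,noheader,nounits']

while True:
    output = subprocess.check_output(QUERY).decode().strip()
    for line in output.splitlines():
        idx, temp, used, total = [field.strip() for field in line.split(',')]
        print('GPU %s: %s C, %s/%s MiB VRAM' % (idx, temp, used, total))
    time.sleep(30)  # sample every 30 seconds
```

If the collapse always lines up with a temperature or VRAM spike in the log, that points at hardware; if not, a software cause looks more likely.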

kellurian commented 6 years ago

Well, it somehow seems related to batch size: as long as I stay below 256, it doesn't happen. I don't know why that is. Thanks for looking into this, guys. If I find anything else, I'll post it. I'll go ahead and close this now.

Picslook commented 6 years ago

Did you ever figure out how to prevent this? I am still getting this same issue with dfaker

kellurian commented 6 years ago

Not really, though I haven't had it happen with the original high-res trainer. bryanlyon thinks it is some issue with the communication between my GPU and the bus.


Picslook commented 6 years ago

Alright. On my part, I only get it using dfaker; all other trainers work fine. That makes it difficult to reuse a model, since it breaks before it reaches an acceptable level of progress. If I do a whole new model, it breaks only after 2-3 days. I really don't know what's going on or how it could be related to the GPU bus, but thanks anyway.

kellurian commented 6 years ago

I haven't used dfaker yet, mainly just faceswap, but I am sure it would probably happen to me too. I don't really believe it's the bus; I think it's something software related, because I can run the originalhighres trainer and I have not had it happen yet.


bryanlyon commented 6 years ago

I've run the original model to over 400k iterations and it didn't show this problem. However, as soon as I wiggled the GPU in its socket, it corrupted in a similar way. Other errors like this have always been tracked down to hardware issues.

https://github.com/tensorflow/tensorflow/issues/3912

We also had someone whose image was turning green; their problem was traced down to a hardware issue caused by a factory overclock set by the manufacturer, which was fine for gaming but caused corruption in CUDA compute. I highly recommend running your boards at stock (NVIDIA) speeds only, checking the connection with your motherboard, and making sure that the power supply is sufficient.

The fact that lower batch sizes fixed it pretty much confirms (to me) that it's a RAM/bus issue, since you're now sending far less data per pulse across the bus to the RAM. Feel free to use Original/highres if that works for you, though.
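
One way to test the RAM/bus theory without swapping hardware is to repeat the same computation on the GPU and compare it against the CPU; silent corruption should show up as a mismatch. A rough sketch (TensorFlow 1.x style, as used at the time; the tolerance is a guess):

```python
import numpy as np
import tensorflow as tf

# Fixed inputs, so every iteration should produce the same product.
a = np.random.rand(4096, 4096).astype(np.float32)
b = np.random.rand(4096, 4096).astype(np.float32)

with tf.device('/cpu:0'):
    cpu_result = tf.matmul(tf.constant(a), tf.constant(b))
with tf.device('/gpu:0'):
    gpu_result = tf.matmul(tf.constant(a), tf.constant(b))

with tf.Session() as sess:
    for step in range(100):  # faults from a flaky bus tend to be intermittent
        cpu_out, gpu_out = sess.run([cpu_result, gpu_result])
        max_diff = np.abs(cpu_out - gpu_out).max()
        if max_diff > 1e-2:  # loose tolerance; small fp32 ordering differences are normal
            print('Step %d: possible GPU compute corruption, max diff %.4f'
                  % (step, max_diff))
```

If mismatches appear with large matrices but not small ones, that would point the same way as the batch-size observation above.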

kellurian commented 6 years ago

No offense intended, but that could be confirmation bias. I know you know loads more about software than me, but I am a hardware guy: I have built countless systems from the ground up and have not seen this error on the PCIe bus before, despite massive GPU loads. In addition, while you could be correct that errors can occur from poor seating, the error I am seeing is not what is described in your link. I don't get corruption of the video; I get a complete wipeout of the image. Also, people seem to get it a lot with dfaker, which typically uses extremely low batch sizes. And when faceswap ran on my system up until about late March, I never had the issue despite using batch sizes up to 280; it only started after I updated my version to the GitHub repo at that time.

I suppose it's moot anyway, since if you're right I can't do anything about it. I did try to replicate it by moving the cards around and wiggling them, and it didn't change anything. I checked all your recommendations (GPU temp, physically removing the GPU bridge, and disabling the SLI option in the NVIDIA software) and it still happened with the original model. No overclocking here either. I also ran extreme testing in SiSoft for hours with no errors. It is probably a two-hit scenario: something in the code is overtaxing certain motherboard buses.
