deepfakes / faceswap-playground

User dedicated repo for the faceswap project

User Help: CPU extraction #105

Closed jslurm closed 6 years ago

jslurm commented 6 years ago

Posting this here since it's more of an end-user/technical help question. Also, a disclaimer that I'm not very knowledgeable at all when it comes to Linux, Python, or anything command line related, and I'm still figuring out how to use git and github so a bit of hand-holding may be helpful.

Thanks to the tutorial here and a lot of trial and error, I've managed to get training working on my RX 480 using Ubuntu 16.04. Training is a bit slow, but works reasonably well. But right now, my biggest problem is when extracting faces.

It's my understanding that dlib just doesn't work with AMD GPUs, so my only option for extraction is to use the CPU (I've got a Ryzen 5 1600 with 16 GB RAM). Although the HOG detector is much faster, I'm not really a fan because it misses a lot of faces that aren't facing perfectly forward. Before the switch to the Keras port of 'face_alignment' I was able to use the CNN detector with -j 6 (6 processes/threads). It was slower but still reasonable if there weren't too many images. The results weren't always great, though, with several misaligned and over-cropped faces when they were at an angle.

I know the new 'face_alignment' Keras port is supposed to be better (and faster?), but I can't seem to get any results using my CPU. If I try using multiple processes it eats all my system memory and either crashes, or spits out a bunch of errors and still goes really slow. So is there a way to make this work better with my CPU, or am I completely out of luck?
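The "eats all my system memory" failure mode usually means too many worker processes for the available RAM. A rough sketch of how you might cap the `-j` count (the per-worker memory figure below is a guess, not a number measured from faceswap; profile a single-process run to get a real one):

```python
import os

def safe_workers(total_gb, per_worker_gb=2.5, reserve_gb=2.0, cpus=None):
    """Estimate how many extract processes fit in RAM.

    per_worker_gb is an assumed peak memory per CNN-detector process;
    measure your own run and adjust it.
    """
    if cpus is None:
        cpus = os.cpu_count() or 1
    by_ram = int((total_gb - reserve_gb) // per_worker_gb)
    return max(1, min(by_ram, cpus))

# With 16 GB total and 12 logical cores, roughly 5 workers fit.
print(safe_workers(total_gb=16, cpus=12))
```

With those assumptions, a 16 GB machine would run `-j 5` rather than `-j 6`, trading a little speed for not swapping.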

I'd seen mention of compiling dlib to use AVX instructions, which I tried but I have no idea if I did it properly. I cloned the davisking repo into a new directory inside the faceswap directory (and in my active virtual environment), then used the commands to compile using AVX, and it appeared to install correctly but I have no idea if the faceswap scripts are using this or the version I already had installed.
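One way to settle the "which dlib am I actually using" question is to ask Python where the module resolves from inside the active virtualenv. A minimal sketch (using the stdlib `json` module as a stand-in so it runs anywhere; substitute `"dlib"` in your environment):

```python
import importlib.util

def module_path(name):
    """Return the file a module would be imported from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# In your venv, module_path("dlib") shows whether the freshly compiled
# AVX build or the previously installed copy wins on import.
print(module_path("json"))
```

If the path points into your virtualenv's site-packages and its timestamp matches your AVX build, the faceswap scripts are using the new compile; if it points at the old install, the rebuild didn't take effect.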

Anyone out there have any ideas on how I could improve CPU performance for extraction? Or, alternatively, any face extraction tools that could work with an AMD GPU?

gessyoo commented 6 years ago

I'm not an expert either, but I spent a lot of time trying to find ways to speed up the extract/align process on Win 10 x64 with CUDA. I ended up re-compiling Dlib and OpenCV to use CUDA, but I think there may be options to compile Dlib and OpenCV with OpenCL rather than CUDA for AMD GPUs. I suggest using the CMake GUI version, which makes it easier to see and choose the available compile options.

bryanlyon commented 6 years ago

My best suggestion to improve extraction speed is to use the --skip-existing option. Start with HOG, then redo it with CNN; it will only look at images that HOG missed. You can use -j, but you do need a lot of RAM for each thread. We can work on improving RAM usage. It'd help if you could give me the specific errors you're getting.

torzdf commented 6 years ago

Honestly, the easiest way would be to have 2 versions of faceswap. One for extraction and one for training/convert.

For the extraction version grab the faceswap for the commit just before the Keras port.

Go to the parent of your current faceswap directory (so if you have ~/faceswap, go to ~/):

git clone https://github.com/deepfakes/faceswap.git faceswap_extract

Change to the pre-face-alignment commit.

cd faceswap_extract
git checkout 6f2d260591b830b4230bcdc3aa20bb3623883172

You should now be able to run extract from the faceswap_extract directory and training from the faceswap directory, all from within the same virtualenv.

NB: The faceswap_extract checkout is a snapshot of that point in time, so any functionality added since that commit will not be available for extraction.

jslurm commented 6 years ago

@bryanlyon : I wasn't able to reproduce the first error it gave me when it crashed, but it had something along the lines of "bad_alloc", which I understood to have something to do with memory usage. I'll try to give as much info about the current behavior as I can. When I try it now, I get the following (each line repeated 4x):

Info: initializing keras model...
[date & time]: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices
[date & time]: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Ellesmere, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE

Then it looks like it starts extracting, but after a few frames are processed it spits out the previous two lines again, hangs for a while, then gives the following errors (each repeated dozens of times):

OpenCL error Error: [ComputeCpp:RT0500] Failed to create buffer CL_INVALID_BUFFER_SIZE
Error detected Inside Sycl Device.

It tries to process more frames, then starts giving those two errors again, this time alternating between them for what looks like several dozen lines (more than I can easily count). Then it goes back to trying to process the frames, and I assume that cycle would continue if I didn't kill the process.
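Those errors come from TensorFlow's SYCL build trying to allocate buffers on the OpenCL device rather than staying on the CPU. One possible workaround to try, sketched below: hide accelerator devices from TensorFlow before it is imported. The `"-1"` convention is the documented way to hide devices on CUDA builds; whether the SYCL build honors the same variable is an assumption worth testing, not a confirmed fix:

```python
import os

# Must run before TensorFlow is imported anywhere in the process.
# "-1" hides all CUDA devices so TF falls back to the CPU; it is an
# open assumption whether the SYCL build respects an equivalent knob.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # would now see only the CPU on CUDA builds
print(os.environ["CUDA_VISIBLE_DEVICES"])
```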

jslurm commented 6 years ago

@gessyoo As far as I can tell, there is no way to compile Dlib with OpenCL, and it seems like the developer has no interest in supporting AMD GPUs at the moment. Unless someone else out there has cracked it, GPU extraction is only for Nvidia users.

jslurm commented 6 years ago

@torzdf Thanks for the tip. I had been using git checkout to switch to that branch for extraction, then using git pull origin master to switch back to the main branch for training and conversion. Your method is probably better/safer (again, still figuring out my way around git(hub)). Though ultimately I'd rather use the current keras model if possible, since it seems to have fewer problems aligning and cropping more angled faces.

torzdf commented 6 years ago

Yeah, as @enniowatson says, I think you're out of luck. There doesn't seem to have been any work done to port dlib to AMD/OpenCL, so you will only be able to use it with NVIDIA cards or CPU for now.

jslurm commented 6 years ago

I've been trying @bryanlyon 's suggestion to extract using HOG first and then CNN, while using -s to skip extracted frames. This is probably the best solution for me to use the Keras port; however, I'm running into some problems.

First, it's still very slow. I had a set of 699 frames with 1 face in each frame. HOG found 504 in less than ten minutes. CNN with 4 processes took over two hours to find most of the remaining faces, but got hung up at around 95%, at which point I had to Ctrl+C to kill the process because I had other things to do. And even though it was 95% complete, there were still 63 (of the remaining 195) faces not found.

I wonder if, even though it's not extracting existing faces, it's still checking every frame for faces it might have missed, because at no point did it quickly rush past a number of files that were already extracted. It also looks like it's extracting the files in a random order when I check the aligned folder during the process.

The other problem I noticed is that the alignments from running CNN were not being written to the existing aligned.json file. So even though those extra faces were extracted and trained, they could not be converted. I'm not sure if this is because I killed the process before it was finished, so I'll try it again and update if there is any change, but I wanted to let folks here know in case it is a possible issue.

bryanlyon commented 6 years ago

Yes, --skip-existing will recheck all files that haven't yet had a face found; this is a deliberate decision. If you want, you can split your folders into manageable chunks and use --skip-existing to add them bit by bit. This will update the alignments.json while keeping each run down to a small enough chunk. The files are processed in a "random" order; it's actually the order the OS feeds the files in, since we don't sort on extract (though you're welcome to send a PR if you want it to happen in order). Force-killing the extract will prevent it from writing out its data, since we don't capture the Ctrl+C break on extract; this may be worth fixing.
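The Ctrl+C gap is the kind of thing a try/finally around the extract loop would close. A minimal sketch of the idea (not the project's actual code; `detect` and the alignments layout here are stand-ins):

```python
import json

def extract_with_flush(frames, detect, alignments_path):
    """Run detection per frame, but write alignments even on Ctrl+C.

    detect(frame) is a placeholder for the real face detector; the
    {frame: faces} JSON layout is illustrative, not faceswap's schema.
    """
    alignments = {}
    try:
        for frame in frames:
            alignments[frame] = detect(frame)
    except KeyboardInterrupt:
        print("interrupted -- saving partial alignments")
    finally:
        # Runs whether the loop finished, was interrupted, or raised.
        with open(alignments_path, "w") as fh:
            json.dump(alignments, fh)
    return alignments
```

With that shape, killing the run partway still leaves a usable alignments file covering the frames processed so far.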

It does skip the already existing files as long as they exist in the aligned folder. If you're having problems with this, give me more info by checking to ensure that the count matches and post the logs.

Also, yes CNN is very slow, this is why I suggested running it only after running the much faster hog.

jslurm commented 6 years ago

Thanks for the feedback, @bryanlyon . I guess I'm still figuring out what the "expected" behavior is for my particular setup. On subsequent tries it looks like Ctrl+C does in fact write the alignments to the .json file; I think it may have just hit an error on that first attempt.

Not sure if I was clear enough, but what I mean with --skip-existing is that it looks like it's still rechecking files where a face has already been found, to see if there are any other faces it missed. For me, using it does not seem to impact the extraction time at all. It will say "excluding x files" when initializing, but still runs through all the files in the target folder. In fact, I just tried this on a folder with 606 images, all extracted with CNN and 3 processes (which took about 2 hrs 20 mins the first time); while I only let it run for a few minutes this time, it seemed to be going at pretty much the same rate. (Sorry if I'm a little slow on this, but I'm not entirely sure which "logs" I should post to give more info...)

As of right now it seems like my best option is to run the extraction with HOG, then manually move the skipped frames into a separate folder and extract those using CNN. And of course finding sources that don't have too many odd angles helps too.

EDIT: I also wanted to add that any RAM usage improvements would probably be a great help, like you mentioned. Sounds like there's some progress on that front so I'll keep an eye and see where that goes.