Hi and thank you!
So for the detection you can tweak a couple of things. First off, the confidence threshold can be changed, as you mentioned. This could be added in the RetinaFaceModel by using tf.gather with indices selected by the desired condition, such as where the classification score is greater than 0.6. The second thing to tweak is the resolution of the input image. The predictions are normalized, if I remember correctly, so the positions of the landmarks can just be scaled by the original frame resolution.
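For reference, a minimal sketch of what that filtering could look like, assuming decode_preds holds the decoded detections stacked as [bboxes, landms, landms_valid, conf] so the score sits in the last column (as in the snippet further down the thread):

```python
import tensorflow as tf

# Sketch only: keep detections whose confidence exceeds a chosen threshold.
# Assumes decode_preds stacks [bboxes, landms, landms_valid, conf], so the
# classification score is the last column.
score_threshold = 0.6
keep = tf.where(decode_preds[:, -1] > score_threshold)[:, 0]
filtered_preds = tf.gather(decode_preds, keep)
```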
Yes, feel free to share your ideas. It's always nice to take in others' ideas!
Thanks for getting back to me. My theories about improving this were reached in an informal fashion. I actually started using your program because the other one I used had a memory-management issue that was constantly crashing the Colab environment. Your code doesn't seem to have this problem since it mostly runs in VRAM, which has let me experiment with all sorts of inputs to see what makes the best end product. First and foremost, selection of your video clip is going to be key: no skinny faces on round faces or long faces on short faces. There isn't a swap program in existence that can fix this unless you are swapping the entire head. Beyond that, there are two things I have found to have the greatest impact on the final product: the video framerate and the clarity of the facial landmarks, both of which can be enhanced as part of the preprocessing I do.
So the first thing I do is break the video into frames and upsample the faces using GFPGAN. There are other upsampling methods, but I find this one makes the fewest changes while still increasing the clarity of the landmarks, which matters for video since you need continuity between frames. Then I increase the framerate of the video. My test set has been all cell phone videos, so they are mostly at 30fps. Some of them work fine, but I found that increasing the framerate to 120fps keeps rapid motions and blinking from causing the microstutters they do in lower-fps videos. For this I use RIFE, since it seems to create the smoothest motion and also smooths over any awkwardness introduced while upsampling the faces.
Next you run the FaceDancer inference, which should produce a much smoother and clearer face swap. But I'm not done yet. After this I drop the video's framerate back down to 30fps, giving RIFE yet another opportunity to smooth the motion and stutters. If it looks good there, you're good to go. If not, you can repeat the steps you applied to the source video to clean up the swapped face and raise the framerate to smooth out the changes. After that, if you are bold and like super-HD content, run any upscaler you like. RealSR in combination with all of this turns cell phone videos into swapped 4K output running at 120fps. If you don't want that big a video file, you can use a different upscaler or tweak the fps of the final step. The most important part is to use RIFE to "fill in the blanks" and upsampling to clarify the landmarks.
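To make the workflow concrete, here is a rough sketch of how I chain the steps together. The ffmpeg calls are real syntax, but the GFPGAN, RIFE, and RealSR invocations (and the file paths) are placeholders for whatever your local installs use, so treat those as pseudocode:

```python
import os
import subprocess

def run(cmd):
    # Run a shell command and stop if it fails.
    subprocess.run(cmd, shell=True, check=True)

os.makedirs("frames", exist_ok=True)

# 1. Split the source video into frames (ffmpeg).
run("ffmpeg -i source.mp4 frames/%06d.png")

# 2. Restore/upsample the faces frame by frame.
#    Placeholder: swap in however you invoke GFPGAN locally.
run("python inference_gfpgan.py -i frames -o frames_restored")

# 3. Re-encode the restored frames, then interpolate 30fps -> 120fps.
#    Placeholder: swap in however you invoke RIFE locally; paths are illustrative.
run("ffmpeg -framerate 30 -i frames_restored/%06d.png restored_30fps.mp4")
run("python inference_video.py --video restored_30fps.mp4")

# 4. Run the FaceDancer swap on the 120fps video.

# 5. Drop the swapped result back down to 30fps (ffmpeg's fps filter).
run("ffmpeg -i swapped_120fps.mp4 -vf fps=30 swapped_30fps.mp4")

# 6. Optional: upscale with RealSR (again, a placeholder for your local command).
```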
My idea was to take this a step further in post-processing and implement some kind of check to eliminate "bad frames", using something like SSIM loss, and then use something like RIFE to fill in the holes this creates, but I haven't found a good way of doing this just yet. If you have any ideas or would like to discuss this further, let me know.
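In case it helps the discussion, here is a rough sketch of the kind of check I have in mind, using TensorFlow's built-in SSIM between consecutive frames and flagging frames that fall well below the running average (the threshold is just a made-up starting point):

```python
import tensorflow as tf

def flag_bad_frames(frames, drop_threshold=0.15):
    """frames: float32 tensor [num_frames, H, W, 3] scaled to [0, 1].

    Returns indices of frames whose SSIM against the previous frame is
    much lower than the average, i.e. candidates for removal and
    re-interpolation with RIFE.
    """
    ssim = tf.image.ssim(frames[:-1], frames[1:], max_val=1.0)  # [num_frames - 1]
    mean_ssim = tf.reduce_mean(ssim)
    bad = tf.where(ssim < mean_ssim - drop_threshold)[:, 0] + 1  # +1: index the later frame
    return bad
```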
I am glad it works well. It is true that one of the bigger challenges is dealing with face shape, which, as you put it, is difficult when one face is skinny vs. round, or e.g. a long nose vs. a small nose. Forcing the model to make inpainting decisions like that easily causes distortions. HifiFace arguably deals with this by using a 3DMM; however, to quote one of the authors, it works best on images and struggles on video.
The landmark stability is a good point. You could probably alleviate this further by using an ROI-based 68-landmark predictor for alignment after detection. However, in my recent work I have introduced small distortions, such as scale and rotation, to the target during training. The end result is more stable in this regard, but I am not sure which change contributed the most.
Anyway, your pre-processing is interesting. Cool that it helps. Do you perhaps have a video or two to show, together with results from skipping your pre-processing? I am curious to see the difference. :)
The bad-frame detection is an interesting idea. Some spontaneous ideas: maybe some kind of perceptual loss using ArcFace (assuming you look at the faces only). Maybe instance selection could be used? Or maybe a plain old anomaly-detection auto-encoder trained on reconstructing 2-3 frames.
Or perhaps PSNR could be utilized?
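To make the ArcFace idea slightly more concrete, something along these lines could be a starting point, assuming you already have an ArcFace-style embedding network saved as a Keras model and the faces cropped and aligned to its input size (the model path and the threshold below are just placeholders):

```python
import tensorflow as tf

# Hypothetical: any ArcFace-style embedding network saved as a Keras model.
embedder = tf.keras.models.load_model("path/to/arcface_embedder")

def identity_drops(face_crops, sim_threshold=0.5):
    """face_crops: float32 tensor [num_frames, H, W, 3] of aligned face crops.

    Flags frames whose ArcFace embedding is unusually far from the
    previous frame's, which often coincides with a failed swap.
    """
    emb = tf.math.l2_normalize(embedder(face_crops), axis=-1)
    cos_sim = tf.reduce_sum(emb[:-1] * emb[1:], axis=-1)  # cosine similarity per pair
    return tf.where(cos_sim < sim_threshold)[:, 0] + 1
```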
Hey, thanks for getting back to me. Sorry, I am juggling several projects right now. First things first, I should point out that I am something of a newbie to coding and programming, so I might not be able to contribute much on some fronts. I will make a sample of the results with preprocessing and one without to show you what I mean. I have plenty with the preprocessing, but it has been a while since I ran it without. The biggest thing it helps with is rapid movements, like head turns and blinking, which might have resulted in blurry frames in the original recording and thus bad detection by the swapping software. As is probably obvious, interpolating new frames makes your FaceDancer program take much longer to execute, and the preprocessing itself isn't quick, but I think it is worth the effort.
As for the idea about bad frames, that, I believe, is the holy grail of all video editing and could probably be made useful far outside this application. Even the unguided post-processing (i.e. using RIFE to interpolate down and then back up and just hoping it picks the correct frames to delete/add) already produces a much better end result. I imagine that if this were more targeted it could be used to fix a variety of bad outputs produced when edits are made to a video. You would likely be better equipped to figure this out. My ideas just stem from the way SSIM loss is used in other face-swapping programs for training, but I haven't heard of it being used within videos to find bad frames like I suggested. My vision would be something of a hybrid between your one-shot method and the other trained methods, but it is only an idea. I am not sure how much I could contribute in terms of execution besides testing.
I have gotten quite good at writing automation scripts though. For instance, here is your Jupyter notebook for Colab with a few layers of automation added to it and a bit of a user guide I tried to gin up while I was bored at work. Feel free to try it out and share it if you would like. FaceDancerJuypter.zip
Hello, sorry for the late reply. I am also juggling several projects at the moment. I will try to find the time to play around with your notebook.
I still haven't solved the issue with false positives. I still get the occasional ghost face painted onto clothing for some reason. Oddly enough, I also use CodeFormer, which uses the same RetinaFace model but on PyTorch, and it doesn't detect a face in those frames. I had an idea of restricting your inference to a single face per photo, the way CodeFormer does, but I am not really sure where to begin in making this change. All of the videos I am working with are single-subject videos, so it really should only ever be painting one face per frame in my case anyway. Maybe inside your utility function where it extracts faces, I could make it only extract one face per frame? Even better if I could make it extract only the face it is most confident about. That would, I think, eliminate the problem, at least in single-subject videos.
I was just looking through some of the code, and I am guessing this is the part of the RetinaFace code that would need to be adapted so that the selected indices are chosen in a more discriminating fashion. Any suggestions on how to do so?
```python
preds = tf.concat(  # [bboxes, landms, landms_valid, conf]
    [bbox_regressions[0],
     landm_regressions[0],
     tf.ones_like(classifications[0, :, 0][..., tf.newaxis]),
     classifications[0, :, 1][..., tf.newaxis]], 1)
priors = prior_box_tf((tf.shape(inputs)[1], tf.shape(inputs)[2]),
                      cfg['min_sizes'], cfg['steps'], cfg['clip'])
decode_preds = decode_tf(preds, priors, cfg['variances'])
selected_indices = tf.image.non_max_suppression(
    boxes=decode_preds[:, :4],
    scores=decode_preds[:, -1],
    max_output_size=tf.shape(decode_preds)[0],
    iou_threshold=iou_th,
    score_threshold=score_th)
out = tf.gather(decode_preds, selected_indices)
```
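My first thought, which I haven't actually tried yet, is that since non_max_suppression already ranks boxes by score, capping the output at a single box and raising the score threshold might be enough. Something like this (the 0.8 threshold is just a guess):

```python
# Untested guess: keep only the single highest-scoring detection and
# demand higher confidence, since NMS already sorts boxes by score.
selected_indices = tf.image.non_max_suppression(
    boxes=decode_preds[:, :4],
    scores=decode_preds[:, -1],
    max_output_size=1,          # one face per frame
    iou_threshold=iou_th,
    score_threshold=0.8)        # stricter than the default score_th
out = tf.gather(decode_preds, selected_indices)
```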
So, for anyone else experiencing this issue, I have an update on a possible solution. It is inelegant, as it will now only swap a single subject, and it doesn't really discriminate on which subject that is. If it selects the false positive as annotation[0], then it will only swap that, which wouldn't work, but so far it hasn't. I am still trying to figure out how to make RetinaFace either rank the outputs or, ideally, use face detection to make sure that the target is annotation zero, so that swap_func can then ensure it only swaps that one.
```python
# swap_func.py at line 56
for annotation in faces:
    if len(faces) > 0:
        annotation = faces[0]
    lm_align = get_lm(annotation, im_w, im_h)
```
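If the detector keeps returning more than one annotation, another variation I have been toying with (untested, and assuming the last value of each annotation is its detection confidence, as in the RetinaFace output above) would be to pick the most confident one instead of blindly taking faces[0]:

```python
import numpy as np

# Untested sketch: instead of blindly taking faces[0], pick the annotation
# the detector is most confident about. Assumes the last value of each
# annotation is its confidence score.
if len(faces) > 0:
    annotation = faces[int(np.argmax([f[-1] for f in faces]))]
    lm_align = get_lm(annotation, im_w, im_h)
```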
So I get a bit of an odd issue where folded clothing sometimes comes up as a false positive in some frames of a video, causing the program to try to paint the source face onto the false positive. This leads to odd flashing in the final video, since it only happens on intermittent frames. I was wondering if there is a way to adjust the confidence threshold for the face detection to make it a bit more discriminating, or if there is an alternative approach that will eliminate this error. I should point out that this only seems to be an issue when working with very high resolution videos.
Also, if you want to discuss I would like to share some of my informal research on what makes for a better source video that may help you in improving this. There are a few steps that I have found that you can take before running the swap to ensure that the swap will be more likely to be successful. Let me know if you would like to discuss further.
Oh, I also rewrote your Colab notebook so that it loops through an input folder of images and videos all at once and then deposits the results in a chosen folder. It requires you to build the FaceDancer program as a zip file by downloading the zip from GitHub, adding the pretrained models from Hugging Face, and uploading it to the parent directory of your Drive; the notebook then installs it from there instead of downloading it. The entire notebook is a single cell. It also gives you a drop-down for the different pretrained models and fields where you can enter your input and output folders. I can share that with you as well if you would like.
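The batching logic itself is nothing fancy; stripped of the Colab widgets, it boils down to something like this, where swap_file stands in for whatever entry point the notebook calls and the Drive paths are placeholders:

```python
from pathlib import Path

# Illustrative only: swap_file is a stand-in for the actual FaceDancer
# entry point; the Drive paths below are placeholders.
SOURCE_FACE = Path("/content/drive/MyDrive/source.jpg")
INPUT_DIR = Path("/content/drive/MyDrive/facedancer_in")
OUTPUT_DIR = Path("/content/drive/MyDrive/facedancer_out")
EXTENSIONS = {".jpg", ".png", ".mp4", ".mov"}

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for item in sorted(INPUT_DIR.iterdir()):
    if item.suffix.lower() in EXTENSIONS:
        swap_file(SOURCE_FACE, item, OUTPUT_DIR / item.name)
```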
Anyway, thanks for your work. I enjoy messing around with the program. Look forward to hearing back.