OndrejTexler / Few-Shot-Patch-Based-Training

The official implementation of our SIGGRAPH 2020 paper Interactive Video Stylization Using Few-Shot Patch-Based Training

How to turn off the use of mask input? #9

Closed. MazeRuiZhang closed this issue 3 years ago.

MazeRuiZhang commented 3 years ago

I suppose the mask layer is not a required thing, right? If so, how can I turn off its use during training and inference? I tried commenting out the mask dir line in the YAML config file, but it does not seem to work.

Is there a simple way to do that? Or should I modify files such as data.py, trainers.py, and train.py to remove the use of the mask layer?

Also, thank you so much for your excellent work on this project! With your recent update on temporal consistency, my video output results improved a lot!

Best,

Ray

OndrejTexler commented 3 years ago

Hello Ray.

Thank you for your kind words! I am really glad that the temporal consistency tools were of help to you.

1) Inference: The mask is not used during inference. Inference is done on full images, using DatasetFullImages from data.py. Although some mask functionality is implemented there, every time DatasetFullImages is created, the string "ignore" is passed as the dir_mask parameter. (Do not get confused: some _gen dirs in testing-data.zip contain a mask folder, but those masks are used for something else, not for inference.)

2) Training: Here the mask is used in a specific way. Training is not done on entire images; instead, it is done on small rectangular patches cropped from the training image(s). The mask image specifies which parts of the training image to sample patches from during training. This means that if you provide a completely white image as the mask, patches are sampled from the entire training image, which is the default and, in most cases, the desired behavior. I could tell you what to modify in the code to make masks optional and sample from the entire training image by default, but it would be much easier to simply create white .png images of the correct resolution with the correct names and place them in the mask folder. In testing-data.zip, see, for instance, Lynx_train/mask; the masks there are white images.
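
For example, a tiny script along these lines could generate such white masks (the folder paths are just placeholders; point them at your own data, and double-check that the file names and format match what your YAML config expects):

```python
# Minimal sketch (not part of the repo): create all-white mask PNGs that match
# the resolution and file names of the training frames.
import os
from PIL import Image

train_frames_dir = "Lynx_train/input"   # placeholder: folder with the training frames
mask_dir = "Lynx_train/mask"            # placeholder: folder the config expects masks in

os.makedirs(mask_dir, exist_ok=True)
for name in sorted(os.listdir(train_frames_dir)):
    if not name.lower().endswith(".png"):
        continue
    w, h = Image.open(os.path.join(train_frames_dir, name)).size
    # a pure white mask means patches are sampled from the entire training image
    Image.new("L", (w, h), 255).save(os.path.join(mask_dir, name))
```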

3) [Optional] Gaussian: Some _gen folders contain masks. They are used neither during inference nor during training; they are used by gauss.exe to generate the optional gaussian images (input_gdisko_gauss_).

I hope that makes sense! Let me know if there is anything else I can clarify. Best of luck with your work!

MazeRuiZhang commented 3 years ago

Hi Ondrej,

Thank you so much for your detailed and timely reply to my question! And yes, I am using pure white pictures as my mask files, and they work pretty well.

By the way, regarding the models, have you already tried adding maxout or dropout (e.g., 0.3 or 0.5)? If so, how did it affect performance? If not, would it be a good idea for me to try adding them?

Best, Ray

OndrejTexler commented 3 years ago

Hello Ray.

That is a really good question/suggestion to try dropout. At the very beginning of our project, when we were trying to train an image-to-image network using just a single training exemplar, we were, of course, dealing with a strong overfitting problem. We tried various dropout and data augmentation techniques to make the network generalize better to unseen frames/content. We were able to get better results, but definitely not good ones. So we continued exploring and developed patch-based training; then we were able to get really good results.

But we did not try to combine patch-based training with dropout or data augmentation, mainly for two reasons. 1) With patch-based training, the network was not overfitting, and the results we were getting, i.e., the level of generalization, were satisfactory for our use case, so we did not feel the urge to dig deeper and make it even more robust to unseen data. 2) This is more of a thought/intuition. If patch-based training is done on patches of size, for example, 32x32 px, a huge amount of zero-padding happens in the model. There are many convolutional layers, and all of them use zero-padding. On a full image, this zero-padding would affect only a few pixels around the edges, but if a 32x32 patch is fed through the network, all pixels are heavily affected, meaning some information is lost. And as the patches are sampled randomly, different information is lost every time. So, thanks to patch-based training, there already is a "form of dropout".
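
To put rough numbers on that intuition, here is a back-of-the-envelope sketch (the layer count of 8 is an arbitrary assumption, not the exact architecture):

```python
# Back-of-the-envelope: fraction of output pixels whose receptive field reaches
# into the zero padding, for a stack of 3x3 convolutions with padding=1.
# After k such layers, every pixel closer than k px to the border is affected.
def padded_fraction(image_size: int, num_conv_layers: int) -> float:
    interior = max(image_size - 2 * num_conv_layers, 0)
    return 1.0 - (interior / image_size) ** 2

for size in (32, 1024):
    # 8 layers is an assumed depth, just for illustration
    print(f"{size}x{size}, 8 layers: {padded_fraction(size, 8):.1%} affected")
# 32x32     -> 75.0% of pixels affected by padding
# 1024x1024 -> ~3.1% of pixels affected by padding
```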

Anyway, if you want the patch-based training network to generalize better to unseen data, dropout might be a good thing to try. We simply did not have the motivation to try it, for the two reasons above.

If you want to dig even deeper, you can try to use some kind of segmentation as an auxiliary channel during training (in the same way the gaussian images are used).
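
To make that idea a bit more concrete, here is a minimal PyTorch-style sketch (a toy generator, not the actual classes from this repo) of concatenating a segmentation map as extra input channels:

```python
# Minimal sketch: feed a segmentation map as an extra input channel by
# stacking it onto the RGB patch. The generator body here is a toy stand-in.
import torch
import torch.nn as nn

class GeneratorWithAux(nn.Module):
    def __init__(self, rgb_channels=3, aux_channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(rgb_channels + aux_channels, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, rgb_channels, 3, padding=1),
        )

    def forward(self, rgb_patch, seg_patch):
        # the auxiliary channel is simply concatenated along the channel axis
        x = torch.cat([rgb_patch, seg_patch], dim=1)
        return self.body(x)

# usage: crop the segmentation with exactly the same patch coordinates as the RGB frame
gen = GeneratorWithAux()
out = gen(torch.rand(8, 3, 32, 32), torch.rand(8, 1, 32, 32))
```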

I will mark the issue as resolved, but we can, for sure, continue our discussion here!

MazeRuiZhang commented 3 years ago

Hi Ondrej,

Thank you for your reply to my question on dropout. I have also checked the visual results during my training, and yes, they keep improving, although more slowly in the later stages. The generalization ability of the patch-based training is really inspiring! I will find and read some relevant references later!

Now, I am working on an animation project with an artist to see how the few-shot model performs. Our input and output sequences are 1920x1080 pixels. As you mentioned in your paper, the technique still faces challenges when processing high-resolution images. To mitigate the problems that may occur, I have tried several different hyper-parameter settings: bigger patch size, larger batch size, generator and discriminator with more filters, different normalization layers, etc. After spending hundreds of hours, not a single setting has offered a very satisfying result.

Based on my experience with the model, I guess the hyper-parameters should also be determined by the context of the images and drawings. Things like the size ratios of objects in the picture, different strokes, etc., besides the resolution itself, all influence the best hyper-parameter setting, right? But I am doing my training and inference on a single Tesla T4 GPU, so I cannot try all potential settings. Could you give me some suggestions, if any, for hyper-parameter tuning for the high-resolution image task?

Thank you so much!! Ray

OndrejTexler commented 3 years ago

Hello Ray.

Wow, it seems that you have tried quite a few things already!

I agree with everything you said, and the experiments you did make sense to me (tuning patch size, batch size, more filters, norm layers, etc.). If nothing worked, I fear that I do not have any other simple ideas to try. It seems that a more complex change in the network architecture or input data format will be required. A multiscale generator is quite popular and has been successfully used in image-to-image translation tasks; see Fig. 3 in Pix2PixHD (a rough sketch of the idea is below). This would not be easy to implement, but it would be a nice CVPR2021 paper :-) Besides that, I have one more idea, as I said in the previous message:

If you want to dig even deeper, you can try to use some kind of segmentation as an auxiliary channel during training (in the same way the gaussian images are used).

Using segmentation (or some other auxiliary information) would certainly be easier to try than making changes to the network architecture, but it might not work well - it is just an idea :-)
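
And just to illustrate the coarse-to-fine idea behind a multiscale generator, here is a toy two-scale sketch in the spirit of Pix2PixHD Fig. 3 (this is not the architecture used in this repo, just an illustration of the concept):

```python
# Toy two-scale (coarse-to-fine) generator sketch: a global branch runs on a
# downsampled copy of the image, and a local branch refines at full resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class TwoScaleGenerator(nn.Module):
    def __init__(self, channels=3, feat=32):
        super().__init__()
        self.coarse = nn.Sequential(conv_block(channels, feat), conv_block(feat, feat))
        self.fine = nn.Sequential(
            conv_block(channels + feat, feat),
            nn.Conv2d(feat, channels, 3, padding=1),
        )

    def forward(self, x):
        # coarse features from the half-resolution image
        coarse_feat = self.coarse(
            F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        )
        # upsample and fuse with the full-resolution input
        coarse_up = F.interpolate(
            coarse_feat, size=x.shape[-2:], mode="bilinear", align_corners=False
        )
        return self.fine(torch.cat([x, coarse_up], dim=1))

out = TwoScaleGenerator()(torch.rand(1, 3, 256, 256))
```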

And yeah, you are right, the best hyper-parameter settings depend on the particular content and style (e.g., if there are really large brush strokes in the style, you need a large patch size to fully capture them). However, we found that quite good default hyper-parameters can be estimated, and that they work for a large variety of styles and content.

Best of luck!!

MazeRuiZhang commented 3 years ago

Hi @OndrejTexler,

Thank you for your reply in another thread! I finally switched my development OS to Windows and all is good again.

During the last month, I made many changes to the model, and at last the generated video quality became better, even at the 1920x1080 resolution. If you are interested in what specific changes I applied, just feel free to let me know. Now, maybe the only and last challenge for me is the video flickering.

As the training time increases, the output quality definitely becomes better, but the flickering effect also becomes more obvious. I used the Bilateral Filters and Gauss extra-input tools you provided, and they do work well as expected. (BTW, it seems that OpenCV 4.5.1 generated better filtered output results and Gauss pictures than OpenCV 4.2.0 in my environment. I am not sure whether that is common or not.)

With those tools, the flickering is reduced a lot, but not completely removed. Now, I am trying instance segmentation pictures as another extra input. Also, I have read some commented lines in your code, such as self.temporal_frames = 3 and the patch_diff function in data.py. Were those lines planned to solve the flickering? If you are still interested in their results, could you tell me what the plan was, and I may try it if you are too busy? I would also appreciate any other ideas that could reduce the video flickering.

Again, thank you for your brilliant work! The more I work with this project, the more I like this whole idea!

Ray

OndrejTexler commented 3 years ago

Hello @MazeRuiZhang.

Cool, I am glad you were able to make the 1080p video quality better! And of course I am really interested in what specific changes you made. If you want to protect your findings and not share them publicly here, we can discuss this over e-mail, ondrej.texler@gmail.com; I'd love to know all the details!

Hmm ... I do not remember observing increased flickering when training for a longer period of time, but I do remember that the overall quality does not necessarily increase with training time; on the contrary, at some point it usually starts producing too "smoothed" or "washed-out" results. And it is certainly weird that different OpenCV versions produce different results; I do not recall observing anything like that (well ... there might be a bug, or OpenCV 4.5.1 might have different default parameters, etc., but I am not an OpenCV expert).

I can imagine that temporal issues will be much harder to overcome at 1920x1080 px than at lower resolutions. Using segmentation definitely makes sense; you can try different kinds of segmentation, edge detection, depth maps, or basically any other information that you can obtain from your data. All of this additional information has the potential to help the network distinguish between different parts of the quite large 1920x1080 image and reduce flickering. But if this additional information is not consistent, it might hurt performance; e.g., if the segmentation is not precise and in one image, say, a hand is misdetected as background, it will only confuse the network even more.
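
As a concrete example of cheap auxiliary information, per-frame edge maps could be generated with OpenCV along these lines (paths and thresholds are just placeholders to illustrate the idea):

```python
# Minimal sketch: generate per-frame edge maps as an extra auxiliary channel.
# Fixed thresholds help keep the maps consistent across frames.
import os
import cv2

frames_dir = "my_sequence/input"   # placeholder path
edges_dir = "my_sequence/edges"    # placeholder path
os.makedirs(edges_dir, exist_ok=True)

for name in sorted(os.listdir(frames_dir)):
    img = cv2.imread(os.path.join(frames_dir, name), cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(img, 100, 200)   # thresholds are placeholders, tune for your data
    cv2.imwrite(os.path.join(edges_dir, name), edges)
```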

Oh, the commented lines :-) ... I should have cleaned the repo better before publishing it ... well, I was busy and in a hurry, as I always am. Yes, these lines (and many more that I removed) were part of some old experiments, and many of those experiments were meant to solve the flickering. self.temporal_frames = 3 is a residual of code that was taking patches from multiple consecutive frames of the video, to introduce some temporal information to the network. Well ... it was not working well, and over time we converged to the state that is published in the paper and this repo. So it is not that I was planning to implement something but decided not to publish it; it is more that the commented code is something we tried that did not work well, or for which we found a better approach. But yeah, training a network on multiple consecutive frames is done in vid2vid papers all the time; it is just not obvious how to apply it in our use case/repo ... you can try something along this direction, and maybe it will work :-)
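
If you want to experiment in that direction, a rough illustration (this is not the removed experimental code, just a sketch of the idea) would be to sample the same patch location from a few consecutive frames and stack them along the channel axis:

```python
# Rough sketch: sample the same patch location from a few consecutive frames
# and stack them along the channel axis, so the network sees a short temporal
# window instead of a single frame.
import numpy as np

def sample_temporal_patch(frames, frame_idx, top, left, patch_size=32, temporal_frames=3):
    """frames: list of HxWx3 uint8 arrays; returns patch_size x patch_size x (3*temporal_frames)."""
    half = temporal_frames // 2
    stack = []
    for t in range(frame_idx - half, frame_idx + half + 1):
        t = min(max(t, 0), len(frames) - 1)   # clamp at sequence boundaries
        stack.append(frames[t][top:top + patch_size, left:left + patch_size])
    return np.concatenate(stack, axis=2)

# usage with random data: 5 fake 1080p frames, one 32x32 patch spanning 3 frames
frames = [np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8) for _ in range(5)]
patch = sample_temporal_patch(frames, frame_idx=2, top=100, left=200)
assert patch.shape == (32, 32, 9)
```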

And thank you so much for your kind words, and your efforts to make use of and improve our repo!

Cheers, Ondrej