This PR contains many smaller and bigger changes to support a fully masked training workflow.
The structure of the latent cache has changed slightly. The "extra" data is now used for the depth information and for the inpainting conditioning image, and I added a new variable for the mask, which can now be used with all models.
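To make the new layout concrete, a cached entry might look roughly like this (the field names and shapes here are illustrative, not the actual identifiers in the code):

```python
import numpy as np

# Hypothetical sketch of one latent-cache entry after this change.
cache_entry = {
    "latent": np.zeros((4, 64, 64), dtype=np.float32),  # image latents
    "extra": np.zeros((4, 64, 64), dtype=np.float32),   # depth info, or the inpainting conditioning image
    "mask": np.ones((1, 64, 64), dtype=np.float32),     # new: mask, usable by all model types
}
```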
Features:
It is now possible to manually edit the mask in Caption Buddy. The UX is not perfect (it lags on larger images, and the cursor does not reflect the brush size), but I still think it is better than nothing.
The fraction of unmasked steps is now configurable; it was previously hardcoded at 25%. If you want to make sure the model does not learn anything outside of the masked area, set this to 0%.
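Conceptually this is a per-step coin flip; a minimal sketch, assuming a hypothetical `unmasked_probability` setting drawn once per training step:

```python
import random

def use_full_image_loss(unmasked_probability: float) -> bool:
    # Decide, once per training step, whether to compute the loss over
    # the whole image instead of only the masked region. Hypothetical
    # helper: a probability of 0.0 trains strictly inside the mask.
    return random.random() < unmasked_probability
```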
The maximum number of denoising steps during training is now configurable. The idea behind this is simple: if you task the model with predicting the noise in an image that is pure noise, it will probably predict something that is nowhere near the original image. You might get something that still fits the prompt, but with a different composition, which makes the normal MSE loss useless. If you limit the noise to something like 75%, the prediction stays much closer to the actual image composition. This is especially useful for masked training, where the predicted subject might otherwise land completely outside the original masked area. It can also reduce overfitting.
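In sketch form, this amounts to capping the sampled diffusion timestep (the names `num_train_timesteps` and `max_noising_strength` are illustrative, not the actual config keys):

```python
import numpy as np

rng = np.random.default_rng()

def sample_train_timestep(num_train_timesteps: int = 1000,
                          max_noising_strength: float = 0.75) -> int:
    # Instead of sampling from the full schedule [0, num_train_timesteps),
    # cap the highest timestep at a fraction of it, so the noised image
    # never becomes pure noise and still carries the original composition.
    max_t = int(num_train_timesteps * max_noising_strength)
    return int(rng.integers(0, max_t))
```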
A new option to normalize the importance of all masked images (idea taken from https://twitter.com/cloneofsimo/status/1608454983661551616). This means images with smaller masks are trained just as much as images with bigger masks. It is very useful when the subject is not the same size in every image, e.g. a person where some images are close-ups and others are full-body shots.
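A minimal sketch of that normalization, assuming a per-pixel squared error and a 0/1 mask (the function name is illustrative, not the PR's exact code):

```python
import numpy as np

def normalized_masked_mse(pred: np.ndarray, target: np.ndarray,
                          mask: np.ndarray) -> float:
    # Masked MSE divided by the mean mask coverage, so an image whose
    # mask covers 10% of the pixels weighs as much in the loss as one
    # whose mask covers 90%.
    per_pixel = ((pred - target) ** 2) * mask
    return float(per_pixel.mean() / mask.mean())
```

With this scaling, two images with the same per-pixel error inside their masks produce the same loss regardless of mask size.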
And the biggest one: training on masked images is now possible for all model types. This is done with a modified loss function that only considers the masked parts of the image.
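The core idea can be sketched as a mask-weighted MSE (a sketch of the general technique, not the PR's exact implementation):

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray,
               mask: np.ndarray) -> float:
    # Plain MSE weighted by the mask, so only the masked region
    # contributes to the loss (and therefore to the gradients).
    return float((((pred - target) ** 2) * mask).mean())
```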
Fixes:
The model playground now works with inpainting and depth2img models.
The flip probability was evaluated separately for the image, the mask, and the depth map. This could lead to situations where the image was flipped but the mask was not.
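Conceptually, the corrected behavior draws the flip decision once per sample and applies it to all three tensors (an illustrative sketch, not the actual augmentation code):

```python
import random
import numpy as np

def random_flip(image: np.ndarray, mask: np.ndarray, depth: np.ndarray,
                flip_probability: float = 0.5):
    # One coin flip per sample; the same decision is applied to the
    # image, the mask, and the depth map, so they stay aligned.
    if random.random() < flip_probability:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
        depth = depth[:, ::-1]
    return image, mask, depth
```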
Some more fixes for bugs I found along the way, mostly related to inpainting and depth2img training.