RonyAbecidan / ManTraNet-pytorch

Implementation of the famous Image Manipulation/Forgery Detector "ManTraNet" in Pytorch

about training #4

Closed yelusaleng closed 3 years ago

yelusaleng commented 3 years ago

Hi, thanks for your work, it's great for researchers.

I have some questions about the repo. The test script runs fine. However, the training script has a problem: no matter what dataset I use, the loss drops to 0.693 and then stays there. So I'm asking: have you run this training script successfully?

RonyAbecidan commented 3 years ago

Hi, thank you for pointing out this issue.

Let me check what's going on. I didn't observe that in practice, but maybe I didn't wait long enough to see it.

In the meantime, maybe you could play with the learning rate. It could be a reason why the loss stays constant.

yelusaleng commented 3 years ago

@RonyAbecidan, thanks for your response. I have tried various learning rates, but the issue above still occurs. Is it possible that this code works fine for forward propagation, but has problems with backward propagation during training?

RonyAbecidan commented 3 years ago

The authors of the paper succeeded in training their algorithm. Maybe I didn't completely understand how, since the training procedure is not explicitly shared by them. I am going to see what I can do ;)

yelusaleng commented 3 years ago

Many thanks. I will also try to fix the issue.

RonyAbecidan commented 3 years ago

0.693 ≈ ln(2) is the value of the loss for a random binary classifier, so this bug is really bad. We need to understand what is going on.
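For context, that figure is no accident: ln(2) is the binary cross-entropy of a classifier that outputs 0.5 for every pixel, i.e. a coin flip, regardless of the labels. A quick check:

```python
import math
import torch
import torch.nn as nn

# A "random" binary classifier outputs 0.5 everywhere; its BCE against any
# 0/1 target is -log(0.5) = ln(2) ~ 0.6931, whatever the labels are.
preds = torch.full((10,), 0.5)
targets = torch.randint(0, 2, (10,)).float()

loss = nn.BCELoss()(preds, targets)
print(loss.item(), math.log(2))  # both ~0.6931
```

So a loss frozen at 0.693 means the model's predictions carry no information at all about the forgery mask.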

yelusaleng commented 3 years ago

yep

RonyAbecidan commented 3 years ago

Just to be sure: you observed that with your own dataset, and not with the random datasets in my notebook, right?

yelusaleng commented 3 years ago

Yes, I'm sure.

RonyAbecidan commented 3 years ago

Out of curiosity, I tested training the model with a constant forgery mask (a big white square in the left corner) and it works. So it seems there is nothing that forces the model to act randomly on every dataset. Do you work with confidential datasets, or can you share one of them so we can see what's going on?

yelusaleng commented 3 years ago

It's my mistake; I have solved the issue by changing the criterion from nn.BCEWithLogitsLoss() to nn.BCELoss(). By the way, the pytorch_lightning script has some bugs for multi-GPU training. Pytorch is better than pytorch_lightning. Thanks again!
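The fix is consistent with the architecture: the network's final layer is a Sigmoid, so its outputs are already probabilities, while nn.BCEWithLogitsLoss applies a sigmoid internally. Pairing the two squashes the outputs through a second sigmoid, which flattens the gradients and keeps the loss pinned near ln(2) ≈ 0.693. A minimal sketch of the mismatch (the tensors here are illustrative, not from the repo):

```python
import torch
import torch.nn as nn

# Outputs of a Sigmoid head: already probabilities in (0, 1),
# here very confident and correct.
probs = torch.tensor([0.01, 0.99])
targets = torch.tensor([0.0, 1.0])

# Wrong pairing: BCEWithLogitsLoss treats the probabilities as logits and
# applies sigmoid again, mapping them into the narrow band (0.5, 0.73).
double_sigmoid_loss = nn.BCEWithLogitsLoss()(probs, targets)

# Right pairing for a Sigmoid head: BCELoss consumes probabilities directly.
correct_loss = nn.BCELoss()(probs, targets)

print(double_sigmoid_loss.item())  # ~0.51, barely better than chance
print(correct_loss.item())         # ~0.01, reflects the confident predictions
```

Even for near-perfect predictions, the double-sigmoid loss stays close to the random baseline, which matches the stuck-at-0.693 symptom.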

RonyAbecidan commented 3 years ago

OK. I used nn.BCELoss so as to keep the sigmoid at the final layer, so that we can immediately see the forgery mask in the outputs. I am not sure about the bugs you mention for pytorch-lightning, but I am sure they can be fixed. Pytorch-lightning is a wrapper around Pytorch, so saying that one is better than the other is debatable. I like pytorch-lightning for its simplicity and code structure, but everyone has their own preferences ;)

RonyAbecidan commented 3 years ago

I'll close the issue now. Thank you for your responsiveness =)

yelusaleng commented 3 years ago

I understand that. I think pytorch_lightning is not sufficiently optimized: when I use the same batch size with both pytorch_lightning and pytorch, pytorch_lightning shows an OOM error, but pytorch does not. But none of that matters anymore, and I still appreciate your response.

yelusaleng commented 3 years ago

In addition, I made one change to your code for multi-GPU training: I moved the code at lines 528-532 down to 544-548, so the Bayar masks are now created inside forward:

        self.end = nn.Sequential(nn.Conv2d(8, 1, 7, 1, padding=3),
                                 nn.Sigmoid())

    def forward(self, x):
        B, nb_channel, H, W = x.shape

        # Normalization: map pixel values from [0, 255] to [-1, 1]
        x = x / 255. * 2 - 1

        ## Image Manipulation Trace Feature Extractor
        # Build the Bayar masks on the module's current device (float32,
        # matching the conv weights)
        self.bayar_mask = torch.ones(5, 5, device=self.device)
        self.bayar_mask[2, 2] = 0

        self.bayar_final = torch.zeros(5, 5, device=self.device)
        self.bayar_final[2, 2] = -1

        ## **Bayar constraints**
        self.BayarConv2D.weight.data *= self.bayar_mask

RonyAbecidan commented 3 years ago

I don't know if it is optimal, because you are re-creating the masks at every forward pass.

yelusaleng commented 3 years ago

The reason I changed the code is that it raises an error saying self.bayar_mask and self.BayarConv2D.weight.data are on different devices ('cuda:0' and 'cuda:1') when I try multi-GPU training.

RonyAbecidan commented 3 years ago

Yes, I understand. Maybe you could add a condition like "if self.bayar_mask does not exist yet, build it" to avoid re-building it every time.
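Another way to get the same effect while staying multi-GPU friendly is to register the masks as buffers in __init__: buffers follow the module when it is moved with .to(device) and are replicated onto each GPU by nn.DataParallel, so forward never has to rebuild them. A self-contained sketch (BayarConstrainedConv is a hypothetical stand-in for the relevant part of the model, not the repo's actual class):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class BayarConstrainedConv(nn.Module):
    """Hypothetical stand-in for the Bayar-constrained layer of the model."""
    def __init__(self):
        super().__init__()
        self.BayarConv2D = nn.Conv2d(3, 3, 5, padding=2, bias=False)
        # Buffers travel with .to(device)/.cuda() and are replicated per GPU
        # by nn.DataParallel, so they always live on the weights' device.
        bayar_mask = torch.ones(5, 5)
        bayar_mask[2, 2] = 0
        bayar_final = torch.zeros(5, 5)
        bayar_final[2, 2] = -1
        self.register_buffer("bayar_mask", bayar_mask)
        self.register_buffer("bayar_final", bayar_final)

    def forward(self, x):
        # Bayar constraint: zero the kernel centre, normalise the remaining
        # weights to sum to 1, then force the centre to -1.
        self.BayarConv2D.weight.data *= self.bayar_mask
        self.BayarConv2D.weight.data /= self.BayarConv2D.weight.data.sum(dim=(2, 3), keepdim=True)
        self.BayarConv2D.weight.data += self.bayar_final
        return self.BayarConv2D(x)

module = BayarConstrainedConv()
out = module(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```

With buffers there is no per-forward allocation and no device mismatch, since each replica of the module carries its own copy of the masks.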

yelusaleng commented 3 years ago

OK. many thanks.

yelusaleng commented 3 years ago

Hi, I still have some questions about the training. I tried to train the Local Anomaly Detection Network (LADN) while freezing the weights of the image manipulation trace feature extractor. However, the LADN does not converge. Although I fail to see the problem in your code, I take the liberty of asking: can your code only be applied for testing, and not for training?

RonyAbecidan commented 3 years ago

Hello. I have not explicitly coded anything that prevents the model from being trained. It can be used for training, and you can see this with very simple cases (for instance, if you use a constant mask for every image, the training works). However, I admit that knowing how to properly train it on a new dataset is difficult, and the authors of the paper didn't really share this information. I will be happy to find out how to train it correctly with the help of the community ;)
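The constant-mask sanity check described above can be reproduced with a tiny stand-in network (this is not ManTraNet itself; the architecture, image size, and brightness cue are all illustrative). If gradients flow correctly through a Sigmoid head paired with nn.BCELoss, the loss should fall well below the 0.693 random baseline:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in network ending in a Sigmoid, paired with nn.BCELoss as in
# the repo. This is NOT ManTraNet, just a sketch of the sanity check.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Constant forgery mask: a white square in the top-left corner.
mask = torch.zeros(1, 1, 32, 32)
mask[..., :16, :16] = 1.0

losses = []
for step in range(200):
    x = torch.rand(4, 3, 32, 32) + 0.5 * mask  # "forged" region is brighter
    pred = model(x)
    loss = criterion(pred, mask.expand(4, -1, -1, -1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should fall well below 0.693
```

If a setup like this converges but a real dataset does not, the problem is more likely in the data pipeline or the training recipe (learning rate, mask encoding, class balance) than in the forward/backward code itself.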

yelusaleng commented 3 years ago

OK, let's try together.