SunnyHaze / IML-ViT

Official repository of the paper “IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer”
MIT License

Trouble training the model #3

Closed fedakhalid closed 8 months ago

fedakhalid commented 9 months ago

Hello, I am currently working on fine-tuning your model on my dataset, but I don't know how to train it. I followed the piece of code you provided:

model = iml_vit_model(vit_pretrain_path='./pretrained-weights/mae_pretrain_vit_base.pth', edge_lambda=20)

It runs as it should (printing that the weights have been loaded from the path), but after this step I'm not sure how to continue in order to train it. I have my dataset in the form of a ManiDataset as provided in your code. Any help would be deeply appreciated!

SunnyHaze commented 9 months ago

Hi, thanks for your interest in our work. Once you have initialized the model object, a typical training epoch should contain a forward pass like this:

predict_loss, predict_mask, edge_loss = model(images, masks, edge_mask)

Here, predict_loss and edge_loss are scalar loss values, and predict_mask is the predicted mask.

Of course, you still need to add the usual code for CUDA or distributed training. You can also follow the inference process in Demo.ipynb to build your training loop. If you have further or more detailed questions, please let me know.
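For a concrete starting point, a minimal single-GPU epoch might look like the sketch below (the train_loader, optimizer choice, and learning rate are placeholders, not our official training configuration):

```python
import torch

# Minimal training-epoch sketch; `train_loader` is assumed to yield
# (images, masks, edge_masks) batches and is not part of the repository code.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for images, masks, edge_masks in train_loader:
    images = images.to(device)
    masks = masks.to(device)
    edge_masks = edge_masks.to(device)

    # Forward pass: the model itself returns the loss values.
    predict_loss, predict_mask, edge_loss = model(images, masks, edge_masks)

    optimizer.zero_grad()
    predict_loss.backward()
    optimizer.step()
```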

Since the paper is still under review, we will release the official training code after the outcome is decided. It may come soon; please stay tuned.

fedakhalid commented 9 months ago

Thank you for your response. I really appreciate it, as it will greatly help me with my bachelor thesis.

I just want to confirm my understanding, and I apologize if my questions are trivial, as I am not yet experienced in training vision transformers outside of Hugging Face. So essentially, if I want what can be considered 10 epochs, or any n number of training epochs, should I run the code snippet you provided n times, for example in a loop, and will that translate into the model learning as these values keep updating per epoch? Is that correct?

How do I go about measuring accuracy in this particular case once I start to test the new model? Aside from visually inspecting the prediction masks, are predict_loss and edge_loss the only values I can use to track progress?

SunnyHaze commented 9 months ago

Honestly speaking, this project is also my bachelor thesis. And there is no need to apologize; everyone starts out with no experience.

To answer your first question in detail: that understanding is not quite correct. I am not sure how familiar you are with deep learning, but predict_loss, predict_mask, edge_loss = model(images, masks, edge_mask) only performs the forward pass. This means you have to manually call backward() on the loss value (in this case, predict_loss) and update the weights with an optimizer, including various settings like the learning rate.

More simply, you may check out a PyTorch training tutorial for guidance. The iml_vit_model class is similar to the object returned by torchvision.models.vgg16() or torchvision.models.resnet50(): it does not contain the backward process (i.e., the training step), so you have to implement that yourself.

Besides, for measuring accuracy, I recommend you split your dataset into train/test sets and periodically (every n epochs you like) compute the F1 score on the test set, for example with the sketch below. This helps you monitor how the model converges.
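Here is a simple pixel-level F1 sketch (assuming the predicted mask is a probability map in [0, 1]; this is not our exact evaluation code):

```python
import torch

@torch.no_grad()
def pixel_f1(pred_mask, gt_mask, thresh=0.5, eps=1e-8):
    # Binarize the predicted probability map, then compute pixel-level F1
    # against the binary ground-truth mask.
    pred = (pred_mask > thresh).float()
    tp = (pred * gt_mask).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt_mask.sum() + eps)
    return (2 * precision * recall / (precision + recall + eps)).item()
```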

Hope this helps; feel free to ask if you have further questions.

fedakhalid commented 9 months ago

Thank you for the explanation!

I have a question about the loss function and the predict_loss value. Logically, I assumed the target to pass to the loss function (in my case, CrossEntropyLoss) would be the ground-truth mask, but there is obviously a mismatch in shapes and types. I have attempted multiple things to reconcile these mismatches, but I always get a "Dimension out of range (expected to be in range of [-1, 0], but got 1)" error.

I did find a workaround: when I don't use a loss function and simply do a backward pass without it, my epoch runs, but 1) the running time per epoch is really long (about 2 hours, which I don't think is normal for a 160-image dataset, but please correct me if I'm wrong), so I have not yet been able to analyze the results, and 2) I know the lack of a loss function is highly irregular and might not lead to meaningful updates to the model, so I'm not too optimistic about the results.

I did check out tutorials and forums for guidance, but nothing has helped with this particular problem, so I'm wondering what could be done here. Also, I don't know if it makes a difference, but I am currently relying on PyTorch's autograd backward() method.

SunnyHaze commented 9 months ago

I am not sure whether I understand you correctly, but I need to point out that the predict_loss in iml_vit_model is already a value calculated with nn.BCEWithLogitsLoss(); you can check it here:

https://github.com/SunnyHaze/IML-ViT/blob/319944341ebac800bac0d2deae6d677f280b3d64/iml_vit_model.py#L83

https://github.com/SunnyHaze/IML-ViT/blob/319944341ebac800bac0d2deae6d677f280b3d64/iml_vit_model.py#L121

Thus, there is no need to calculate the loss again; simply calling backward() on this predict_loss value is enough. I suspect you may be calculating a loss between a single scalar (predict_loss) and an (N, C, H, W) tensor (the ground truth), which would be incorrect; if I am wrong, please let me know. I think this also resolves your concern about the "lack of a loss function", since the true loss function is actually hidden in the lines mentioned above.
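In code, the pattern looks like this (a sketch; the optimizer is assumed to be set up as in a standard PyTorch training loop):

```python
# Incorrect: predict_loss is already a scalar loss, not a prediction, so
# passing it to another criterion raises the dimension error you saw.
# loss = torch.nn.CrossEntropyLoss()(predict_loss, masks)  # wrong

# Correct: call backward() on the returned loss directly.
predict_loss, predict_mask, edge_loss = model(images, masks, edge_mask)
optimizer.zero_grad()
predict_loss.backward()
optimizer.step()
```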

Besides, I don't know what device or GPU you are using, but I can give you a reference point for speed: we trained the model on the full CASIAv2 dataset (12,000 images) with two NVIDIA 3090 GPUs, and it took about 3 days + 12 hours.

Hope this information helps you. If you have further questions, please let me know.

SunnyHaze commented 9 months ago

I am sorry that I didn't mention that the 3 days + 12 hours is the time for training 200 epochs on 12,000 images. Thus, it takes about 25 minutes to train a single epoch with two NVIDIA 3090 GPUs.
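(For reference: 3 days + 12 hours = 84 hours = 5,040 minutes, and 5,040 / 200 epochs ≈ 25 minutes per epoch.)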