WisconsinAIVision / UniversalFakeDetect


about training device #13

Open long280 opened 7 months ago

long280 commented 7 months ago

Hello, thank you very much for your work. I would like to ask what hardware you used when training your code. Also, why is the fc_weight file provided with your paper only 4 KB, while the model file I trained is 1~2 GB?

255doesnotexist commented 2 months ago

They may have uploaded a git-lfs pointer file instead of the full pretrained model... On a Tesla P40, I need 6~7 s per iteration, which means roughly ~20 min before the first loss output appears. Also, with the default hyperparameters it takes no less than 13.5 GB of VRAM to train on the GenImage SD-V1.4 dataset with batch size 256. One epoch takes ~1.5 hrs.

255doesnotexist commented 2 months ago

UPD: The validation results are not so good, though. I don't know if I got anything wrong in the training process. I performed training by addressing fake/real_list_path manually in the command args, just let it run, and saved the early-stop checkpoints, but it gives a very low AP score on the GenImage val set.

oceanzhf commented 1 month ago

> They may have uploaded a git-lfs pointer file instead of the full pretrained model... On a Tesla P40, I need 6~7 s per iteration, which means roughly ~20 min before the first loss output appears. Also, with the default hyperparameters it takes no less than 13.5 GB of VRAM to train on the GenImage SD-V1.4 dataset with batch size 256. One epoch takes ~1.5 hrs.
>
> UPD: The validation results are not so good, though. I don't know if I got anything wrong in the training process. I performed training by addressing fake/real_list_path manually in the command args, just let it run, and saved the early-stop checkpoints, but it gives a very low AP score on the GenImage val set.

I would like to ask: why does it take two hours to run one epoch? Is this normal? And how did you perform training by addressing fake/real_list_path manually in the command args?

255doesnotexist commented 1 month ago

> Why does it take two hours to run one epoch? Is this normal?

IDK, but it may be normal, since no runtime exception was thrown during the training process.

> How did you perform training by addressing fake/real_list_path manually in the command args?

You should modify data/datasets.py and add a data mode such as 'manually':

elif opt.data_mode == 'manually':
    real_list = get_list(os.path.join(opt.real_list_path))
    fake_list = get_list(os.path.join(opt.fake_list_path))

Because there is no .pickle file at that path, this just triggers a recursive search of your image dataset path.

Point real_list_path and fake_list_path at your real and fake image directories and it works right away.
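For reference, the fallback behavior described above (no .pickle found, so the list is built by walking the directory tree) can be sketched roughly as follows. This is a hypothetical stand-in for the repo's own get_list helper, not the actual implementation:

```python
import os

# Common image extensions; the repo's own helper may accept a different set.
IMG_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".webp"}

def get_list(root: str) -> list:
    """Recursively collect image file paths under `root`."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMG_EXTS:
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

With this in place, passing a directory of real images as real_list_path and a directory of fakes as fake_list_path is enough for the loader to enumerate both sets.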

oceanzhf commented 1 month ago


Thank you very much for your answer. When you validate with your trained .pth file, do you encounter this error: RuntimeError: Error(s) in loading state_dict for Linear: Missing key(s) in state_dict: 'weight', 'bias'. Unexpected key(s) in state_dict: 'model', 'optimizer', 'total_steps'?
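The error message above suggests the training script saved a full training checkpoint (a dict with 'model', 'optimizer', and 'total_steps' keys) rather than a bare state_dict, so load_state_dict on the Linear layer finds the wrong keys. A minimal sketch of a workaround, assuming the weights live under the 'model' key as the error implies (load_fc_weights and the file name are hypothetical):

```python
import torch
import torch.nn as nn

def load_fc_weights(linear: nn.Linear, ckpt_path: str) -> nn.Linear:
    """Load a Linear layer from either a bare state_dict or a full
    training checkpoint shaped like {'model': ..., 'optimizer': ...,
    'total_steps': ...} (the keys the RuntimeError reports)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # If this looks like a full training checkpoint, unwrap the model weights.
    state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    linear.load_state_dict(state_dict)
    return linear
```

Usage would be something like `fc = load_fc_weights(nn.Linear(768, 1), "checkpoint.pth")`, with the input dimension matching the feature size used in training.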