chen1234520 commented 3 years ago

I hope you can help me! Here is my training configuration and log.

configurations = {
1: dict(
    SEED = 1337, # random seed to reproduce results

    DATA_ROOT = '/lustre/users/chenguang/data/face', # the parent root where your train/val/test data are stored
    MODEL_ROOT = './model/IR_50', # the root to buffer your checkpoints
    LOG_ROOT = './log', # the root to log your train/val status
    BACKBONE_RESUME_ROOT = './weight/AsiaFace_bh-ir50/backbone_ir50_asia.pth', # the root to resume training from a saved checkpoint
    HEAD_RESUME_ROOT = './', # the root to resume training from a saved checkpoint

    BACKBONE_NAME = 'IR_50', # support: ['ResNet_50', 'ResNet_101', 'ResNet_152', 'IR_50', 'IR_101', 'IR_152', 'IR_SE_50', 'IR_SE_101', 'IR_SE_152']
    HEAD_NAME = 'ArcFace', # support:  ['Softmax', 'ArcFace', 'CosFace', 'SphereFace', 'Am_softmax']
    LOSS_NAME = 'Focal', # support: ['Focal', 'Softmax']

    INPUT_SIZE = [112, 112], # support: [112, 112] and [224, 224]
    RGB_MEAN = [0.5, 0.5, 0.5], # for normalize inputs to [-1, 1]
    RGB_STD = [0.5, 0.5, 0.5],
    EMBEDDING_SIZE = 512, # feature dimension
    BATCH_SIZE = 256,   # 512
    DROP_LAST = True, # whether drop the last batch to ensure consistent batch_norm statistics
    LR = 0.1, # initial LR
    NUM_EPOCH = 125, # total epoch number (use the first 1/25 epochs to warm up)
    WEIGHT_DECAY = 5e-4, # do not apply to batch_norm parameters
    MOMENTUM = 0.9,
    STAGES = [35, 65, 95], # epoch stages to decay learning rate

    DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
    MULTI_GPU = True, # flag to use multiple GPUs; if you choose to train with single GPU, you should first run "export CUDA_VISIBLE_DEVICES=device_id" to specify the GPU card you want to use
    GPU_ID = [0, 1, 2, 3], # specify your GPU ids
    PIN_MEMORY = True,
    NUM_WORKERS = 0,

), }

Epoch 59/125 Batch 112176/237750 Training Loss 19.4058 (19.3695) Training Prec@1 0.000 (0.000) Training Prec@5 0.000 (0.000)
============================================================
Epoch 59/125 Batch 112195/237750 Training Loss 19.2279 (19.3684) Training Prec@1 0.000 (0.000) Training Prec@5 0.000 (0.000)
============================================================
Epoch 59/125 Batch 112214/237750 Training Loss 19.6988 (19.3685) Training Prec@1 0.000 (0.000) Training Prec@5 0.000 (0.000)
============================================================
Epoch: 59/125 Training Loss 19.4309 (19.3681) Training Prec@1 0.000 (0.000) Training Prec@5 0.000 (0.000)
100%|██████████| 1902/1902 [21:11<00:00, 1.50it/s]
32%|███▏ | 616/1902 [06:52<
============================================================
Perform Evaluation on LFW, CFP_FF, CFP_FP, AgeDB, CALFW, CPLFW and VGG2_FP, and Save Checkpoints...
Epoch 59/125, Evaluation: LFW Acc: 0.974, CPLFW Acc: 0.8041666666666666
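The counters in this log are consistent with each other, which a little arithmetic confirms. This is a minimal sketch, assuming the batch counter is cumulative across epochs and the epoch number is displayed 1-based (both inferred from the log itself, not from the training script); the function names are made up for illustration.

```python
def batches_per_epoch(total_batches, num_epochs):
    """Batches in one epoch, given the cumulative total shown in the log."""
    return total_batches // num_epochs

def displayed_epoch(batch_index, per_epoch):
    """1-based epoch number that a cumulative batch index falls into."""
    return batch_index // per_epoch + 1

per_epoch = batches_per_epoch(237750, 125)
print(per_epoch)                           # 1902, the total shown by the progress bar
print(displayed_epoch(112214, per_epoch))  # 59
```

In other words, the run has made about 59 passes over the data while the running average loss stays near 19.37 and both Prec@1 and Prec@5 remain at 0.000.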

sriktrako commented 3 years ago

Hi @chen1234520, I am trying to train on a dataset with the same config as yours, but I cannot figure out the data format required for training. Currently my data is inside

D:/face.evoLVe.PyTorch/data/dataV1/

Inside the dataV1 directory the data is laid out as follows:

    -> id1/
        -> 1.jpg
        -> ...
    -> id2/
        -> 1.jpg
        -> ...
    -> ...

The data is already aligned and resized to 112 using the align script provided in the repo. When I run train.py, I get a file not found error. I have seen a lot of people facing the same issue, unable to work out the correct data format.

It would help a lot of people if you could explain how to get the dataset into the correct format for training. Help would be much appreciated, thank you.

chen1234520 commented 3 years ago

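My guess is that your data path is wrong. I suggest checking whether the training data path produced by the DATA_ROOT parameter in config.py and the dataset_train parameter in train.py matches the actual location of your training data.

Also, if the loss does not decrease or becomes NaN, load a pretrained model, or lower the batch size and the initial learning rate.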
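The path check suggested above can be sketched in a few lines of Python. This is a minimal sketch, not code from the repository: the function name `list_class_dirs` and the one-subfolder-per-identity layout are assumptions for illustration, and the directory to pass in is whatever train.py actually hands to its training dataset.

```python
import os

def list_class_dirs(train_dir):
    """Return the sorted identity subfolders of train_dir, or raise if it is missing."""
    if not os.path.isdir(train_dir):
        raise FileNotFoundError(f"training directory not found: {train_dir}")
    # Keep only subdirectories; loose files next to them are ignored.
    return sorted(
        name for name in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, name))
    )
```

If the number of folders returned differs from the "Number of Training Classes" that train.py prints, the script is probably reading a different directory than the one you expect.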

sriktrako commented 3 years ago

Hi @chen1234520, thanks for the response.

How do I generate the meta and sizes files?

My DATA_ROOT = 'D:/face.evoLVe.PyTorch/data/dataV1'

Actual data:

    D:/face.evoLVe.PyTorch/data/dataV1/Id1/1.jpg 2.jpg .....
    D:/face.evoLVe.PyTorch/data/dataV1/Id2/1.jpg 2.jpg .....

I don't have any files other than .jpg files inside the dataV1 directory.

Here's the exact output when I run train.py:

Overall Configurations: {'SEED': 1337, 'DATA_ROOT': 'D:/face.evoLVe.PyTorch/data/dataV1', 'MODEL_ROOT': './model', 'LOG_ROOT': './log', 'BACKBONE_RESUME_ROOT': './model/weights/backbone_ir50_asia.pth', 'HEAD_RESUME_ROOT': './', 'BACKBONE_NAME': 'IR_50', 'HEAD_NAME': 'ArcFace', 'LOSS_NAME': 'Focal', 'INPUT_SIZE': [112, 112], 'RGB_MEAN': [0.5, 0.5, 0.5], 'RGB_STD': [0.5, 0.5, 0.5], 'EMBEDDING_SIZE': 512, 'BATCH_SIZE': 512, 'DROP_LAST': True, 'LR': 0.1, 'NUM_EPOCH': 125, 'WEIGHT_DECAY': 0.0005, 'MOMENTUM': 0.9, 'STAGES': [35, 65, 95], 'DEVICE': device(type='cpu'), 'MULTI_GPU': True, 'GPU_ID': [0, 1], 'PIN_MEMORY': True, 'NUM_WORKERS': 0}

Number of Training Classes: 5749
Traceback (most recent call last):
  File "train.py", line 84, in <module>
    lfw, cfp_ff, cfp_fp, agedb, calfw, cplfw, vgg2_fp, lfw_issame, cfp_ff_issame, cfp_fp_issame, agedb_issame, calfw_issame, cplfw_issame, vgg2_fp_issame = get_val_data(DATA_ROOT)
  File "D:\srikarRnD\dev\face.evoLVe.PyTorch\util\utils.py", line 63, in get_val_data
    lfw, lfw_issame = get_val_pair(data_path, 'lfw')
  File "D:\srikarRnD\dev\face.evoLVe.PyTorch\util\utils.py", line 56, in get_val_pair
    carray = bcolz.carray(rootdir = os.path.join(path, name), mode = 'r')
  File "bcolz/carray_ext.pyx", line 1067, in bcolz.carray_ext.carray.__cinit__
  File "bcolz/carray_ext.pyx", line 1369, in bcolz.carray_ext.carray._read_meta
FileNotFoundError: [Errno 2] No such file or directory: 'D:/face.evoLVe.PyTorch/data/dataV1\lfw\meta\sizes'

chen1234520 commented 3 years ago

This is not a training data error. You have no validation data. The author's code loads many validation sets by default, so you need to modify the validation data settings if you don't have all of them.

Here is my example; I hope it helps. I only use LFW and CPLFW.

    # In train.py: unpack only the validation sets that are actually loaded.
    lfw, cplfw, lfw_issame, cplfw_issame = get_val_data(DATA_ROOT)

    # In util/utils.py: load only LFW and CPLFW; the other sets are commented out.
    def get_val_data(data_path):
        lfw, lfw_issame = get_val_pair(data_path, 'lfw_align_112/lfw')
        # cfp_ff, cfp_ff_issame = get_val_pair(data_path, 'cfp_ff')
        # cfp_fp, cfp_fp_issame = get_val_pair(data_path, 'cfp_fp')
        # agedb_30, agedb_30_issame = get_val_pair(data_path, 'agedb_30')
        # calfw, calfw_issame = get_val_pair(data_path, 'calfw')
        # cplfw, cplfw_issame = get_val_pair(data_path, 'cplfw')
        cplfw, cplfw_issame = get_val_pair(data_path, 'cplfw_align_112/cplfw')
        # vgg2_fp, vgg2_fp_issame = get_val_pair(data_path, 'vgg2_fp')

        # Return order must match the unpacking in train.py.
        return lfw, cplfw, lfw_issame, cplfw_issame
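Before editing get_val_data, it can help to see which validation sets are actually on disk. The sketch below is an assumption-laden illustration, not part of the repository: it treats a set as present when `<data_root>/<name>/meta/sizes` exists, which is inferred only from the path in the FileNotFoundError earlier in this thread, and `find_val_sets` is a hypothetical helper name.

```python
import os

# Folder names taken from the get_val_pair calls quoted in this thread.
DEFAULT_VAL_SETS = ["lfw", "cfp_ff", "cfp_fp", "agedb_30", "calfw", "cplfw", "vgg2_fp"]

def find_val_sets(data_root, names=DEFAULT_VAL_SETS):
    """Split names into (present, missing) by looking for <name>/meta/sizes."""
    present, missing = [], []
    for name in names:
        marker = os.path.join(data_root, name, "meta", "sizes")
        if os.path.isfile(marker):
            present.append(name)
        else:
            missing.append(name)
    return present, missing
```

Any name that ends up in the missing list is a candidate for commenting out in get_val_data, together with the matching variables in train.py.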

abhiksark commented 3 years ago

Hi @chen1234520, how did you resolve this?

ZhaoJ9014 / face.evoLVe

Training Loss 19.6988. Can't continue to fall #165
