davidsandberg / facenet

Face recognition using Tensorflow
MIT License

train classifier OutOfRangeError #210

Closed. LittlePeng closed this issue 7 years ago.

LittlePeng commented 7 years ago

command:

➜ facenet git:(master) ✗ python src/facenet_train_classifier.py --logs_base_dir ./logs/ --models_base_dir ./model/ --data_dir ~/data2/mscelebv1/merged_mtcnn --image_size 160 --model_def models.inception_resnet_v1 --lfw_dir ~/data2/mscelebv1/merged_mtcnn_lfw2 --lfw_pairs ~/data2/mscelebv1/merged_mtcnn_lfw/pairs.txt --optimizer RMSPROP --learning_rate -1 --max_nrof_epochs 80 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file data/learning_rate_schedule_classifier_casia.txt --weight_decay 5e-5 --center_loss_factor 1e-4 --center_loss_alfa 0.9 --epoch_size 2 --batch_size 50

output:

Model directory: ./model/20170319-215201
Log directory: ./logs/20170319-215201
LFW directory: /data2/mscelebv1/merged_mtcnn_lfw2
Total number of classes: 5285
Total number of examples: 301320

Running training
Epoch [0][1/2] begin ...
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 3676 get requests, put_count=2349 evicted_count=1000 eviction_rate=0.425713 and unsatisfied allocation rate=0.660229
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Epoch: [0][1/2] Time 4.132  Loss 26.845 RegLoss 17.747
Epoch [0][2/2] begin ...
Epoch: [0][2/2] Time 3.211  Loss 26.086 RegLoss 16.965
Saving variables
Variables saved in 0.49 seconds
Saving metagraph
Metagraph saved in 5.50 seconds
Runnning forward pass on LFW images
Accuracy: 0.503+-0.006
Validation rate: 0.00100+-0.00213 @ FAR=0.00000
Epoch [1][1/2] begin ...
W tensorflow/core/framework/op_kernel.cc:993] Invalid argument: Invalid PNG header, data size 4293
     [[Node: DecodePng_2 = DecodePng[channels=0, dtype=DT_UINT8, _device="/job:localhost/replica:0/task:0/cpu:0"](ReadFile_2)]]
W tensorflow/core/framework/op_kernel.cc:993] Invalid argument: Invalid PNG header, data size 4293
     [[Node: DecodePng_2 = DecodePng[channels=0, dtype=DT_UINT8, _device="/job:localhost/replica:0/task:0/cpu:0"](ReadFile_2)]]
Epoch: [1][1/2] Time 0.798  Loss 20.633 RegLoss 11.442
Epoch [1][2/2] begin ...
W tensorflow/core/framework/op_kernel.cc:993] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 50, current size 0)
     [[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, _recv_batch_size_0)]]
W tensorflow/core/framework/op_kernel.cc:993] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 50, current size 0)
     [[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, _recv_batch_size_0)]]
W tensorflow/core/framework/op_kernel.cc:993] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 50, current size 0)
     [[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, _recv_batch_size_0)]]
W tensorflow/core/framework/op_kernel.cc:993] Out of range: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 50, current size 0)

    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_batch_join/fifo_queue' is closed and has insufficient elements (requested 50, current size 0)
     [[Node: batch_join = QueueDequeueUpToV2[component_types=[DT_FLOAT, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](batch_join/fifo_queue, _recv_batch_size_0)]]

Caused by op u'batch_join', defined at:
  File "src/facenet_train_classifier.py", line 469, in <module>
    main(parse_arguments(sys.argv[1:]))
  File "src/facenet_train_classifier.py", line 144, in main
    allow_smaller_final_batch=True)

My dataset has 5K+ classes and about 300K images (Total number of classes: 5285, Total number of examples: 301320).

Related: #105, "OutOfRangeError occurred during training".

I tried a small batch size (--epoch_size 2 --batch_size 50), but it still crashes in the 2nd epoch. Why?

ugtony commented 7 years ago

@LittlePeng Check whether there are any broken images or irrelevant files (e.g., *.jpg) in your subfolders.
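
One way to run that check is sketched below. It assumes the dataset is meant to contain only .png files (the log above shows a DecodePng op failing with "Invalid PNG header"), and it only compares the first 8 bytes of each file against the standard PNG signature, so a file that is truncated or corrupted further in would not be caught. The function name find_suspect_files is hypothetical and not part of facenet.

```python
import os

# The first 8 bytes of every valid PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_suspect_files(data_dir):
    """Return paths under data_dir that are not .png or lack the PNG signature.

    Hypothetical helper, not part of facenet. It checks the header only,
    not the full image contents.
    """
    suspects = []
    for root, _dirs, files in os.walk(data_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Anything without a .png extension is treated as irrelevant.
            if not name.lower().endswith(".png"):
                suspects.append(path)
                continue
            # A .png file whose header is wrong would fail in DecodePng.
            with open(path, "rb") as f:
                header = f.read(len(PNG_SIGNATURE))
            if header != PNG_SIGNATURE:
                suspects.append(path)
    return suspects
```

Calling find_suspect_files on the directory passed as --data_dir and printing the result should list the files worth inspecting or removing before training.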

LittlePeng commented 7 years ago

@ugtony Thanks.

Some of the MTCNN-aligned .png files came out rotated by 90 degrees, so I had edited (rotated) them on my Mac. After cleaning out those files, it works.