cavalleria / cavaface

Face recognition training project (PyTorch)
MIT License

Some problems with MegaFace evaluation #28

Closed ReverseSystem001 closed 4 years ago

ReverseSystem001 commented 4 years ago

When I evaluate the MobileFaceNet model, the MegaFace evaluation works well. But when I evaluate GhostNet, it reports an error like this:

Total Tensors: 4096752 Used Memory: 15.70M
The allocated memory on cuda:0: 77.55M
Memory differs due to the matrix alignment or invisible gradient buffer tensors

min gpu free mem: 8000000000.0 B
min gpu free mem: 8000000000.0 B
min gpu free mem: 102000000 B
min gpu free mem: 162000000 B
Finish loading model /home/vision_rd/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth, infer with shape: (198, 3, 112, 112)
Loading model time cost: 43.907608 seconds.

Extract on megaface...
Noisy faces of scrub: 605
Noisy faces of gallery: 707
Begin to extract embedding of scrub faces...
Finish Load path of faces: 0/3530
begin thread

Segmentation fault: 11

Stack trace:
[bt] (0) /usr/local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x41a8280) [0x7f6276fd8280]
[bt] (1) /lib64/libc.so.6(+0x363b0) [0x7f635a6483b0]
[bt] (2) /lib64/libc.so.6(cfree+0x1c) [0x7f635a697ecc]
[bt] (3) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x4eda39) [0x7f62b277da39]
[bt] (4) /usr/lib64/python3.6/site-packages/cv2/cv2.cpython-36m-x86_64-linux-gnu.so(+0x168cd5) [0x7f62b23f8cd5]
[bt] (5) /lib64/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x147) [0x7f635b3ea167]
[bt] (6) /lib64/libpython3.6m.so.1.0(+0x1507df) [0x7f635b4557df]
[bt] (7) /lib64/libpython3.6m.so.1.0(_PyEval_EvalFrameDefault+0x3a7) [0x7f635b44a0f7]
[bt] (8) /lib64/libpython3.6m.so.1.0(+0x14f987) [0x7f635b454987]
Eval model: /home/vision_rd/yangwenbo/face_Recognition/models/GhostNet_Arcface/model/Epoch_24_Time_2020-07-21-13-27_checkpoint.pth,24, done!

I don't understand why it reports errors when the only thing I changed is the network model (which has already been trained).

cavalleria commented 4 years ago

I re-evaluated GhostNet and did not hit this bug. Please use these scripts:

MODEL="/workspace/results/GhostNet_Arcface/model_ep24_2.pth,24"
python -u ./main.py \
    --eval_sets="megaface" \
    --model_type=pytorch_fp32 \
    --gpus "0,1,2,3" \
    --net_scale "light" \
    --model_path=${MODEL}
echo "Eval model: ${MODEL}, done!"

MODEL="/workspace/results/GhostNet_Arcface/model_ep24_2.pth,24"
python -u ./main.py \
    --eval_sets="ijbc" \
    --model_type=pytorch_fp32 \
    --gpus "0,1,2,3" \
    --net_scale "light" \
    --model_path=${MODEL}
echo "Eval model: ${MODEL}, done!"

cavalleria commented 4 years ago

A model saved with torch.jit.save can be evaluated without the file that defines the model.

ReverseSystem001 commented 4 years ago

Yes, it requires libtorch to be installed. Otherwise it reports C++ errors when torch.jit.load() is used. Thanks.

ReverseSystem001 commented 4 years ago

I re-evaluated the model I had trained, and it works. Maybe it was a GPU memory problem. Thanks.

XWalways commented 4 years ago

When running

r = requests.post("http://127.0.0.1:%d/eval"%(eval_info["args"].port), data=eval_info).json()

I get Response [404], so .json() raises:

simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

How can I solve this?
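
A 404 body is usually not JSON, which would explain the JSONDecodeError above: the decoder fails on the first character of the error page. It may help to check the status code before decoding, so the failure points at the server (for example, nothing serving /eval on that port) rather than at the parser. The following is a minimal sketch, not part of cavaface: the names parse_eval_response, EvalServerError, and FakeResponse are hypothetical, and the stand-in response class exists only so the sketch runs without a server.

```python
import json


class EvalServerError(RuntimeError):
    """Raised when the eval endpoint does not return usable JSON (hypothetical name)."""


def parse_eval_response(resp):
    """Return the decoded JSON body, or raise with the status code and body text.

    `resp` is anything shaped like requests.Response: it needs
    `status_code`, `text`, and `json()`.
    """
    if resp.status_code != 200:
        # A 404 page is not JSON; decoding it is what yields
        # "Expecting value: line 1 column 1 (char 0)".
        raise EvalServerError(
            "eval endpoint returned HTTP %d: %r" % (resp.status_code, resp.text[:200])
        )
    try:
        return resp.json()
    except ValueError as exc:
        # json and simplejson decode errors both subclass ValueError.
        raise EvalServerError(
            "response body is not JSON: %r" % resp.text[:200]
        ) from exc


class FakeResponse:
    """Stand-in for requests.Response so the sketch runs without a server."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


# With the real library, the call from the question would become:
#   resp = requests.post(url, data=eval_info)
#   r = parse_eval_response(resp)
```

If the helper raises with HTTP 404, the next thing to check is probably whether the evaluation server is actually running and listening on the port stored in eval_info["args"].port.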