RitwikGupta / xView2-Vulcan-Model

MIT License

Inference raises resource limit error #30

Closed pounde closed 3 years ago

pounde commented 4 years ago

Running models of size 154...
=> loading checkpoint 'se154_loc_0_1_best'
loaded first layer
loaded checkpoint 'se154_loc_0_1_best' (epoch 17, best_score 0.8847239081178774)
Assigning model to GPU 1
=> loading checkpoint 'se154_loc_1_1_best'
loaded first layer
loaded checkpoint 'se154_loc_1_1_best' (epoch 21, best_score 0.8572652320139451)
Assigning model to GPU 1
=> loading checkpoint 'se154_loc_2_1_best'
loaded first layer
loaded checkpoint 'se154_loc_2_1_best' (epoch 29, best_score 0.868695268679399)
Assigning model to GPU 1
=> loading checkpoint 'se154_cls_cce_0_tuned_best'
loaded first layer
loaded checkpoint 'se154_cls_cce_0_tuned_best' (epoch 2, best_score 0.7846007167374233)
Assigning model to GPU 0
=> loading checkpoint 'se154_cls_cce_1_tuned_best'
loaded first layer
loaded checkpoint 'se154_cls_cce_1_tuned_best' (epoch 2, best_score 0.759847648024409)
Assigning model to GPU 0
=> loading checkpoint 'se154_cls_cce_2_tuned_best'
loaded first layer
loaded checkpoint 'se154_cls_cce_2_tuned_best' (epoch 2, best_score 0.7933732879639208)
Assigning model to GPU 0
Running inference...
100%|█████████████████████████████████████████| 371/371 [17:30<00:00, 2.83s/it]
100%|█████████████████████████████████████████| 371/371 [38:09<00:00, 6.17s/it]
Traceback (most recent call last):
  File "handler.py", line 592, in <module>
    main()
  File "handler.py", line 467, in main
    results_dict.update({k:v for k,v in return_dict.items()})
  File "<string>", line 2, in items
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
    kind, result = conn.recv()
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
    fd = df.detach()
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/ubuntu/anaconda3/envs/xv2/lib/python3.7/multiprocessing/reduction.py", line 161, in recvfds
    len(ancdata))
RuntimeError: received 0 items of ancdata

pounde commented 3 years ago

Haven't had a recurrence recently. Closing.

pounde commented 3 years ago

Same issue during inference with a large number of chips.

pounde commented 3 years ago

OS issue. Fixed by editing /etc/security/limits.conf to raise the open-file limit:

    * hard nofile 200000
    * soft nofile 200000

200000 may be excessive, but it fixed the issue. Just sayin'.
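
For anyone who can't edit limits.conf (or doesn't want to log out and back in, since those changes normally apply only to new sessions), here is a minimal sketch of checking and raising the open-file limit from inside Python before the workers start. It assumes the error really comes from running out of file descriptors, as the fix above suggests. `raise_nofile_limit` is a hypothetical helper name, not something in handler.py, and it can only raise the soft limit up to whatever hard limit the OS already allows.

```python
import resource


def raise_nofile_limit(target=200000):
    """Raise this process's soft open-file limit toward `target`.

    Hypothetical helper (not part of handler.py). Returns the soft
    limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process can only raise the soft limit up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        new_soft = target
    else:
        new_soft = min(target, hard)

    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= new_soft:
        return soft

    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some platforms reject values above a kernel maximum; keep the old limit.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("open-file limits before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("soft limit now:", raise_nofile_limit())
```

If the hard limit itself is low, this won't get anywhere near 200000 and the limits.conf edit is still needed. Calling it early in the script, before any worker processes are spawned, is my assumption about where it would belong.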