facebookresearch / deepcluster

Deep Clustering for Unsupervised Learning of Visual Features

error 2 out of memory #30

Closed Gvaihir closed 5 years ago

Gvaihir commented 5 years ago

Hi! I'm training AlexNet with PIC on NVIDIA Tesla M60 GPU (AWS g3.4xlarge instance), 800e3 images. After 2-3 epochs I get the following:

Compute features
0 / 3175        Time: 5.483 (5.483)
200 / 3175      Time: 0.824 (0.681)
400 / 3175      Time: 0.611 (0.680)
600 / 3175      Time: 0.794 (0.681)
800 / 3175      Time: 0.611 (0.673)
1000 / 3175     Time: 0.620 (0.676)
1200 / 3175     Time: 0.609 (0.671)
1400 / 3175     Time: 0.810 (0.674)
1600 / 3175     Time: 0.611 (0.670)
1800 / 3175     Time: 0.724 (0.675)
2000 / 3175     Time: 0.829 (0.672)
2200 / 3175     Time: 0.616 (0.674)
2400 / 3175     Time: 0.806 (0.675)
2600 / 3175     Time: 0.609 (0.670)
2800 / 3175     Time: 0.608 (0.666)
3000 / 3175     Time: 0.613 (0.662)
Traceback (most recent call last):
  File "main.py", line 320, in <module>
    main()
  File "main.py", line 152, in main
    clustering_loss = deepcluster.cluster(features, verbose=args.verbose)
  File "/home/aogorodnikov/deepcluster/clustering.py", line 338, in cluster
    I, D = make_graph(xb, self.nnn)
  File "/home/aogorodnikov/deepcluster/clustering.py", line 117, in make_graph
    index = faiss.GpuIndexFlatL2(res, dim, flat_config)
  File "/home/aogorodnikov/anaconda3/envs/imgSudoku/lib/python3.7/site-packages/faiss/__init__.py", line 333, in replacement_init
    original_init(self, *args)
  File "/home/aogorodnikov/anaconda3/envs/imgSudoku/lib/python3.7/site-packages/faiss/swigfaiss.py", line 5430, in __init__
    this = _swigfaiss.new_GpuIndexFlatL2(*args)
RuntimeError: Error in void faiss::gpu::allocMemorySpaceV(faiss::gpu::MemorySpace, void**, size_t) at gpu/utils/MemorySpace.cpp:27: Error: 'err == cudaSuccess' failed: failed to cudaMalloc 1073741824 bytes (error 2 out of memory)

The issue seems to originate in the Faiss library. Can you advise anything from your side? Thanks!

mathildecaron31 commented 5 years ago

Hi,

Reading the error message, it seems that Faiss doesn't realize that the unoccupied cached memory currently held by PyTorch's caching allocator is actually free to use. I've had a similar issue. To solve it with this version of the code (PyTorch 0.2), I dedicated a GPU exclusively to the PIC clustering step in order to avoid conflicts with PyTorch.

With more recent versions of PyTorch, this function is very helpful and fixes the issue. Hope that helps!
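The function referred to above is presumably `torch.cuda.empty_cache()` (the original link is not preserved, so treat that as an assumption). It returns cached-but-unused blocks from PyTorch's caching allocator to the driver, making them available for other libraries to `cudaMalloc`. A hedged sketch of how it would be used here:

```python
# Sketch (assumption): release PyTorch's cached GPU memory right before
# Faiss tries to allocate its index, e.g. before the
# faiss.GpuIndexFlatL2(res, dim, flat_config) call in clustering.py.
import torch

def release_cached_gpu_memory():
    """Return cached-but-unused GPU memory to the driver so other
    libraries (e.g. Faiss) can allocate it."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # no-op if CUDA was never initialized

release_cached_gpu_memory()
```

Note that this only frees memory the allocator is holding in reserve; tensors that are still referenced stay allocated, so it helps exactly in the situation described above where the cache, not live tensors, is crowding out Faiss.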