facebookresearch / LASER

Language-Agnostic SEntence Representations

issue with mine_bitexts.py #49

Open afarajian opened 5 years ago

afarajian commented 5 years ago

Hi, I am trying to mine parallel sentences from two large monolingual corpora (over 40M sentences each). In the first step I encoded the two sides, then called mine_bitexts.py to do the magic and extract the most probable sentence pairs. However, I ran into a memory issue, so I decided to load only the embeddings of the target side and, to keep the memory footprint minimal, encode one source sentence at a time and mine the candidates for that single sentence. But I still get the following error:

```
Faiss assertion 'err__ == cudaSuccess' failed in virtual void faiss::gpu::StandardGpuResources::initializeForDevice(int) at StandardGpuResources.cpp:168; details: CUDA error 2
go-align.sh: line 56: 23772 Aborted (core dumped) python ${LASER}/source/mine_bitexts.py
```

To reduce memory usage even further, I decreased the batch size so that at each step it reads only a small batch of target embeddings and compares the source embedding against them. But still no success. This issue seems to be related to FAISS, and I found the following thread in the FAISS issue tracker: https://github.com/facebookresearch/faiss/issues/231. But I couldn't find a solution that works for me. Any ideas about this? I am running my experiments on 4 Tesla K80 GPUs, and the corpora contain about 50-60M sentences each.

The only other solution I could think of is to split the target corpus into smaller shards of, say, 10M sentences and, for each source sentence, get its most probable candidate in each shard. Then I would go through the list of extracted candidates for each source sentence and return the best one as the most similar candidate. May I ask if you have ever faced this issue, and whether you have a better solution for it?
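The sharding-and-merging strategy described above can be sketched in plain numpy (a minimal illustration, assuming L2-normalized embeddings so that a dot product is cosine similarity; the function name `knn_sharded` is hypothetical, and the real mine_bitexts.py uses FAISS indexes rather than brute-force matrix products):

```python
import numpy as np

def knn_sharded(x, y, k, shard_size):
    """k-NN of each row of x against y, processing y in shards.

    Keeps only a running top-k per source row, so peak memory is
    bounded by one (n, shard_size) similarity matrix instead of the
    full (n, len(y)) matrix over the whole target corpus.
    """
    n = x.shape[0]
    best_sim = np.full((n, k), -np.inf, dtype=np.float32)
    best_ind = np.full((n, k), -1, dtype=np.int64)
    for start in range(0, y.shape[0], shard_size):
        shard = y[start:start + shard_size]
        sim = x @ shard.T  # cosine similarities for this shard only
        # global target indices for the shard's columns, one row per source
        ind = np.arange(start, start + shard.shape[0])[None, :].repeat(n, axis=0)
        # merge this shard's candidates with the running top-k
        cat_sim = np.concatenate([best_sim, sim], axis=1)
        cat_ind = np.concatenate([best_ind, ind], axis=1)
        order = np.argsort(-cat_sim, axis=1)[:, :k]
        best_sim = np.take_along_axis(cat_sim, order, axis=1)
        best_ind = np.take_along_axis(cat_ind, order, axis=1)
    return best_sim, best_ind
```

Because only the running top-k and one shard's similarities are held at a time, memory no longer scales with the size of the full target corpus.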

Thank you, Amin

afarajian commented 5 years ago

Just now I ran the code on just one gpu with a small batch and got the following error:

```
- perform 4-nn source against target
- perform 4-nn target against source
Traceback (most recent call last):
  File "/home/amin/NLP/tools/LASER//source/mine_bitexts.py", line 233, in <module>
    y2x_sim, y2x_ind = knn(y, x, args.neighborhood, args.gpu)
  File "/home/amin/NLP/tools/LASER//source/mine_bitexts.py", line 75, in knn
    return knnGPU(x, y, k) if use_gpu else knnCPU(x, y, k)
  File "/home/bertuser/NLP/tools/LASER//source/mine_bitexts.py", line 88, in knnGPU
    ind = np.zeros((x.shape[0], k), dtype=np.int64)
MemoryError
```

Any idea how to fix this?
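For context, a back-of-the-envelope estimate of the host memory involved (sentence count taken from the issue; 1024-dimensional float32 is LASER's embedding size; the exact allocations inside knnGPU beyond the one shown in the traceback are assumptions):

```python
n_sentences = 50_000_000  # one side of the corpus, per the issue
dim, k = 1024, 4          # LASER embedding size; 4-nn as in the log

emb_bytes = n_sentences * dim * 4  # float32 embeddings, roughly 190 GiB
ind_bytes = n_sentences * k * 8    # int64 neighbor indices (the failing alloc), ~1.5 GiB
sim_bytes = n_sentences * k * 4    # float32 similarity scores, ~0.75 GiB
```

So even though the `np.zeros` allocation that raises MemoryError is comparatively small, the embeddings for a single side already dwarf the RAM of most machines, which is why sharding the corpus is needed.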

hoschwenk commented 5 years ago

How much memory does your machine have? I'll try a similar configuration and get back to you in a couple of days.

afarajian commented 5 years ago

Well, I am using Tesla K80 machines with 11 GB of RAM.

NomadXD commented 3 years ago

@hoschwenk @PersianNLPer Any update? I'm having a similar OOM issue on a machine with similar specs.

avidale commented 1 year ago

The solution you propose (splitting the corpus into shards, performing mining on each shard separately, and merging the results) has been implemented in the stopes package for parallel corpus mining, released last year: https://github.com/facebookresearch/stopes/tree/main/stopes/pipelines/bitext#splitting-and-merging-languages