facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Unsupervised alignment issue #1165

Open Gabriv93 opened 3 years ago

Gabriv93 commented 3 years ago

Hello, I'm using fastText for a research project. I use unsupervised word embedding alignment to compare two corpora of English-language texts whose words are suffixed with "_a" and "_b" (for example, the word "Commission" appears as "Commission_a" in corpus A and as "Commission_b" in corpus B), together with a lexicon of 100 safe correspondences. Corpus A has 13308 words and corpus B has 8237 words. I run "unsup_align.py" from a sh script like this:

```sh
echo "Unsupervised Example based on the 1-dimension clusters alignment"

if [ ! -d Aligned_Models/ ]; then mkdir -p Aligned_Models; fi

lexicon=./feedbacks/7937377/analysis/lexab.txt
model_src=Embedding_Models/corpus_a_clean.txt
model_tgt=Embedding_Models/corpus_b_clean.txt
output_src=Aligned_Models/corpus_a_clean_uns-aligned
output_tgt=Aligned_Models/corpus_b_clean_uns-aligned

python3 ../../GitHub/fastText/alignment/unsup_align.py --model_src "${model_src}" --model_tgt "${model_tgt}" --lexicon "${lexicon}" \
  --output_src "${output_src}" --output_tgt "${output_tgt}" --lr 25 --niter 10
```

When I run the script, the initial mapping with convex relaxation is computed successfully, but the Wasserstein Procrustes step then raises an IndexError while indexing the first (source) matrix:

```
Unsupervised Example based on the 1-dimension clusters alignment

Wasserstein Procrustes

Loading vectors from Embedding_Models/corpus_a_clean.txt
13308 word vectors loaded
Loading vectors from Embedding_Models/corpus_b_clean.txt
8237 word vectors loaded
Coverage of source vocab: 1.0000

Computing initial mapping with convex relaxation...
6.556953872272116
Done [180 sec]

Computing mapping with Wasserstein Procrustes...
Traceback (most recent call last):
  File "../../GitHub/fastText/alignment/unsup_align.py", line 98, in <module>
    nepoch=args.nepoch, reg=args.reg, nmax=args.nmax)
  File "../../GitHub/fastText/alignment/unsup_align.py", line 45, in align
    xt = X[np.random.permutation(nmax)[:bsz], :]
IndexError: index 16926 is out of bounds for axis 0 with size 13308
```

What's going wrong? Why is it trying to access index 16926, which is obviously out of bounds for both corpora? Thank you for your help.
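The failing line from the traceback can be reproduced in isolation, which may help narrow things down. The sketch below is not the fastText code: `sample_batch` and `sample_batch_clamped` are hypothetical helpers, and the values `nmax = 20000`, `bsz = 500`, and the embedding dimension 4 are assumptions chosen for illustration. The traceback only shows that `nmax` must be larger than 16926. The sketch assumes the error arises because `np.random.permutation(nmax)` yields indices up to `nmax - 1`, regardless of how many rows the matrix has.

```python
import numpy as np


def sample_batch(X, nmax, bsz, rng):
    """Mimic the failing line: pick bsz row indices out of range(nmax)."""
    idx = rng.permutation(nmax)[:bsz]
    return X[idx, :]  # raises IndexError if any index >= X.shape[0]


def sample_batch_clamped(X, nmax, bsz, rng):
    """Same sampling, but never draw indices beyond the rows available."""
    n = min(nmax, X.shape[0])
    idx = rng.permutation(n)[:bsz]
    return X[idx, :]


rng = np.random.default_rng(0)

# Matrices with the vocabulary sizes reported in the log.
X = np.zeros((13308, 4))  # corpus A (dimension 4 is arbitrary)
Y = np.zeros((8237, 4))   # corpus B

nmax = 20000  # assumed value, larger than both vocabularies
bsz = 500     # assumed batch size

try:
    sample_batch(X, nmax, bsz, rng)
except IndexError as err:
    print("IndexError:", err)

# Clamping to the number of rows avoids the out-of-bounds access.
print(sample_batch_clamped(X, nmax, bsz, rng).shape)
print(sample_batch_clamped(Y, nmax, bsz, rng).shape)
```

Since the traceback shows `nmax=args.nmax` being passed to `align`, the value appears to come from a command-line argument. If so, setting it no higher than the smaller vocabulary (8237 here) might avoid the error, though this has not been verified against the script.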