facebookresearch / LASER

Language-Agnostic SEntence Representations

Using LASER 2 and LASER 3 to filter good quality sentence pairs from a webcrawled parallel dataset #229

Closed vmenan closed 1 year ago

vmenan commented 1 year ago

Hi, for my research on low-resource languages native to Sri Lanka (sin_Sinh and tam_Taml), we are following the great work done in the NLLB paper by Facebook research. Apart from the NLLB mined dataset, we have found a few more crawled parallel corpora (e.g. eng_Latn-sin_Sinh). Since we are involving translators to translate and increase the quality of these parallel corpora, I am tasked with filtering out the sentence pairs which may already have a good-quality alignment. For this, I thought of following these steps, using eng_Latn-sin_Sinh as the example:

  1. Embed eng_Latn using LASER2 and embed sin_Sinh using LASER3 (the encoder specific to Sinhala).
  2. Use xsim to get the alignment score for each sentence pair.
  3. Filter sentence pairs above a specified threshold (maybe the one used in the NLLB paper).

First and foremost, do the steps I'm following sound good? I'm struggling with step (1). Is there a way to import the LASER2 and LASER3 models into a Python script or Jupyter notebook, like from the Hugging Face hub? (I did visit embed.py in the source folder; it seems to involve a number of preprocessing steps, so I wanted to see whether there was a pipeline which did all of these.) Since my datasets are small, it's much easier to build the above-mentioned steps in a notebook and test them. Would appreciate any help regarding this. Thank you in advance!

heffernankevin commented 1 year ago

Hi @vmenan! I think your steps sound reasonable, but perhaps xsim is not the best tool for this. Another option here could be to use a cosine score between each English and Sinhala pair, with a threshold you deem best based on your data.
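A minimal numpy sketch of that cosine-scoring idea (`eng_emb` / `sin_emb` are placeholder names for two row-aligned embedding matrices, and the 0.75 threshold is purely illustrative):

```python
import numpy as np

def cosine_scores(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two row-aligned embedding matrices."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return (src * tgt).sum(axis=1)

# Example: keep only pairs scoring above a threshold tuned on your data.
# scores = cosine_scores(eng_emb, sin_emb)
# keep_mask = scores >= 0.75  # 0.75 is a placeholder, not a value from the paper
```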

Re: embedding LASER2 and 3 models, the best way currently is to use the embed.sh script here. This will take care of all the necessary preprocessing steps. Since your datasets are small, you can just embed them once from the command line and then load the embeddings into your notebook to calculate the cosine scores.
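For the loading step: the embedding files written by embed.sh are raw float32 binaries with embedding dimension 1024, so reading them back in a notebook is short (a sketch; the file names are placeholders):

```python
import numpy as np

EMB_DIM = 1024  # LASER2/LASER3 sentence embeddings are 1024-dimensional

def load_embeddings(path: str, dim: int = EMB_DIM) -> np.ndarray:
    """Load a raw float32 embedding file produced by embed.sh."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, dim)

# eng_emb = load_embeddings("eng_Latn.emb")  # placeholder paths
# sin_emb = load_embeddings("sin_Sinh.emb")
```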

Hope this helps !!

vmenan commented 1 year ago

Hi @heffernankevin, thank you very much for your reply. The reason I wanted to go with xsim was that the NLLB paper used xsim to get the alignment score. Since my dataset is also from ParaCrawl and opencrawl, the parallel sentences aren't perfect, so I thought xsim would be better. But you are right, I will try cosine similarity as well.

Re: "embedding LASER2 and 3 models, the best way currently is to use the embed.sh script here." Understood, i thought the pipeline was designed having monolingual datasets in mind, since i have the parralel sentences i thought i can by pass few of the steps.

Another thought was to import embed.py, which embed.sh utilizes, but there are a few classes for loading the encoders, e.g. "LaserLSTMEncoder" and "LaserTransformerEncoder". I'm not sure which class to import if I were to do this manually.

I will try embed.sh and keep this thread posted. Once again, thank you for your quick response!

heffernankevin commented 1 year ago

Hi @vmenan, yes, for your use case embed.sh might be easier. It is designed for monolingual datasets, so I would recommend you first split the dataset into two files (of the same length) for English <> Sinhala. For example, if you have a .tsv file with two columns (English, Sinhala):

awk 'BEGIN{FS="\t"}{print $1}' alignment_file > eng_Latn.mono
awk 'BEGIN{FS="\t"}{print $2}' alignment_file > sin_Sinh.mono
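If you'd rather do the split inside the notebook, an equivalent sketch in plain Python (it assumes the first column is English, the second is Sinhala, and the same file names as the awk example above):

```python
# Split a two-column TSV of aligned pairs into two monolingual files.
with open("alignment_file", encoding="utf-8") as pairs, \
     open("eng_Latn.mono", "w", encoding="utf-8") as eng, \
     open("sin_Sinh.mono", "w", encoding="utf-8") as sin:
    for line in pairs:
        en, si = line.rstrip("\n").split("\t", 1)
        eng.write(en + "\n")
        sin.write(si + "\n")
```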

vmenan commented 1 year ago

Hi @heffernankevin, thank you so much for your guidance. Yes, I was able to create .mono files for English and Sinhala, and I successfully got embeddings for English. But for Sinhala I'm getting a runtime error: "Mask Type should be defined". English worked well (maybe because it's part of LASER2), but for Sinhala I had to download a LASER3 encoder specific to Sinhala. To get the embeddings I used the following commands:

english: ./embed.sh [INPUT FILE] [OUTPUT FILE]
sinhala: ./embed.sh [INPUT FILE] [OUTPUT FILE] sin_Sinh

Are there any other arguments I need to pass in?

Also, when I downloaded LASER2 using download.sh, three files were downloaded: a ".pt", an ".spm", and a ".cvocab". But for Sinhala only a ".pt" file was downloaded. Is that okay?

heffernankevin commented 1 year ago

@vmenan what version of fairseq are you using? I might recommend version 0.12.1 e.g. pip install fairseq==0.12.1

vmenan commented 1 year ago

@heffernankevin you are right. I finally got it working. My observations are as follows:

I was able to get the embeddings successfully. I really appreciate your help with this.

heffernankevin commented 1 year ago

@vmenan great!! Another option, instead of altering the Python code, is to add the --cpu argument to the embed.sh script you've been using.

vmenan commented 1 year ago

Brilliant, that should keep the Python code intact. Thank you very much @heffernankevin for your guidance!