Translation - Githubissues

hamedwaezi01 commented 1 year ago

Hi @reemjalaleddine, you can post your progress, findings, or anything else here.

reemjalaleddine commented 1 year ago

noted

reemjalaleddine commented 1 year ago

Today I worked on installing the mamba package manager; it took some time because of a few errors. I am now working on the toy data but running into some issues, which I will raise with Hamed if I cannot resolve them. I also have some questions for tomorrow.

hamedwaezi01 commented 11 months ago

Hi, I got the latest Jupyter notebook from Rim and ran it on our messages. It works, but slowly. The same speed issue came up in the LADy project, where I proposed using a library that optimizes inference (forward propagation). PyTorch and TensorFlow have to handle backpropagation as well as forward propagation, so they are not tuned for inference alone. The CTranslate2 library converts models to an optimized format using several approaches (I haven't studied it in full yet, but it is minimal). One of the optimizations is quantization, which changes the precision of the weights. The NLLB model downloaded from HuggingFace stores its weights in float32, and we forced the converter to keep float32 as well so that no precision is lost. The conversion itself takes some time, depending on the model size. I used the 3.3B NLLB model and changed Rim's code to use CTranslate2. Right now, I am running inference (forward and backward translation) for two languages, French and Simplified Chinese.
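
For readers following along, the conversion and inference steps described above might look roughly like the sketch below. It is not the code used in this issue: the function names, the `nllb-ct2` output directory, and the `cpu` device are assumptions made for illustration, and the calls follow the NLLB example in the CTranslate2 documentation. The heavy imports sit inside the functions so the small helper can be checked without downloading a model.

```python
from typing import List


def strip_lang_token(tokens: List[str], lang_code: str) -> List[str]:
    """Drop the leading target-language token that NLLB emits first."""
    if tokens and tokens[0] == lang_code:
        return tokens[1:]
    return list(tokens)


def convert_model(hf_name: str = "facebook/nllb-200-3.3B",
                  output_dir: str = "nllb-ct2") -> None:
    """Convert a HuggingFace checkpoint to the CTranslate2 format.

    quantization="float32" keeps the original precision, as in the thread.
    output_dir is a hypothetical path chosen for this sketch.
    """
    import ctranslate2  # imported lazily: only needed for real conversion

    converter = ctranslate2.converters.TransformersConverter(hf_name)
    converter.convert(output_dir, quantization="float32")


def translate(texts: List[str], src_lang: str, tgt_lang: str,
              model_dir: str = "nllb-ct2",
              hf_name: str = "facebook/nllb-200-3.3B") -> List[str]:
    """Translate a batch of sentences with a converted NLLB model."""
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cpu")
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        hf_name, src_lang=src_lang)

    # CTranslate2 works on token strings, not token ids.
    source = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t))
              for t in texts]
    # The target language is forced by giving it as the first output token.
    results = translator.translate_batch(
        source, target_prefix=[[tgt_lang]] * len(source))

    outputs = []
    for result in results:
        tokens = strip_lang_token(result.hypotheses[0], tgt_lang)
        outputs.append(
            tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    return outputs
```

With a converted model in place, a forward pass would be `translate(messages, "eng_Latn", "fra_Latn")` and the backward pass `translate(french, "fra_Latn", "eng_Latn")`; NLLB uses `zho_Hans` for Simplified Chinese.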

hosseinfani commented 11 months ago

@hamedwaezi01 please post a link here to the relevant line of code.

reemjalaleddine commented 11 months ago

My update so far:

1) After I installed mamba, Hamed explained the available dataset and the translation goals to me. I ran a Python script to translate one sentence using the HuggingFace translation model 'facebook/m2m100_418M':

pipe("This is my first text from english to arabic", forced_bos_token_id=pipe.tokenizer.get_lang_id(lang="ar"))

2) After successfully translating it, I created a CSV file with a dummy conversation and translated several lines from English into several languages, based on the language dictionary I created:

lang_library = { "fr": "French", "ar": "Arabic", "es": "Spanish" }

The code iterates over each language in the dictionary; for each language it iterates over the message lines, calls the translation function on each one, and then performs back-translation:

backtranslated_messages = translation_pipe(translated_texts, trgt_lang_id, forced_bos_token_id=src_lang_id)

The results are then written to a CSV file.

3) I then switched to batch translation instead of translating line by line:

length = 3
batch_size = 2

for batch_start in range(0, length, batch_size):
    batch_end = min(batch_start + batch_size, length)
    message_batch = original_txt["text"].iloc[batch_start:batch_end].tolist()

4) After the code ran successfully on the dummy conversation, I ran it on the toy dataset, but only over a small number of batches because of my PC's limitations. I also included the CSV columns that identify each message. I then handed the code over to Hamed to test on the whole dataset (see Hamed's update above).
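
The four steps above can be sketched as a single loop. This is a minimal illustration, not the notebook itself: `round_trip`, `iter_batches`, `write_rows`, the column names, and the `fake_translate` stand-in are all hypothetical, and the real translation call (the HuggingFace pipeline or CTranslate2) would be passed in where `fake_translate` is used.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List

# A translate function takes (texts, source_lang, target_lang).
Translate = Callable[[List[str], str, str], List[str]]


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def round_trip(messages: List[str], languages: Dict[str, str],
               translate: Translate, src_lang: str = "en",
               batch_size: int = 2) -> List[Dict[str, str]]:
    """Translate every message into every language and back again."""
    rows = []
    for lang in languages:
        for batch in iter_batches(messages, batch_size):
            forward = translate(batch, src_lang, lang)
            backward = translate(forward, lang, src_lang)
            for original, fwd, back in zip(batch, forward, backward):
                rows.append({"lang": lang, "original": original,
                             "translated": fwd, "backtranslated": back})
    return rows


def write_rows(rows: List[Dict[str, str]], handle) -> None:
    """Write the result rows to an open text file as CSV."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["lang", "original", "translated", "backtranslated"])
    writer.writeheader()
    writer.writerows(rows)


def fake_translate(texts: List[str], src: str, tgt: str) -> List[str]:
    """Stand-in for a real model: tags each text with the direction."""
    return [f"[{src}->{tgt}] {text}" for text in texts]


if __name__ == "__main__":
    lang_library = {"fr": "French", "ar": "Arabic", "es": "Spanish"}
    rows = round_trip(["Hello", "How are you?", "Bye"],
                      lang_library, fake_translate)
    buffer = io.StringIO()
    write_rows(rows, buffer)
    print(buffer.getvalue())
```

Passing the translator in as an argument keeps the batching and CSV logic independent of the model, so the same loop can be tried on a small machine with the stand-in and later run with the real model on the full dataset.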

reemjalaleddine commented 11 months ago

Note that this week I will be focused mainly on my thesis defense. Unfortunately, I will also be out of town for two days next week.

fani-lab / Osprey

Translation #37