ccasimiro88 / TranslateAlignRetrieve

Python-based implementation of the Translate-Align-Retrieve method to automatically translate the SQuAD Dataset to Spanish.
MIT License

Testing multilingual branch (EN-DA) #14

Closed jacob-hein closed 3 years ago

jacob-hein commented 3 years ago

Hello Casimiro,

I've been testing the most recently updated multilingual branch and ran into many issues due to running the code on a Windows 10 machine. However, I eventually managed to install all the dependencies and tested the translation script on the dev-v1.1 dataset by running:

python src/retrieve/translate_squad.py --squad_file C:\Users\jacob\TranslateAlignRetrieve\src\tar\corpora\squad-en\dev-v1.1.json --output_dir C:\Users\jacob\squad-da --lang_target da --overwrite_cached_data --answers_from_alignment

Although I had to alter translate_squad.py slightly as my machine is not capable of utilizing NVIDIA/CUDA GPU processing, it seems to run as intended:

[Image: test_en_-_da]
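A minimal sketch of the kind of CPU fallback described above. It assumes the translation model runs on PyTorch, which the thread does not confirm, and `pick_device` is a hypothetical helper, not a function from translate_squad.py:

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    if not prefer_gpu:
        return "cpu"
    try:
        # Assumption: the translation model is a PyTorch model.
        import torch
    except ImportError:
        # No PyTorch installed, so there is nothing to query; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running translation on: {device}")
```

Selecting the device in one place avoids hard-coded `.cuda()` calls that fail on machines without an NVIDIA GPU.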

For dev-v1.1 the script requires ~4500 iterations, of which I completed 1% in about 6 minutes. Assuming the same CPU speed throughout, that means roughly 10 hours of running time to translate the dev-v1.1 dataset. For train-v1.1, ~32000 iterations are required, amounting to just under 3 days of computation time on my CPU.
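The estimate above can be reproduced with a short calculation. This is only a sketch: `estimate_hours` is a hypothetical helper, and it assumes a constant time per iteration, as the post does:

```python
def estimate_hours(minutes_elapsed: float, fraction_done: float) -> float:
    """Extrapolate total running time in hours from partial progress."""
    return minutes_elapsed / fraction_done / 60


# Figures quoted in the post: 1% of dev-v1.1 done after about 6 minutes.
dev_hours = estimate_hours(minutes_elapsed=6, fraction_done=0.01)

# Scale by the approximate iteration counts (~4500 for dev, ~32000 for train).
dev_iterations = 4500
train_iterations = 32000
train_hours = dev_hours * train_iterations / dev_iterations
train_days = train_hours / 24

print(f"dev-v1.1:   about {dev_hours:.1f} hours")
print(f"train-v1.1: about {train_hours:.1f} hours ({train_days:.2f} days)")
```

With these inputs the dev set comes out at about 10 hours and the train set at about 71 hours, which is just under 3 days.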

I'm curious about your machine's running speed for translate_squad.py on any of the SQuAD datasets when using your GPU. As I lack access to a GPU machine, I will test running translate_squad.py in a GPU-enabled Google Colab notebook and report back.

All the best, Jacob

jacob-hein commented 3 years ago

Hi again.

From testing translate_squad.py in a GPU-enabled Google Colab notebook, I find that dev-v1.1 is translated from English to Danish (en-da) in roughly 4 hours.

Unfortunately, I'm unable to run the /compute_alignment.sh script through Google Colab. translate_squad.py invokes the script with Python's subprocess.run, whereas shell commands in a Google Colab cell are invoked with !cmd. Because the shell call is made inside translate_squad.py, I'm struggling to find a workaround.

Sorry if this is confusing. I was wondering if you would consider making translate_squad.py more flexible for non-Linux users. Specifically, if there is a way to run align.py from within translate_squad.py instead of running /compute_alignment.sh, that may prove very valuable to some users.
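For reference, here is a minimal sketch of calling a shell script from Python with subprocess.run. `run_shell_script` is a hypothetical helper, not the code in translate_squad.py. subprocess.run is part of the Python standard library and does not rely on the notebook's !cmd syntax, so a failure at this step may stem from a missing interpreter, a wrong path, or file permissions rather than from the cell environment:

```python
import shutil
import subprocess


def run_shell_script(script_path, *args):
    """Run a shell script through bash and return its standard output."""
    bash = shutil.which("bash")
    if bash is None:
        # Typical on plain Windows installs without WSL or Git Bash.
        raise RuntimeError("bash was not found on PATH")
    result = subprocess.run(
        [bash, str(script_path), *map(str, args)],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the script to bash explicitly avoids depending on the executable bit or the shebang line. With `check=True`, a failing script raises an exception whose `stderr` attribute usually shows the underlying cause.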

All the best, Jacob

ccasimiro88 commented 3 years ago

Hi Jacob,

  1. About performance: Unfortunately, translation on CPU is infeasible for a dataset as large as the SQuAD training set. However, using a GeForce RTX 2080 with 12 GB of RAM, it took almost 3 hours.

  2. About non-Linux extensibility: I need a Python-based implementation of an alignment model to make the code independent of Linux. I am currently using the eflomal alignment model since it proved to be one of the fastest and most accurate. So, the alternative implementation should perform at least as well as eflomal, since the alignment component is crucial for the TAR method. I will check whether there are any valid Python solutions.
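To illustrate why the alignment step matters, here is a toy sketch of projecting an answer span from source tokens to target tokens through word alignments. It assumes alignments in the common "i-j" (Pharaoh) text format. The sentence pair, the alignment string, and both helper functions are made up for illustration and are not the repository's implementation:

```python
def parse_alignment(line):
    """Parse 'i-j' pairs into a list of (source_index, target_index) tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def project_span(alignment, src_start, src_end):
    """Map an inclusive source token span to the target span that covers
    every target token aligned to it. Returns None if nothing is aligned."""
    targets = [t for s, t in alignment if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Hand-written toy example (English to Danish).
source = "the cat sat on the mat".split()
target = "katten sad på måtten".split()
alignment = parse_alignment("0-0 1-0 2-1 3-2 4-3 5-3")

# The answer "the mat" covers source tokens 4..5.
start, end = project_span(alignment, 4, 5)
print(" ".join(target[start:end + 1]))  # prints "måtten"
```

A single wrong alignment link can shift or widen the projected span, which is one way a weaker aligner would degrade the retrieved answers.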

In the meantime, to provide you with the datasets as soon as possible, I am going to generate the SQuAD-da train and dev sets and push them to the repo.

Regards, Casimiro

jacob-hein commented 3 years ago

You're a lifesaver, Casimiro. I'm not sure how to properly thank you.

I've been stuck for a while trying to get the eflomal module to work in my Google Colab notebook environment. I seem to be unable to load the module attributes correctly. [Image: Google Colab]

I was, however, able to install the eflomal module in a virtual env on my Windows 10 machine using the Ubuntu Terminal, and there all the module attributes load just fine. [Image: Windows 10]
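A quick way to compare the two environments is to list what a module exposes after import. This is a generic sketch using only the standard library; `public_attributes` is a hypothetical helper and makes no assumption about which names eflomal provides:

```python
import importlib


def public_attributes(module_name):
    """Return the sorted public names of a module, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return sorted(name for name in dir(module) if not name.startswith("_"))


attrs = public_attributes("eflomal")
print("eflomal is not importable here" if attrs is None else attrs)
```

Running the same snippet in both environments and comparing the output shows whether the package is missing entirely or only partially installed.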

Unfortunately, translating these datasets on my Windows 10 machine using only the CPU is infeasible.

So I suppose a GPU-enabled Linux machine is preferable for this kind of work :-)

All the best and thanks again, Jacob

ccasimiro88 commented 3 years ago

Hi @jacobshein

No worries, I am happy that my project is turning out to be valuable for your work.

Here are the translated datasets: squads-tar/da

Try to train a QA model and let me know how it goes!

Regards, Casimiro