fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

Adding a new translation model to the pipeline #66

Open farinamhz opened 6 months ago

farinamhz commented 6 months ago

In this phase, we plan to integrate an additional translation model to determine if the observed improvements are consistent across various translators or if there's potential for further enhancement.

farinamhz commented 6 months ago

[To be updated]

Based on my research into the available translation models and what we discussed in our previous translation-model issue (https://github.com/fani-lab/LADy/issues/24), several libraries provide access to the Google Translate API, which I expect to work better than the pretrained models.

I previously tested several libraries that claimed to offer access to the Google Translate API, but none of them worked for our use case. Here's a brief overview of my findings:

  1. The googletrans library eventually blocks access from a remote host after processing approximately 1,000 reviews.
  2. The translatepy library lacks a batch translation function, and sending individual requests for a large number of reviews seems impractical.
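For reference, a minimal sketch of how a library with only a single-text call could be wrapped to process many reviews in chunks with a pause between them. `translate_in_chunks`, `translate_one`, `chunk_size`, and `pause_seconds` are hypothetical names for illustration, not part of googletrans or translatepy, and I have not verified that pausing actually avoids the blocking described above.

```python
import time
from typing import Callable, List


def translate_in_chunks(
    texts: List[str],
    translate_one: Callable[[str], str],
    chunk_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Translate texts one at a time, pausing between chunks.

    translate_one is any single-text translation function (hypothetical here).
    sleep is injectable so the pause can be replaced in tests.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    results: List[str] = []
    for start in range(0, len(texts), chunk_size):
        if start > 0:
            # Pause before every chunk except the first one.
            sleep(pause_seconds)
        chunk = texts[start:start + chunk_size]
        # Output order matches input order.
        results.extend(translate_one(text) for text in chunk)
    return results
```

This still sends one request per review, so it only spreads the load over time; it does not reduce the number of requests.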

As a result, I continued my search and discovered the deep-translator library, which has proven to be effective with our toy dataset. I plan to further test this library with the full dataset versions.
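A rough sketch of how the backtranslation step might look with deep-translator, assuming the package is installed and that its `GoogleTranslator(source=..., target=...)` class and `translate_batch` method behave as documented. `backtranslate` and `make_google_translator` are hypothetical helper names and are not taken from the LADy codebase; the pivot language in the usage comment is only an example.

```python
from typing import Callable, List

# A batch translator takes a list of texts and returns their translations in order.
BatchTranslate = Callable[[List[str]], List[str]]


def backtranslate(
    reviews: List[str],
    forward: BatchTranslate,
    backward: BatchTranslate,
) -> List[str]:
    """Translate reviews to a pivot language and back to the source language."""
    pivot = forward(reviews)
    restored = backward(pivot)
    if len(restored) != len(reviews):
        # Each backtranslated review must stay aligned with its original.
        raise ValueError("translator returned a different number of texts")
    return restored


def make_google_translator(source: str, target: str) -> BatchTranslate:
    """Build a batch translator backed by deep-translator (assumed installed)."""
    # Imported here so the rest of the sketch works without the package.
    from deep_translator import GoogleTranslator

    return GoogleTranslator(source=source, target=target).translate_batch


# Example usage (needs network access, so it is left commented out):
# to_pivot = make_google_translator("en", "fr")
# to_english = make_google_translator("fr", "en")
# augmented = backtranslate(["The food was great."], to_pivot, to_english)
```

Passing the translators in as plain functions keeps the backtranslation logic independent of any one library, which should make it easier to swap in another translator for comparison.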

In the meantime, I looked into whether FLAN would be a good option. Unfortunately, it does not currently suit our needs, mainly because the backtranslation step also requires tokenizing and handling translation in the reverse direction. Although FLAN excels at translating into English, it is weaker at translating from English into other languages, as the authors note in the paper. This limitation is attributed to the use of English-specific tokenizers.