bitextor / bicleaner-ai

Bicleaner fork that uses neural networks
GNU General Public License v3.0

[Usability Question] Low resource language - Abkhazian #16

Closed danielinux7 closed 2 years ago

danielinux7 commented 2 years ago

Hello, I need your advice on usability of bicleaner ai in my situation.

I have an Abkhazian-Russian parallel corpus of around 100k sentences with an 85% accuracy rate (15% are wrong translations).

I have 1.2 million monolingual Abkhazian sentences.

I have an Abkhazian-Russian automatic translator with a 15% accuracy rate (out of 100 sentences, only 15 are translated correctly).

Is it possible to build an effective Bicleaner model? What can I expect?

Best Regards, Nart

ZJaume commented 2 years ago

Hi @danielinux7,

Extracting about 10k or 20k sentences that you know are mostly correct from that corpus should be enough to build a decent Bicleaner AI full model. But if your plan is to use that model to clean a low-resource corpus that is already 85% good sentences, the impact on machine translation might be very small or even negative. Bicleaner AI shines on data that has significantly larger amounts of noise. That said, you could try to train it, then score the corpus and manually inspect the scores to see if it's possible to combine them with handcrafted heuristics.
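One way to do the manual inspection mentioned above is to group the scored pairs into score buckets and read a few random examples from each. The following is only a minimal sketch: it assumes the scored file is tab-separated with the score in the last column, and the function name, bucket count, and sample size are illustrative choices, not part of Bicleaner AI.

```python
import random
from collections import defaultdict


def sample_by_score_bucket(lines, n_buckets=10, per_bucket=5, seed=0):
    """Group scored sentence pairs into equal-width score buckets and
    return a few random rows from each bucket for manual review.

    Assumption: each line is tab-separated (source, target, ..., score)
    and the LAST column is a score between 0 and 1.
    """
    buckets = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 3:
            continue  # not a scored sentence pair
        try:
            score = float(cols[-1])
        except ValueError:
            continue  # last column is not a number
        # Clamp so that out-of-range scores still land in a valid bucket.
        index = max(0, min(int(score * n_buckets), n_buckets - 1))
        buckets[index].append(cols)
    rng = random.Random(seed)
    return {
        index: rng.sample(rows, min(per_bucket, len(rows)))
        for index, rows in sorted(buckets.items())
    }
```

Reading the samples from low to high buckets gives a rough idea of where good translations start to dominate, which is the kind of evidence a handcrafted heuristic or threshold could be based on.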

Other things you could do to improve the model:

--save_train_data SAVE_TRAIN_DATA: Save the generated dataset into a file. If the file already exists the training dataset will be loaded from there. (default: None)

Best, Jaume

danielinux7 commented 2 years ago

@ZJaume your feedback is appreciated.

danielinux7 commented 2 years ago

@ZJaume I have a question. My understanding is that a fine-tuned XLMRoberta is being used for bicleaner-ai. The Abkhazian language is not part of XLMRoberta. I can still tokenize with their SPM, but it does too much splitting. Also, it seems XLMRoberta tokenizes raw text directly; there is no mention of sacremoses in their paper. I also managed to further clean up the parallel corpus: there are now 72k Abkhazian-Russian sentences with a ~95% accuracy rate.

Can I still train a decent bicleaner-ai in this situation?

ZJaume commented 2 years ago

Abkhazian could work. We've been using a Maltese model (Maltese is also not in XLMR) and it worked pretty decently.

95% accuracy is very good for Bicleaner AI training data. But, as far as I understood, your 72k corpus is the only resource you have? What exactly is your purpose after training a Bicleaner AI model? If you don't have other data available to filter, I don't know how you can improve your MT system.

EDIT: if you see that Abkhazian words are being split too much by SPM, you could increase the maximum length of the model to avoid truncating the sentences too early.
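To judge how much truncation a given maximum length would cause, one could count tokens per sentence and see what fraction exceeds each candidate limit. This is a rough sketch only: `truncation_report` and the candidate limits are hypothetical, and it counts single sentences, whereas the model input is a sentence pair, so both sides would need to be taken into account in practice.

```python
def truncation_report(sentences, tokenize, max_lengths=(128, 256, 512)):
    """Return, for each candidate maximum length, the fraction of
    sentences whose token count exceeds it.

    `tokenize` is any callable mapping a string to a list of tokens.
    """
    counts = [len(tokenize(sentence)) for sentence in sentences]
    if not counts:
        return {limit: 0.0 for limit in max_lengths}
    return {
        limit: sum(count > limit for count in counts) / len(counts)
        for limit in max_lengths
    }


# Whitespace splitting is only a stand-in tokenizer for this demo.
print(truncation_report(["a b c", "a b c d e", "a"], str.split, max_lengths=(2, 4)))
```

For a realistic estimate, `tokenize` would be the subword tokenizer itself, for example the `tokenize` method of the XLM-R tokenizer loaded through Hugging Face `transformers`, since heavy subword splitting is exactly what inflates the counts.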

danielinux7 commented 2 years ago

> your 72k corpus is the only resource you have?

No, I have around 100k-300k, but it's noisy. I had to distill it down to 72k in order to get the 95% accuracy. There are more resources that I haven't tapped into yet, such as more eBooks and websites. I also have a clean monolingual corpus of 1.2 million sentences.

> What exactly is your purpose after training a Bicleaner AI model?

  1. I need a tool to help me filter current and new text.
  2. I am hoping to use it with back translation. If I could reduce noise in the synthetic corpus, that could improve the MT.
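The noise-reduction step for a synthetic (back-translated) corpus could look roughly like the sketch below. It assumes the corpus has already been scored and stored as tab-separated lines with the score in the last column; the function name and the 0.5 threshold are illustrative and would need to be chosen by inspecting the scores.

```python
def filter_by_score(lines, threshold=0.5):
    """Yield rows (lists of columns) whose score is at least `threshold`.

    Assumption: each line is tab-separated and its LAST column is a
    numeric score; lines without a numeric score are skipped.
    """
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        try:
            score = float(cols[-1])
        except ValueError:
            continue
        if score >= threshold:
            yield cols


# Tiny made-up example: only the first pair passes the threshold.
rows = ["src1\ttgt1\t0.91\n", "src2\ttgt2\t0.12\n", "src3\ttgt3\tn/a\n"]
for kept in filter_by_score(rows, threshold=0.5):
    print("\t".join(kept))
```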

> to avoid truncating the sentences too early.

Thanks for the heads up.

ZJaume commented 2 years ago

In case you don't know, there is Bitextor and Bitextor neural for harvesting parallel corpora from websites. I imagine that for a language like Abkhazian, which has poor support in NLP tools, Bitextor neural would be a good fit because it performs very few language-specific operations and uses multilingual models for alignment.

danielinux7 commented 2 years ago

> there is Bitextor and Bitextor neural for harvesting parallel corpora from websites

That is definitely something I should be looking at.

danielinux7 commented 2 years ago

@ZJaume I have a few questions regarding training on Kaggle.

  1. The notebook stops because it runs out of memory. Looking at the GPU status, it doesn't seem to be using the GPU. Any idea why that is?
  2. Which parameter should I use to increase the maximum length of the model?
  3. I can only train in 9-hour sessions on Kaggle. Can I resume training every 9 hours or so?

danielinux7 commented 2 years ago

@ZJaume Can you help train a bicleaner-ai model for Abkhazian-Russian? Then we could share it here. I lack the resources for training, and you have done this a zillion times, so you should get it right instantly.

I can provide you with the parallel corpus that we have. I think I can generate a synthetic corpus of 1.4 million sentences with 25-30% accuracy. I would really like to try bicleaner-ai for Abkhazian and see the results.

danielinux7 commented 2 years ago

I have trained a model on Kaggle. The results do correlate, so this tool actually works! Now I have to improve its accuracy. Here is a link to the Kaggle notebook, if someone else wants to train on Kaggle.

ZJaume commented 2 years ago

Hi, the slowness is probably because the TensorFlow version doesn't match the installed CUDA version. If you need to check it, run bicleaner-ai with TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-train and the TF logs will tell you whether it's able to find the CUDA libraries. If not, it runs anyway but on the CPU, which would explain the slowness.
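A quick complementary check, run in the same environment (for example a notebook cell), is to ask TensorFlow directly which GPUs it can see. This is just a sketch; `gpu_report` is a hypothetical helper, and it only relies on TensorFlow's `tf.config.list_physical_devices`.

```python
import importlib.util


def gpu_report():
    """Return a one-line summary of the GPUs TensorFlow can see."""
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow is not installed in this environment"
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return f"TensorFlow {tf.__version__} sees no GPU; training would run on the CPU"
    names = ", ".join(gpu.name for gpu in gpus)
    return f"TensorFlow {tf.__version__} sees {len(gpus)} GPU(s): {names}"


print(gpu_report())
```

If no GPU is reported even though the machine has one, the TensorFlow logs described above are the place to look for the missing CUDA libraries.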

Regarding training Bicleaner models: if you want me to train models for you, we can offer that service at Prompsit. If you are interested, please email me at jzaragoza@prompsit.com with some information about who you are and what project you are working on.

danielinux7 commented 2 years ago

I'll keep that in mind, thank you.