danielinux7 closed this issue 2 years ago
Hi @danielinux7,
Extracting about 10k or 20k sentences that you know most are correct from that corpus should be enough to build a decent Bicleaner AI full model. But, if your plan is to clean with that model a low-resource corpus that is 85% good sentences, the impact on machine translation might be very little or even negative. Bicleaner AI shines more on data that has significant larger amounts of noise. That said, you could try to train it. Then score the corpus and manually inspect the scores to see if it's possible to combine them with handcrafted heuristics.
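To make that last step concrete, here is a minimal sketch of combining a Bicleaner AI score with one handcrafted heuristic. It assumes the scored corpus is a tab-separated file with the score in the third column; the sample data, the 0.5 score threshold, and the length-ratio bounds are all placeholders to adapt to your own inspection of the scores:

```shell
# Toy scored file: source<TAB>target<TAB>score (stand-in for real scorer output).
printf 'good source\tgood target\t0.92\nbad pair\tx\t0.10\nlongish source sentence\tshort\t0.80\n' > scored.tsv

# Keep pairs with score >= 0.5 AND a source/target character-length
# ratio within [0.5, 2.0] (a simple handcrafted heuristic).
awk -F'\t' '$3 >= 0.5 {
  ls = length($1); lt = length($2)
  if (lt > 0 && ls/lt >= 0.5 && ls/lt <= 2.0) print
}' scored.tsv > filtered.tsv

cat filtered.tsv
```

Any other heuristic (language identification, punctuation balance, etc.) can be added as extra conditions in the same pass.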
Other things you could do to improve the model:
Use the --save_train_data option to append human-annotated data, if you have it, to the training corpus and re-train the model:
--save_train_data SAVE_TRAIN_DATA: Save the generated dataset into a file. If the file already exists, the training dataset will be loaded from there. (default: None)
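A sketch of that workflow. The bicleaner-ai-train invocations are commented out with their other flags omitted, and the tab-separated layout of the saved dataset is an assumption here; inspect the real file and match its columns before appending anything:

```shell
# First run: generate and save the synthetic training dataset.
# bicleaner-ai-train ... --save_train_data train_data.tsv

# Stand-ins for the saved dataset and a small human-annotated file
# (assumed layout: tab-separated fields, label in the last column).
printf 'old src\told tgt\t0\n' > train_data.tsv
printf 'src sentence\ttgt sentence\t1\n' > annotated.tsv

# Append the annotated pairs to the saved dataset.
cat annotated.tsv >> train_data.tsv

# Re-running with the same --save_train_data path loads the
# (now extended) dataset instead of regenerating it:
# bicleaner-ai-train ... --save_train_data train_data.tsv
wc -l < train_data.tsv
```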
Best, Jaume
@ZJaume your feedback is appreciated.
@ZJaume I have a question. My understanding is that a fine-tuned XLM-RoBERTa is being used for bicleaner-ai. The Abkhazian language is not part of XLM-RoBERTa, but I can still tokenize with its SPM; it does too much splitting, though. Also, it seems XLM-RoBERTa tokenizes raw text directly; there is no mention of sacremoses in their paper. I also managed to further clean up the parallel corpus: there are 72k Abkhazian-Russian sentences with a ~95% accuracy rate.
Can I still train a decent bicleaner-ai in this situation?
Abkhazian could work; we've been using a Maltese model (Maltese is also not in XLMR) and it worked pretty decently.
95% accuracy is very good for Bicleaner AI training data. But, as far as I understood, your 72k corpus is the only resource you have? What exactly is your purpose after training the Bicleaner AI model? If you don't have other data available to filter, I don't see how you can improve your MT system.
EDIT: if you see that Abkhazian words are being split a lot by SPM, you could increase the maximum length of the model to avoid truncating the sentences too early.
your 72k corpus is the only resource you have?
No, I have around 100k-300k sentences, but they are noisy; I had to distill them down to 72k to get the 95% accuracy. There are more resources that I haven't tapped yet: more eBooks and websites. I also have a clean monolingual corpus of 1.2 million sentences.
What is it exactly your purpose after training Bicleaner AI model?
- I need a tool to help me filter current and new text.
- I am hoping to use it with back-translation: if I can reduce the noise in the synthetic corpus, that could improve the MT.
to avoid truncating the sentences too early.
Thanks for the heads up.
In case you don't know them, there are Bitextor and Bitextor neural for harvesting parallel corpora from websites. I imagine that for a language like Abkhazian, which has poor support in NLP tools, Bitextor neural would be nice because it performs very few language-specific operations and uses multilingual models for alignment.
there is Bitextor and Bitextor neural for harvesting parallel corpora from websites
That is definitely something I should be looking at.
@ZJaume I have a few questions regarding training on Kaggle.
@ZJaume Can you help train a bicleaner-ai model for Abkhazian-Russian? Then we could share it here. I lack the resources for training, and you have done this a zillion times, so you should get it right instantly.
I can provide you with the parallel corpus that we have. I think I can generate a synthetic corpus of 1.4 million sentences with 25-30% accuracy. I would really like to try bicleaner-ai for Abkhazian and see the results.
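As an aside, one common way to fabricate a noisy synthetic corpus like that is to misalign an existing parallel corpus by shuffling its target side. Note that bicleaner-ai-train already generates its own synthetic noise internally, so this is only an illustration of the idea; file names are placeholders:

```shell
# Toy parallel corpus: source<TAB>target.
printf 'a1\tb1\na2\tb2\na3\tb3\na4\tb4\n' > corpus.tsv

# Pair each source with a randomly shuffled target; most resulting
# pairs are wrong translations, i.e. the kind of noise a cleaner
# must learn to detect.
cut -f1 corpus.tsv > src.txt
cut -f2 corpus.tsv | shuf > tgt_shuf.txt
paste src.txt tgt_shuf.txt > negatives.tsv

wc -l < negatives.tsv
```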
I have trained a model on Kaggle. The results do correlate; this tool actually works! Now I have to enhance its accuracy. Here is a link to the Kaggle notebook, if someone else wants to train on Kaggle.
Hi, the slowness is probably because the TensorFlow version didn't match the installed CUDA version. If you need to check, run bicleaner-ai with TF_CPP_MIN_LOG_LEVEL=0 bicleaner-ai-train
and the TF logs will tell you whether it can find the CUDA libraries. If not, it runs anyway, but on CPU, which would explain the slowness.
With respect to training Bicleaner models: if you want me to train models for you, we can offer that service at Prompsit. If you are interested, please email me at jzaragoza@prompsit.com with some info about who you are and what project you are working on.
I'll keep that in mind, thank you.
Hello, I need your advice on usability of bicleaner ai in my situation.
I have a parallel Abkhazian-Russian corpus of around 100k sentences with an 85% accuracy rate (15% are wrong translations).
I have 1.2 million monolingual Abkhazian sentences.
I have an Abkhazian-Russian automatic translator with a 15% accuracy rate (out of 100 sentences, only 15 are translated correctly).
Is it possible to build an effective bicleaner, and what can I expect?
Best Regards, Nart