Verification Script - Githubissues

SuryaKrishna02 / maya-dataset-creation

This repository contains the dataset-creation code for training Maya: Multilingual Aya Model

MIT License


Verification Script #4

Closed SuryaKrishna02 closed 3 months ago

SuryaKrishna02 commented 4 months ago

Develop a Verification Script to verify the translations of the generated dataset from the c4ai-aya-23 model.

Compute statistics of the translated sentences in terms of length, POS tagging (spaCy), and repeated words.
Based on these statistics, the verification script needs to find the sentences that are problematic and need to be translated again.
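
As an illustration of the length and repeated-word statistics, here is a minimal sketch in plain Python. It is not code from this repository: the function names, the whitespace tokenization, and the threshold values are all assumptions made for the example. Whitespace splitting in particular will not work for languages written without spaces, and POS tagging is left out because it needs a language-specific model.

```python
from collections import Counter


def translation_stats(source: str, translation: str) -> dict:
    """Compute simple statistics for one source/translation pair.

    Assumption: whitespace tokenization is good enough for a first pass.
    """
    src_tokens = source.split()
    tgt_tokens = translation.split()
    counts = Counter(token.lower() for token in tgt_tokens)
    most_common_count = max(counts.values()) if counts else 0
    return {
        # Translation length relative to the source length, in tokens.
        "length_ratio": len(tgt_tokens) / len(src_tokens) if src_tokens else 0.0,
        # Share of the translation taken up by its single most frequent token.
        "max_repeat_fraction": most_common_count / len(tgt_tokens) if tgt_tokens else 0.0,
    }


def is_suspicious(stats: dict, min_ratio: float = 0.5,
                  max_ratio: float = 2.0, max_repeat: float = 0.5) -> bool:
    """Flag a pair whose statistics fall outside hypothetical thresholds."""
    return (
        stats["length_ratio"] < min_ratio
        or stats["length_ratio"] > max_ratio
        or stats["max_repeat_fraction"] > max_repeat
    )
```

In practice the thresholds would be chosen per language after looking at the distribution of these statistics over the dataset, rather than fixed up front.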

Please feel free to add more ideas here on how to better detect faulty translations from the model.

asusevski commented 4 months ago

Hi! I will be making a PR soon; in the meantime, I wanted to discuss methodology. I want to add a few different methods, one of which would be to use an LLM as a judge to verify the conciseness, consistency, etc. of the translation. @Asnegha and I will also include statistical methods to flag problematic translations. Does this sound alright?

SuryaKrishna02 commented 4 months ago

@asusevski @Asnegha Sounds great. Thanks for taking up this work. I think it is better to connect and discuss this over a call so that we can scope what we can do for the Aya Expedition. Otherwise, you can list here the brief steps/methodologies you are planning to follow.

asusevski commented 4 months ago

Tasks to be completed over the next week to close out this issue:

- implement chrF++ score
  - apply to back-translations and verify distribution (find threshold for bad translations)
- verify repeated tokens in translation
- part-of-speech tagging
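
To make the back-translation check concrete, here is a rough sketch of a character n-gram F-score and a quantile-based cutoff. It is a simplification for illustration only: the full chrF++ metric also includes word n-grams, and a maintained implementation such as the one in sacreBLEU would likely be the better choice for real use. The names `simple_chrf` and `flag_low_scores` and the 10% quantile are hypothetical, not taken from this repository.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after normalising whitespace."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified character n-gram F-score in [0, 1] (not full chrF++)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta score; beta = 2 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def flag_low_scores(scores: list, quantile: float = 0.1) -> list:
    """Return indices of scores at or below the given quantile of the distribution."""
    if not scores:
        return []
    ordered = sorted(scores)
    cutoff = ordered[int(quantile * (len(ordered) - 1))]
    return [i for i, score in enumerate(scores) if score <= cutoff]
```

The intended use would be to score each back-translation against its original source sentence, inspect the resulting distribution, and then pick the cutoff from that distribution instead of relying on the default quantile.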