AI4Bharat / IndicLID

Language Identification for Indian languages
12 stars 4 forks source link

idea: use GPT-4 for synthetic data #1

Open tfriedel opened 1 year ago

tfriedel commented 1 year ago

Hi! Cool project, I was looking for some language identification for romanized indian languages. The performance doesn't seem to be as high as for other languages (i.e. doesn't approach 99%). This must surely be because of the data used. One idea would be to try gpt-4 for translating between non-romanized and romanized text. Have you considered this? Sure it's expensive, but you may ask for free credits from OpenAI for such a good cause. Of course you'd first need to see if this works reasonably well.