TAHIR0110 / ThereForYou

ThereForYou: Your mental health ally. Kai, our AI assistant, offers compassionate support. Track your mood trends, find solace in a secure community, and access crisis resources swiftly. We're here to empower your journey towards improved well-being, leveraging technology for a brighter tomorrow.

Indian languages to English language conversion model #12

Open TAHIR0110 opened 1 month ago

TAHIR0110 commented 1 month ago

I have an NLP model that predicts whether someone is in danger based on what they say (speech is converted to text, and the text is then analysed). However, it currently only works with English text. I'd like to make it work with Indian languages as well.

The problem is, my model needs English input. So, I need to convert text from any Indian language to English before feeding it into my model.
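A minimal sketch of the translate-then-classify flow described above. The names `assess_risk`, `translate`, and `classify` are hypothetical placeholders, not part of the existing project; the English check is a rough heuristic and will misread romanized Hindi as English.

```python
from typing import Callable


def is_probably_english(text: str) -> bool:
    """Rough heuristic: every alphabetic character is ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)


def assess_risk(
    text: str,
    translate: Callable[[str], str],  # hypothetical: any Indian-language-to-English translator
    classify: Callable[[str], bool],  # hypothetical: the existing English-only danger model
) -> bool:
    """Translate non-English input to English, then run the English-only classifier."""
    english = text if is_probably_english(text) else translate(text)
    return classify(english)
```

Keeping the translator as a plain callable means an API client or a locally hosted model can be swapped in without touching the classifier.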

ypahaly commented 1 month ago

Google Translation API can cover most Indian languages, but given the nature of this problem, the translation may need to capture context unique to each language more accurately than a general-purpose API does. Dedicated models will likely be needed to address this.

Please assign me this issue under GSSoC '24.

Also, could you specify the level of accuracy you are aiming for in this context, and what will the size limit of the text be?

ypahaly commented 1 month ago

Also, Google Translation API is free for 90 days only, so if we want to use an API we should probably go for one with a higher limit, like Microsoft Translator Text, which has a hard limit of 500,000 / month. That is more than sufficient for this use case.

ypahaly commented 1 month ago

@TAHIR0110

Blackphoenix-15 commented 1 month ago

I'm a GSSoC contributor and would like to work on this. I'd be grateful if @TAHIR0110 could assign me this issue.

Blackphoenix-15 commented 1 month ago

1.) Data Collection and Preparation:

I'll use web scraping techniques or publicly available datasets to collect parallel data containing sentences in Indian languages and their English translations. Then, I'll use tokenization and basic text cleaning methods to prepare the data, removing punctuation, special characters, and unnecessary whitespace.
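The cleaning step could look roughly like the sketch below; `clean_text` and `clean_pairs` are hypothetical helper names. It filters by Unicode category rather than a regex such as `[^\w\s]`, because in Python's `re` module `\w` does not match the combining vowel signs used in Indic scripts, so that pattern would delete them along with the punctuation.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Drop punctuation, symbols, and control characters; collapse whitespace.

    Letters and combining marks (Unicode categories L* and M*) are kept so that
    Indic vowel signs survive. Format characters (Cf) such as ZWJ/ZWNJ are kept too.
    """
    text = unicodedata.normalize("NFC", text)
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category[0] in ("P", "S") or category == "Cc":
            kept.append(" ")  # replace with a space so neighbouring words stay apart
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


def clean_pairs(pairs):
    """Clean (source, english) sentence pairs and drop pairs that end up empty."""
    cleaned = ((clean_text(src), clean_text(tgt)) for src, tgt in pairs)
    return [(src, tgt) for src, tgt in cleaned if src and tgt]
```

Whether punctuation should be removed at all is a judgment call: it can carry meaning for translation, so this is only one possible preprocessing choice.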

2.) Model Selection and Implementation:

For the machine translation model, I'll use frameworks like OpenNMT or MarianMT, which are well-suited for sequence-to-sequence translation tasks. After installing the chosen framework, I'll refer to the documentation to set up the translation model and configure its architecture and hyperparameters. The training process will involve feeding the prepared dataset into the model and iteratively refining it to improve translation quality.
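As a rough illustration of the MarianMT route, the sketch below uses the Hugging Face `transformers` classes `MarianMTModel` and `MarianTokenizer`. The checkpoint name `Helsinki-NLP/opus-mt-hi-en` (Hindi to English) is an assumption chosen for the example and is not mentioned in this thread; `load_marian` and `translate_batch` are hypothetical helper names. Running it needs `transformers`, `sentencepiece`, and `torch` installed, and the first call downloads the model.

```python
def load_marian(checkpoint: str = "Helsinki-NLP/opus-mt-hi-en"):
    """Load a pretrained MarianMT tokenizer and model (checkpoint name is an assumption)."""
    # Imported lazily so the module can be imported without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(checkpoint)
    model = MarianMTModel.from_pretrained(checkpoint)
    return tokenizer, model


def translate_batch(texts, tokenizer, model):
    """Translate a list of source sentences to English with a loaded tokenizer and model."""
    texts = list(texts)
    if not texts:
        return []
    # Pad to a common length and truncate inputs longer than the model allows.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    tok, mdl = load_marian()
    print(translate_batch(["मुझे मदद चाहिए"], tok, mdl))
```

A pretrained checkpoint like this covers one language pair, so other Indian languages would each need their own checkpoint or a multilingual model, and fine-tuning on the collected parallel data would be a separate step.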

3.) Evaluation and Testing:

To evaluate the model's performance, I'll use metrics such as the BLEU score or conduct human evaluation on sample translations. I'll test the model with inputs from Indian languages and assess the quality of the English translations. Based on the evaluation results, I'll fine-tune the model and iterate the training process as needed to achieve better translation accuracy.
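For reference, a simplified corpus-level BLEU can be sketched in plain Python as below. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, so the score is 0 whenever any n-gram order has no matches. The function names are illustrative; an established tool such as sacreBLEU would be the safer choice for reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU with clipped n-gram precision and a brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

An exact match scores 1.0, and a hypothesis that is a correct but shorter prefix of the reference is reduced only by the brevity penalty.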

This would be my approach, so please let me work on this.

TAHIR0110 commented 1 month ago

@Blackphoenix-15 assigned.

Blackphoenix-15 commented 1 month ago

Thanks a lot. Could you give me a deadline for this? My end-term examinations are going on, so I would appreciate having one. Regards

TAHIR0110 commented 1 month ago

@Blackphoenix-15 does the end of this month work? If you want more time, do let me know. And good luck with your end-term exams!!

Blackphoenix-15 commented 1 month ago

Thanks a lot for the consideration:)), I'll definitely do it by the end of this month.

Shirapti-nath commented 1 month ago

The documentation part is missing. I'd like to work on this project under your supervision. Kindly assign the task to me, sir.

gssoc

Hireath08 commented 1 month ago

Hi @TAHIR0110 I wish to work on this. Kindly assign me this issue

Blackphoenix-15 commented 1 month ago

@TAHIR0110 Thanks a lot for allowing me to focus on my end-semester exams. Now that they are over, I'll devote my time to the assigned project and complete it by the deadline.

Blackphoenix-15 commented 1 month ago

I also have a minor question. There are many Indian languages, and even Hindi has its own variations, so how far should I narrow down the dataset? Or should I include all the variations? Thanks.

TAHIR0110 commented 1 month ago

Including all the variations would be appreciated, as the label has also been upgraded to level 3.

Blackphoenix-15 commented 3 weeks ago

@TAHIR0110 I have made a prototype, subject to correction. I am attaching the Python notebook here: https://colab.research.google.com/drive/14WLUvzvge-OzI2_p2o5Fs_tvwLd29io8?usp=sharing It contains two code versions, both of which are working.

Blackphoenix-15 commented 3 weeks ago

![Uploading Screenshot 2024-05-23 130607.png…]()

I am also attaching a screenshot of some words and sentences I tested for translation. Please look into it, @TAHIR0110.

Blackphoenix-15 commented 3 weeks ago

@TAHIR0110 I would like to create a pull request, since I have solved the issue. Please go through this once.