MLM Analysis of ChatGPT using NLP

abhisheks008 commented 6 months ago

Deep Learning Simplified Repository (Proposing new issue)

:red_circle: Project Title : MLM Analysis of ChatGPT using NLP
:red_circle: Aim : The aim of this project is to analyze the MLM model using NLP methods.
:red_circle: Dataset : https://www.kaggle.com/datasets/pe4eniks/chatgpt-for-mlm
:red_circle: Approach : Try to use 3-4 algorithms to implement the models and compare them by accuracy score to find the best-fitting algorithm. Also, do not forget to do an exploratory data analysis before creating any model.

📍 Follow the Guidelines to Contribute in the Project :

You need to create a separate folder named after the Project Title.
Inside that folder, there will be four main components.
- Images - To store the required images.
- Dataset - To store the dataset or, information/source about the dataset.
- Model - To store the machine learning model you've created using the dataset.
- requirements.txt - This file will contain the packages/libraries required to run the project on other machines.
Inside the Model folder, the README.md file must be filled in properly, with proper visualizations and conclusions.
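
As a rough illustration of the layout described above, here is a minimal Python sketch that creates the folders and empty placeholder files. The helper name `scaffold_project` is hypothetical and is not part of the repository or its tooling.

```python
from pathlib import Path


def scaffold_project(root, title):
    """Create the folder layout from the guidelines (hypothetical helper)."""
    project = Path(root) / title
    # Three sub-folders: images, dataset (or its source info), and the model.
    for sub in ("Images", "Dataset", "Model"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # requirements.txt lists the packages needed to run the project elsewhere.
    (project / "requirements.txt").touch()
    # The README inside Model holds the visualizations and conclusions.
    (project / "Model" / "README.md").touch()
    return project
```

Calling `scaffold_project(".", "MLM Analysis of ChatGPT using NLP")` from the repository root would create the folder next to the existing projects.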

:red_circle::yellow_circle: Points to Note :

The issues will be assigned on a first-come, first-served basis, 1 Issue == 1 PR.
"Issue Title" and "PR Title" should be the same. Include the issue number along with it.
Follow the Contributing Guidelines & Code of Conduct before you start contributing.

:white_check_mark: To be Mentioned while taking the issue :

Full name :
GitHub Profile Link :
Email ID :
Participant ID (if applicable):
Approach for this Project :
What is your participant role? (Mention the Open Source program)

Happy Contributing 🚀

All the best. Enjoy your open source journey ahead. 😎

NiharJani2002 commented 6 months ago

Request: To Assign Me This Issue

Full Name: Nihar Mahesh Jani
GitHub Profile Link: https://github.com/NiharJani2002
Email ID: nihar.j@ahduni.edu.in
Participant ID: https://quine.sh/user/NiharJani2002

Approach for this Project:

Exploratory Data Analysis (EDA):

- Load and clean the data: address missing values or inconsistencies, and verify data quality and integrity.
- Understand the distribution of features: visualize word frequencies, token lengths, and other relevant metrics, and explore potential correlations or patterns.
- Identify potential biases or limitations: assess domain specificity or distributional biases, and consider ethical implications and responsible use.
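
A minimal sketch of the word-frequency and token-length part of this EDA, using only the standard library. It assumes the dataset's text column has already been loaded into a plain list of strings. The function name `summarize_texts` and whitespace tokenization are illustrative assumptions, not something the dataset prescribes.

```python
from collections import Counter
from statistics import mean


def summarize_texts(texts, top_k=3):
    """Basic text statistics for EDA (hypothetical helper)."""
    # Drop missing or whitespace-only entries before computing statistics.
    cleaned = [t.strip() for t in texts if t and t.strip()]
    # Assumption: lowercase whitespace tokenization is good enough for a first look.
    tokens_per_text = [t.lower().split() for t in cleaned]
    lengths = [len(toks) for toks in tokens_per_text]
    freq = Counter(tok for toks in tokens_per_text for tok in toks)
    return {
        "n_texts": len(cleaned),
        "mean_tokens": mean(lengths) if lengths else 0.0,
        "max_tokens": max(lengths, default=0),
        "top_words": freq.most_common(top_k),
    }
```

The returned counts can be passed to any plotting library for the visualizations mentioned above.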

Model Selection and Implementation:

- Choose 3-4 suitable algorithms:
  - Recurrent Neural Networks (RNNs): LSTM or GRU architectures for sequential modeling.
  - Transformer-based models: BERT, DistilBERT, or ALBERT for masked language modeling.
  - Static word embedding models: Word2Vec or GloVe for word-level representations.
  - Other potential options: attention-based RNNs, CNN-LSTM hybrids.
- Implement each model: use libraries like TensorFlow, PyTorch, or Hugging Face Transformers, and plan hyperparameter tuning and optimization strategies carefully.

Model Training and Evaluation:

- Split the dataset: a training set for model learning, a validation set for hyperparameter tuning and model selection, and a test set for final evaluation.
- Train each model: monitor training progress and adjust hyperparameters as needed.
- Evaluate performance: use accuracy scores as the primary metric, and consider additional metrics like precision, recall, F1-score, perplexity, or ROUGE scores based on specific analysis goals. Analyze error patterns to identify areas for improvement.
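
The split and the basic metrics could look roughly like the sketch below. It assumes the task is framed as binary classification with labels 0 and 1, which the issue does not specify. The 80/10/10 ratio, the seed, and both function names are illustrative choices.

```python
import random


def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Keeping the test split untouched until the final evaluation is what makes the reported accuracy comparable across models.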

Comparison and Selection:

Compare accuracy scores and other relevant metrics across models. Consider model complexity, training time, and interpretability. Select the best-performing model based on a comprehensive assessment.
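
One possible way to make this comparison explicit is to rank the models by accuracy and break ties with training time, as in the sketch below. The function name `rank_models` and the result keys `accuracy` and `train_seconds` are assumptions for illustration, and the tie-breaking rule is one choice among many.

```python
def rank_models(results):
    """Return model names, best first.

    `results` maps a model name to a dict with the (assumed) keys
    "accuracy" and "train_seconds".
    """
    return sorted(
        results,
        # Higher accuracy first; among equal accuracies, faster training first.
        key=lambda name: (-results[name]["accuracy"], results[name]["train_seconds"]),
    )
```

Interpretability and model complexity are harder to reduce to a single number, so they would still need a written comparison alongside this ranking.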

Additional Considerations:

- Experiment with different hyperparameters and model architectures.
- Explore techniques like transfer learning or fine-tuning to leverage pre-trained models.
- Consider using attention mechanisms for better interpretability.
- Visualize model outputs and attention weights for qualitative insights.
- Incorporate error analysis and feedback loops to improve model performance iteratively.
- Document the process, findings, and conclusions rigorously for reproducibility and knowledge sharing.

What is your participant role? (Mention the Open Source program): SWOC S4

abhisheks008 commented 5 months ago

Issue assigned to you @NiharJani2002

NiharJani2002 commented 5 months ago

When I clone the DL-Simplified repo to my local PC, there are many sub-folders with different project names. Which one should I work on, or do I have to start a new one?

@abhisheks008

abhisheks008 commented 5 months ago

You have to create your own project; you don't need to update the existing ones.

AgrawalTitiksha commented 1 month ago

Full name : Titiksha Agrawal
GitHub Profile Link : https://www.github.com/AgrawalTitiksha/
Email ID : agrawaltn2311@gmail.com
Participant ID (if applicable):
Approach for this Project : Conduct exploratory data analysis on a synthetic dataset of queries related to computer vision, preprocess the text data using NLP techniques and BERT, build and compare the performance of three DL algorithms (LSTM, CNN, and Transformer), and select the best-fitted algorithm based on accuracy scores to enhance MLM model effectiveness.
What is your participant role? (Mention the Open Source program) : GSSOC'24 Contributor

abhisheks008 commented 1 month ago

Assigned to you @AgrawalTitiksha

khushi-igupta commented 1 month ago

Hello @abhisheks008, I am a GSSOC'24 contributor and I would like to work on this issue. Can you please assign it to me?

abhisheks008 commented 1 month ago

@khushi-igupta This issue is already assigned to a contributor.
