ChakshuGautam commented 1 year ago

Project Details

AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. It abstracts away the details of how a model works, similar to Hugging Face, and exposes a clean API that you can orchestrate at the BFF (backend-for-frontend) level.

Features to be implemented

Neural coreference resolution (overview) is a natural language processing (NLP) task that involves identifying when two or more words or phrases in a text refer to the same entity or concept. This process is crucial for understanding the context and meaning of a text, as it helps in resolving ambiguities and connecting different parts of a text.

For example:

Non-coreferenced conversation:
User: 'Can you tell me where are the shops for paddy seeds?'
User: 'What is the price for them?'

Coreferenced conversation:
User: 'Can you tell me where are the shops for paddy seeds?'
User: 'What is the price for paddy seeds?'

How it works

Neural coreference resolution models usually involve various components, such as:

Feature Extraction:

Detecting the different parts of a sentence and their relationships with each other

Mention detection:

Identifying potential words or phrases (mentions) that may be involved in coreference relations (they all refer to the same entity)

Pairwise scoring:

Computing a score for each pair of mentions, representing the likelihood that they corefer.

Clustering:

Grouping mentions into clusters, where each cluster represents a single entity or concept.

Replacement:

For the last message in the conversation (or for all messages after the first), replace each mention in a cluster with the cluster's most descriptive mention, so that the message can be understood on its own (for example, 'them' becomes 'paddy seeds').

Deployment:

Deploy the above setup as part of the ai-tools package so that it can be dockerized and then accessed through an API.
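
The pairwise scoring and clustering steps listed above can be illustrated with a toy sketch. The scores below are written by hand rather than produced by a model, and the `cluster_mentions` name and the fixed threshold are assumptions made only for illustration; trained coreference models learn their scores and usually pick antecedents by ranking instead of a hard cut-off. Mentions are plain strings here, which would break if the same phrase appeared twice.

```python
from collections import defaultdict


def cluster_mentions(mentions, pair_scores, threshold=0.5):
    """Group mentions whose pairwise score reaches the threshold (union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            # Merge the cluster of the later mention into the earlier one.
            parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for m in mentions:  # text order, so each cluster keeps that order
        clusters[find(m)].append(m)
    # A singleton corefers with nothing, so drop it.
    return [c for c in clusters.values() if len(c) > 1]


# Hand-written (hypothetical) scores for the paddy seeds example.
mentions = ["the shops", "paddy seeds", "them"]
pair_scores = {
    ("the shops", "paddy seeds"): 0.05,
    ("the shops", "them"): 0.20,
    ("paddy seeds", "them"): 0.90,
}
clusters = cluster_mentions(mentions, pair_scores)
print(clusters)
```

With these scores only 'paddy seeds' and 'them' end up in the same cluster, which is what the replacement step then needs.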

Learning Path

Complexity

Hard

Skills Required

Python, NLP

Name of Mentors:

@GautamR-Samagra

Project size

8 Weeks

Product Set Up

See the setup here. The setup is just an example or baseline solution showing how the task could be carried out. We are currently focusing this project on using the spaCy model to carry out coreference. Exploring other techniques is encouraged only if this one fails.

Acceptance Criteria

[ ] Unit Test Cases
[ ] OpenAPI Spec/Postman Collection
[ ] Dockerfile for this module

Milestone

Setting up spacy
Identifying type and format of data required to train
Creating synthetic data/collecting data for training
Training the model based on collected data
Carry out automated testing for models / integrate training based on our test cases
New trained model based on trained data

Reference

https://galhever.medium.com/a-review-to-co-reference-resolution-models-f44b4360a00

https://explosion.ai/blog/coref

https://github.com/explosion/spaCy/discussions/11585#discussioncomment-3970887

C4GT

This issue is nominated for Code for GovTech (C4GT) 2023 edition. C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/

The scope of this ticket has now expanded to make it the 'enabling conversation' part of the 'FAQ bot'. The FAQ bot allows a user to provide content in the form of CSVs, free text, PDFs, audio, and video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text or speech on related content, and the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.

This ticket covers the enabling conversation + small model deployment/fine-tuning pipeline. It includes the following tasks in its scope:

[x] #174
[x] #175
[x] #176
[x] #177
[x] #178
[x] #179
[x] #180
[x] #181
[x] #182
[x] #183
[ ] Setting up pipeline for creating synthetic data, training models and storing them https://github.com/Samagra-Development/ai-tools/issues/144
[x] #184
[x] #201
[x] #202
[x] #203

ahsmha commented 1 year ago

@ChakshuGautam please provide a little bit more information about the issue. I'd like to work on it.

ChakshuGautam commented 1 year ago

@ahsmha Added more details to it and assigned it to you.

Yogesh-7523 commented 1 year ago

Sir, if the issue isn't resolved, can you assign it to me? :)

ahsmha commented 1 year ago

@ChakshuGautam I'm not working on it anymore, you can assign it to @Yogesh-7523

ChakshuGautam commented 1 year ago

Adding more test cases. Some of them are bad because they were auto-generated. This file needs cleaning. Looking for support here.

coref.test.txt

Dhruv88 commented 1 year ago

@ChakshuGautam could you clarify what kind of cleaning is required? I can work on it. I went through the data, and one thing I noticed is that the format is not uniform. If you give me a format, I can try to generate/clean the data in that format.

rishav-eulb commented 1 year ago

@ChakshuGautam please have a look. training.txt Test.txt

ChakshuGautam commented 1 year ago

@rishav-eulb this looks fine. Can you raise a PR for +ve test cases and training data? @GautamR-Samagra will review this. Thanks.

rishav-eulb commented 1 year ago

@ChakshuGautam I have raised PR for training data

rishav-eulb commented 1 year ago

@ChakshuGautam I have added more +test cases and raised PR, please review. I have tried to add varied examples to train edge cases also.

ChakshuGautam commented 1 year ago

Hey @rishav-eulb, I am not seeing your PR. Can you share here?

rishav-eulb commented 1 year ago

+ve_test_case.txt training.txt

ChakshuGautam commented 1 year ago

Raise a PR directly. Great work!!!

Jiya126 commented 1 year ago

The 'en_coreference_web_trf' spaCy model used in src/coref/spacy/local/model.py is not supported. I suggest we use 'en_core_web_sm' instead.

ItshMoh commented 1 year ago

@ChakshuGautam I cleaned coref.text. This is the cleaned version: coref_text_cleaned.txt

ItshMoh commented 1 year ago

I have cleaned up the examples where the input and output are the same. Example:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

The correct version should look like this:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about it?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

There were multiple examples like this. There were also some examples where I had to rephrase the sentence because it did not follow the pattern of test case we want.

Jiya126 commented 1 year ago

Hey @ItshMoh, are you passing the input through model.py? I'm having some trouble with the file: the output is the same as the input when I pass it through model.py.

ItshMoh commented 1 year ago

@Jiya126 I have not passed the input through model.py. I have just cleaned coref.text, as it contained some errors in the test cases. Can you share the code you use to run model.py and how you are calling the inference method? It would be very helpful.

Jiya126 commented 1 year ago

import spacy

class Model:
    def inference(self):
        text = "What is the recommended planting depth for carrots? How far apart should I plant them"
        nlp = spacy.load("en_core_web_trf")
        doc = nlp(text)
        offset = 0
        reindex = []
        for chain in doc.spans:
            for idx, span in enumerate(doc.spans[chain]):
                if idx > 0:
                    reindex.append([span.start_char, span.end_char, doc.spans[chain][0].text])

        for span in sorted(reindex, key=lambda x: x[0]):
            text = text[0:span[0] + offset] + span[2] + text[span[1] + offset:]
            offset += len(span[2]) - (span[1] - span[0])

        return {"text": text}

model = Model()
result = model.inference()
print(result)

I'm using this to test the model.py inference function, but it returns the input unchanged: {'text': 'What is the recommended planting depth for carrots? How far apart should I plant them'}

Here, doc.spans is not identifying any entities, and neither is doc.ents.
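
To separate the string-rewriting logic from the model, the replacement step can be tested on its own with hand-written spans. The sketch below is only an illustration: `replace_mentions` and the cluster layout (lists of character offsets) are assumptions, and the spans are typed in by hand rather than predicted. As far as I know, the standard en_core_web_trf pipeline has no coreference component, which would leave doc.spans empty and explain why the loop above never replaces anything.

```python
def replace_mentions(text, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.

    `clusters` is a list of clusters; each cluster is a list of
    (start_char, end_char) spans in text order.
    """
    edits = []
    for spans in clusters:
        if len(spans) < 2:
            continue  # nothing to replace in a singleton cluster
        head_start, head_end = spans[0]
        head = text[head_start:head_end]
        for start, end in spans[1:]:
            edits.append((start, end, head))

    # Apply edits from right to left so earlier offsets stay valid
    # and no running offset has to be tracked.
    for start, end, head in sorted(edits, reverse=True):
        text = text[:start] + head + text[end:]
    return text


text = "What is the recommended planting depth for carrots? How far apart should I plant them"
noun = text.index("carrots")
pronoun = text.index("them")
# Hand-written cluster standing in for what a coreference model would predict.
clusters = [[(noun, noun + len("carrots")), (pronoun, pronoun + len("them"))]]
resolved = replace_mentions(text, clusters)
print(resolved)
```

With an empty cluster list the function returns the text unchanged, which matches the behaviour reported above.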

ErShivam123 commented 1 year ago

I have some doubts regarding C4GT. I want to submit my proposal on this project, but multiple people are already working on it. Can I still select this project for my proposal, or do I have to go for another? And how many people will be selected as contributors per project?

Gautam-Rajeev commented 1 year ago

I have cleaned up the examples where the input and output are the same. Example:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

The correct version should look like this:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about it?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

There were multiple examples like this. There were also some examples where I had to rephrase the sentence because it did not follow the pattern of test case we want.

I think it makes sense to have some test cases where nothing needs to be done as coreference because the last sentence is already 'coreferenced'. You can leave these as they are.
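
A minimal sketch of how such no-op cases could sit next to the corrected ones in an exact-match check. The `exact_match_accuracy` helper, the `identity` baseline, and the shortened test strings are assumptions for illustration, not the format of the actual test file.

```python
def exact_match_accuracy(cases, resolve):
    """Return the fraction of (input, expected) pairs that resolve() gets exactly right."""
    if not cases:
        return 0.0
    hits = sum(resolve(given).strip() == expected.strip() for given, expected in cases)
    return hits / len(cases)


cases = [
    # Pronoun in the last question must be replaced.
    ("Q: Can you share some interesting facts about it?",
     "Q: Can you share some interesting facts about Denali?"),
    # Already resolved: the expected output equals the input.
    ("Q: Can you share some interesting facts about Denali?",
     "Q: Can you share some interesting facts about Denali?"),
]


def identity(text):
    # Baseline that never rewrites anything.
    return text


baseline = exact_match_accuracy(cases, identity)
print(f"identity baseline: {baseline:.2f}")
```

Because a do-nothing baseline gets credit for every no-op case, it may be worth reporting accuracy on the two kinds of cases separately.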

Gautam-Rajeev commented 1 year ago

I have some doubts regarding C4GT. I want to submit my proposal on this project, but multiple people are already working on it. Can I still select this project for my proposal, or do I have to go for another? And how many people will be selected as contributors per project?

The current solution is just a baseline model built for testing and is not currently used in any product, as it does not pass all our test cases. We are using GPT for this instead. We want to use a non-GPT model (for speed reasons) and get its accuracy to a high level (~99%), as it's a foundational block for all conversational bots (it allows working with only the last message rather than the entire history). spaCy seems to allow us to fine-tune the model with our own data, and if that helps us pass all test cases, then this approach is valid and very useful for us. Otherwise, any approach that carries out coreference correctly works.

This project will only have 1 person selected as contributor.

Gautam-Rajeev commented 1 year ago

There seem to be package dependency issues in the current setup. The setup below works: link. @Jiya126, your example is also included here.

ItshMoh commented 1 year ago

I have some doubts regarding C4GT. I want to submit my proposal on this project, but multiple people are already working on it. Can I still select this project for my proposal, or do I have to go for another? And how many people will be selected as contributors per project?

The current solution is just a baseline model built for testing and is not currently used in any product, as it does not pass all our test cases. We are using GPT for this instead. We want to use a non-GPT model (for speed reasons) and get its accuracy to a high level (~99%), as it's a foundational block for all conversational bots (it allows working with only the last message rather than the entire history). spaCy seems to allow us to fine-tune the model with our own data, and if that helps us pass all test cases, then this approach is valid and very useful for us. Otherwise, any approach that carries out coreference correctly works.

This project will only have 1 person selected as contributor.

I have cleaned up the examples where the input and output are the same. Example:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

The correct version should look like this:

Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about it?

Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?

There were multiple examples like this. There were also some examples where I had to rephrase the sentence because it did not follow the pattern of test case we want.

I think it makes sense to have some test cases where nothing needs to be done as coreference because the last sentence is already 'coreferenced'. You can leave these as they are.

ok sir

Gautam-Rajeev commented 1 year ago

Just highlighting key issues from the call -

We are focusing this project on the spaCy implementation detailed in the setup. More details are at the discussion link and blog.
If the training of the spaCy model fails, we shall explore other options. For those writing proposals, we suggest spending around 70% of the proposal (the implementation details part) on improving the spaCy model itself.

Shraddha063 commented 1 year ago

I'm interested and looking forward to submitting a proposal on this project. Can you please guide me?

Gautam-Rajeev commented 1 year ago

I'm interested and looking forward to submitting a proposal on this project. Can you please guide me?

Hi, is there anything specific you want to know? A short summary would be that we are trying to improve the current implementation here, which uses this. We need to train/fine-tune this using the approach discussed here.

You can also follow the discussions in discord bot

Sakalya100 commented 1 year ago

I am interested in working on this project and submitting a proposal. Can you guide me on what should be included in the proposal? It should describe how I plan to implement the neural coreference task for input sentences, right? Or am I missing something? Please let me know and guide me.

kaushalbsheth commented 1 year ago

Hi, I have submitted my proposal and am looking forward to working on this project.

Gautam-Rajeev commented 1 year ago

@Jiya126 please comment the PR you raised here. Great work! Converting .txt file to conll format required for training of the spacy model - here
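
For readers unfamiliar with the target format, here is a rough sketch of the bracketed coreference column used in CoNLL-style files. It is not the code from the PR: `coref_column`, the whitespace tokenization, and the hand-written cluster are assumptions for illustration, and a full CoNLL-2012 file carries many more columns than the token and coreference tag shown here.

```python
def coref_column(num_tokens, clusters):
    """Build a CoNLL-style coreference column for one tokenized text.

    `clusters` maps a cluster id to a list of (first_token, last_token)
    index pairs, both inclusive.
    """
    opens = [[] for _ in range(num_tokens)]
    singles = [[] for _ in range(num_tokens)]
    closes = [[] for _ in range(num_tokens)]
    for cid, spans in clusters.items():
        for first, last in spans:
            if first == last:
                singles[first].append(f"({cid})")  # one-token mention
            else:
                opens[first].append(f"({cid}")     # mention starts here
                closes[last].append(f"{cid})")     # mention ends here
    column = []
    for i in range(num_tokens):
        parts = opens[i] + singles[i] + closes[i]
        # Tokens outside any mention get "-"; multiple tags are joined with "|".
        column.append("|".join(parts) if parts else "-")
    return column


tokens = "Where are the shops for paddy seeds ? What is the price for them ?".split()
# Hand-written cluster: "paddy seeds" (tokens 5-6) and "them" (token 13).
clusters = {0: [(5, 6), (13, 13)]}
column = coref_column(len(tokens), clusters)
for i, (token, tag) in enumerate(zip(tokens, column)):
    print(i, token, tag, sep="\t")
```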

Jiya126 commented 1 year ago

@Jiya126 please comment the PR you raised here. Great work! Converting .txt file to conll format required for training of the spacy model - here

Raised the PR!!

Jiya126 commented 1 year ago

@Jiya126 please comment the PR you raised here. Great work! Converting .txt file to conll format required for training of the spacy model - here

Raised the PR!!

Could you review the PR

ksgr5566 commented 1 year ago

Progress so far:

Calculated the accuracy of available test cases, for both spaCy and fcoref models. Updated here.
Since the performance of fcoref and spaCy is comparable, and fcoref has clearer documentation for fine-tuning, I have proceeded with setting up a pipeline for training the fcoref model on custom data and deploying it on Hugging Face. The colab gist is here.
Identified issue #188.
Trouble generating the data required to train fcoref. Look here for a better description.

Going Forward:

Continue as mentioned here.
Fine-tune a seq-to-seq model (e.g. BART-squad) as a question answering model which takes in the conversation as input and outputs the final user's question, modified to include the contextual information required for the chatbot to generate an accurate answer. This is a possible solution for #188. If this works out, this issue too can be closed, as neural coreference resolution would no longer be required. [Generating training data for this with GPT (point 1 in this comment) is much simpler than asking it to accurately identify co-referring entity clusters.]

Samagra-Development / ai-tools

[C4GT] Neural coreference for enabling conversational flow in bots #42
