Closed ChakshuGautam closed 9 months ago
@ChakshuGautam please provide a little bit more information about the issue. I'd like to work on it.
@ahsmha Added more details to it and assigned it to you.
Sir, if the issue isn't resolved, can you assign it to me? :)
@ChakshuGautam I'm not working on it anymore, you can assign it to @Yogesh-7523
Adding more test cases. Some of them are bad because they were auto-generated, so they need to be cleaned up. Looking for support here.
@ChakshuGautam could you clarify what kind of cleaning is required? I can work on it. I went through the data and one thing I noticed is that the format is not uniform. If you give me a format, I can try to generate/clean the data in that format.
@ChakshuGautam please have a look. training.txt Test.txt
@rishav-eulb this looks fine. Can you raise a PR for +ve test cases and training data? @GautamR-Samagra will review this. Thanks.
@ChakshuGautam I have raised a PR for the training data.
@ChakshuGautam I have added more +ve test cases and raised a PR, please review. I have also tried to add varied examples to train on edge cases.
Hey @rishav-eulb, I am not seeing your PR. Can you share it here?
Raise a PR directly. Great work!!!
The 'en_coreference_web_trf' spaCy model used in src/coref/spacy/local/model.py is not supported. I suggest we use 'en_core_web_sm' instead.
@ChakshuGautam I cleaned coref.text. This is the cleaned version: coref_text_cleaned.txt
I have cleaned up the examples where the input and output are the same. Example:
5. Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
The correct version would look like this:
- Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about it?
Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
There were multiple examples like this. There were also some examples where I had to rephrase the sentence because it did not follow the test-case pattern we want.
Hey @ItshMoh, are you passing the input through model.py? I'm having some trouble with the file: the output comes back the same as the input when I pass it through model.py.
@Jiya126 I have not passed the input through model.py. I have only cleaned coref.text, as it contained some errors in the test cases. Can you share the code you use to run model.py and show how you are calling the inference method? It would be very helpful.
import spacy


class Model:
    def inference(self):
        text = "What is the recommended planting depth for carrots? How far apart should I plant them"
        nlp = spacy.load("en_core_web_trf")
        doc = nlp(text)
        offset = 0
        reindex = []
        # Collect every non-first mention of each span group, paired with the first mention's text.
        for chain in doc.spans:
            for idx, span in enumerate(doc.spans[chain]):
                if idx > 0:
                    reindex.append([span.start_char, span.end_char, doc.spans[chain][0].text])
        # Apply replacements left to right, tracking how much the text length has changed so far.
        for span in sorted(reindex, key=lambda x: x[0]):
            text = text[0:span[0] + offset] + span[2] + text[span[1] + offset:]
            offset += len(span[2]) - (span[1] - span[0])
        return {"text": text}


model = Model()
result = model.inference()
print(result)
I'm using this to test the model.py inference function, but it returns {'text': 'What is the recommended planting depth for carrots? How far apart should I plant them'}
Here, doc.spans does not contain any entities, and neither does doc.ents.
I have some doubts regarding C4GT. I want to submit my proposal on this project, but multiple people are already working on it. Can I still select this project for my proposal, or do I have to go for another one? And how many people will be selected as contributors per project?
I have cleaned up the examples where the input and output are the same. Example:
5. Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
The correct version would look like this:
- Input: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about it?
Output: Q: What is the highest peak in North America? A: The highest peak in North America is Denali, also known as Mount McKinley. Q: Can you share some interesting facts about Denali?
There were multiple examples like this. There were also some examples where I had to rephrase the sentence because it did not follow the test-case pattern we want.
I think it makes sense to have some test cases where nothing needs to be done as coreference because the last sentence is already 'coreferenced'. You can leave these as they are.
The current solution is just a baseline model built for testing and is not currently used in any product, as it does not work for all our test cases. We are using GPT for this instead. We want to use a non-GPT model (for speed reasons) and get its accuracy to a high level (~99%), as it is a foundational block for all conversational bots (it lets a bot work with only the last message rather than the entire history). spaCy seems to allow us to fine-tune the model with our own data, and if that helps us pass all test cases, then this approach is valid and very useful for us. Otherwise, any approach that is able to carry out coreference correctly works.
This project will only have 1 person selected as contributor.
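A rough sketch of how the "pass all test cases" target could be measured, assuming the test cases follow the Input:/Output: pattern shown earlier in this thread. The function names (parse_cases, exact_match_accuracy) and the exact file layout are illustrative assumptions, not part of the repository.

```python
import re
from typing import Callable, List, Tuple


def parse_cases(raw: str) -> List[Tuple[str, str]]:
    """Collect (input, expected_output) pairs from 'Input:' / 'Output:' lines.

    Assumed layout: each case is an 'Input:' line followed by an 'Output:' line,
    optionally prefixed with a bullet ('-') or a number ('5.').
    """
    cases = []
    pending_input = None
    for line in raw.splitlines():
        # Drop a leading bullet or case number, if present.
        line = re.sub(r"^\s*(?:-|\d+\.)\s*", "", line).strip()
        if line.startswith("Input:"):
            pending_input = line[len("Input:"):].strip()
        elif line.startswith("Output:") and pending_input is not None:
            cases.append((pending_input, line[len("Output:"):].strip()))
            pending_input = None
    return cases


def exact_match_accuracy(cases: List[Tuple[str, str]],
                         resolve: Callable[[str], str]) -> float:
    """Fraction of cases where resolve(input) equals the expected output exactly."""
    if not cases:
        return 0.0
    passed = sum(1 for text, expected in cases if resolve(text) == expected)
    return passed / len(cases)


raw = """\
- Input: Q: What is the highest peak in North America? A: Denali. Q: Can you share some facts about it?
Output: Q: What is the highest peak in North America? A: Denali. Q: Can you share some facts about Denali?
- Input: Q: What is the highest peak in North America? A: Denali. Q: Can you share some facts about Denali?
Output: Q: What is the highest peak in North America? A: Denali. Q: Can you share some facts about Denali?
"""

cases = parse_cases(raw)


def identity(text: str) -> str:
    # Baseline resolver that changes nothing; it only passes the already-resolved case.
    return text


accuracy = exact_match_accuracy(cases, identity)
print(accuracy)  # 0.5
```

A do-nothing resolver already passes every case whose last sentence is already resolved, so the share of such cases sets a floor on the score. It may be worth reporting accuracy on those cases separately.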
There seem to be package dependency issues in the current setup. The setup below works: link. @Jiya126, your example is also included here.
ok sir
Just highlighting key issues from the call -
I'm interested in this project and looking forward to submitting a proposal. Can you please guide me?
Hi, is there anything specific you want to know? A short summary would be that we are trying to improve the current implementation here, which uses this. We need to train/fine-tune this using the approach discussed here.
You can also follow the discussions in discord bot
I am interested in working on this project and submitting a proposal. Can you guide me on what should go into the proposal? It should describe how I plan to implement the neural coreference task for input sentences, right? Or am I missing something? Please let me know and guide me.
Hi, I have submitted my proposal and am looking forward to working on this project.
@Jiya126 please comment the PR you raised here. Great work! Converting .txt file to conll format required for training of the spacy model - here
Raised the PR!!
Could you review the PR
Progress so far:
Going Forward:
Project Details
AI Toolchain is a collection of tools for quickly building and deploying machine learning models for various use cases. Currently, the toolchain includes a text translation model, and more models may be added in the future. It abstracts the dirty details of how a model works similar to Huggingface and gives a clean API that you can orchestrate at a BFF level.
Features to be implemented
Neural coreference resolution (overview) is a natural language processing (NLP) task that involves identifying when two or more words or phrases in a text refer to the same entity or concept. This process is crucial for understanding the context and meaning of a text, as it helps in resolving ambiguities and connecting different parts of a text.
For example:
Non-coreferenced conversation:
User: 'Can you tell me where are the shops for paddy seeds?'
User: 'What is the price for them?'
Coreferenced conversation:
User: 'Can you tell me where are the shops for paddy seeds?'
User: 'What is the price for paddy seeds?'
How it works
Neural coreference resolution models usually involve various components, such as:
Feature Extraction:
Detecting the different parts of a sentence and their relationships with each other
Mention detection:
Identifying potential words or phrases (mentions) that may be involved in coreference relations (they all refer to the same entity)
Pairwise scoring:
Computing a score for each pair of mentions, representing the likelihood that they corefer.
Clustering:
Grouping mentions into clusters, where each cluster represents a single entity or concept.
Replacement:
For the last message in the conversation (or for all messages succeeding the first), replace the phrases in each cluster by the common word that gives the entire picture.
Deployment:
Deploy the above setup as a part of ai-tools package such that it can be dockerized and then accessed through an API setup.
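The Replacement step above can be sketched in plain Python, independent of any particular model. This is a minimal illustration, assuming that the earlier steps have already produced clusters of character spans and that the first mention in each cluster is the one to substitute. The function name resolve_coreferences and the cluster representation are hypothetical, not the project's API.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_char, end_char), end exclusive


def resolve_coreferences(text: str, clusters: List[List[Span]]) -> str:
    """Replace every later mention in a cluster with the text of its first mention.

    Assumes the spans do not overlap; overlapping mentions would need extra handling.
    """
    replacements = []
    for cluster in clusters:
        if len(cluster) < 2:
            continue  # a single mention has nothing to be replaced with
        head_start, head_end = cluster[0]
        head_text = text[head_start:head_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, head_text))
    # Apply from right to left so that earlier character offsets stay valid.
    for start, end, head_text in sorted(replacements, reverse=True):
        text = text[:start] + head_text + text[end:]
    return text


conversation = (
    "Can you tell me where are the shops for paddy seeds? "
    "What is the price for them?"
)
# In a real pipeline these spans would come from mention detection and clustering;
# here they are located by hand for the example conversation.
head = conversation.index("paddy seeds")
pronoun = conversation.index("them")
clusters = [[(head, head + len("paddy seeds")), (pronoun, pronoun + len("them"))]]

resolved = resolve_coreferences(conversation, clusters)
print(resolved)
# Can you tell me where are the shops for paddy seeds? What is the price for paddy seeds?
```

Applying the replacements from right to left avoids keeping a running offset, since each edit only shifts text that has already been processed.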
Learning Path
Complexity
Hard
Skills Required
Python, NLP
Name of Mentors:
@GautamR-Samagra
Project size
8 Weeks
Product Set Up
See the setup here. The setup there is just an example, a baseline solution showing how it could be carried out. We are currently focusing this project on using the spaCy model to carry out coreference. Exploration of other techniques is encouraged only if this one fails.
Acceptance Criteria
Milestone
Reference
https://galhever.medium.com/a-review-to-co-reference-resolution-models-f44b4360a00
https://explosion.ai/blog/coref
https://github.com/explosion/spaCy/discussions/11585#discussioncomment-3970887
C4GT
This issue is nominated for Code for GovTech (C4GT) 2023 edition. C4GT is India's first annual coding program to create a community that can build and contribute to global Digital Public Goods. If you want to use Open Source GovTech to create impact, then this is the opportunity for you! More about C4GT here: https://codeforgovtech.in/
The scope of this ticket has now expanded to make it the 'enabling conversation' part of the 'FAQ bot'. The FAQ bot allows a user to provide content input in the form of CSVs, free text, PDFs, audio, and video, and the bot adds it to a 'Content DB'. The user can then interact with the bot via text/speech about related content, and the bot identifies relevant content using RAG techniques and responds to the user in a conversational manner.
This ticket covers the enabling conversation + small model deployment/fine-tuning pipeline. It includes the following tasks in its scope: