Improving NER - Githubissues

Gautam-Rajeev commented 9 months ago

Goal :

We want to improve our NER model to include entities that come out of this

The overall goal is to extract any relevant entity from a question, and to recognize which entity type it is, so that it can help us with a search.

Current state :

The model is created based on this dataset using this code.

The next steps are to extract the data from the PDF/CSV, create sentences in the required format (the same as the dataset above), and then train the model. The 30k queries provided in the other ticket can also be used for this.

Some pre-decided entities are also here :

Entity type	Description	example	Supported languages - ISO 639-1 code	Model type
Time	Detect time from given text.	tomorrow morning at 5am	'en', 'hi', 'or'	Simple regex
Date	Detect date from given text	next monday, agle somvar	'en', 'hi', 'or'	Regex + neighbor word
Number	Detect number and respective units in given text	50 rs per person	'en', 'hi', 'or'	Simple regex
Phone number	Detect phone number in given text	9833530536	'en', 'hi', 'or'	Regex
Email	Detect email in text	chakshu@samagragovernance.in	'en', 'hi', 'or'	Regex
Text	Detect custom entities in text string using full text search in Datastore.	How much dosage of imidacloprid is required?	'en', 'hi', 'or'	search
regex	Detect entities using custom regex patterns	My flight PNR is 4SGX3E	NA	Regex
Location	Detect district, village, block and connect back to geoIP API	Is this available in Kollam ?	'en', 'hi', 'or'	Distilbert + search on known places
scheme names	Detect scheme names	Am I eligible for PM-Kisan	'en', 'hi', 'or'	Search for custom name
Symptom of pest disease	Detect symptoms of a pest or disease in a crop	My leaf is withering and yellowing	'en', 'hi', 'or'	distilbert
crop name	Detect the name of any crop	My paddy is withering	'en', 'hi', 'or'	distilbert + search
pest name	Detect the name of pest attacking a crop	How to deal with aphids	'en', 'hi', 'or'	distilbert + search

We need to have a common model that is able to detect all these entity types. We should be able to input a sentence and get back the entities detected for the sentence.
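
As a rough illustration of the "search" model type in the table above, a lookup over lists of known names could be sketched as follows. The term lists and the function name lookup_entities are hypothetical placeholders; in practice the names would come from the project's own data, and this is not the implementation used in the linked code.

```python
import re

# Hypothetical gazetteers for illustration; real lists would come from project data.
KNOWN_TERMS = {
    "crop_name": ["paddy", "wheat", "maize"],
    "pest_name": ["aphids", "stem borer"],
}


def lookup_entities(sentence):
    """Return (entity_type, matched_text, start, end) for each known term found."""
    found = []
    for entity_type, terms in KNOWN_TERMS.items():
        for term in terms:
            # Whole-word, case-insensitive match of the known term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                found.append(
                    (entity_type, match.group(0), match.start(), match.end())
                )
    # Order results by where they appear in the sentence.
    return sorted(found, key=lambda entity: entity[2])


if __name__ == "__main__":
    print(lookup_entities("My paddy is withering, how to deal with aphids?"))
```

Returning character offsets along with the entity type makes it easier to merge these results with the output of a token-classification model later.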

basedsaksham commented 8 months ago

@GautamR-Samagra hi, can I please get access to the datasets? I'd like to contribute to this issue.

Gautam-Rajeev commented 8 months ago

Hi @basedsaksham, the idea is to treat NER as a model that does multiple things under the hood:

It can be a seq-seq NER model based on this dataset. Have written some code to train such models here, can use that as a starting point.
It could be a simple regex-based operation to get other entities out, e.g. an email can be recognized by @ followed by a domain name. I want a repo that combines all these.

As an input I just pass an argument that lists the entities I want to extract; the model then uses either regex or the seq-seq model to extract those entities.
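
A minimal sketch of the regex side of this idea, assuming illustrative patterns for email and phone number. The patterns, the REGEX_PATTERNS table and the function name extract_regex_entities are assumptions for the example, not the regexes used in the linked notebooks. Entity types without a regex are simply skipped here; in the combined repo they would be handed to the seq-seq model.

```python
import re

# Illustrative patterns only; assumptions, not the project's actual regexes.
REGEX_PATTERNS = {
    # Something before "@", then a domain with at least one dot.
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # Assumes 10-digit mobile numbers such as 9833530536.
    "phone_number": re.compile(r"(?<!\d)[6-9]\d{9}(?!\d)"),
}


def extract_regex_entities(sentence, ner_entities):
    """Extract only the requested entity types that have a regex defined."""
    results = {}
    for name in ner_entities:
        pattern = REGEX_PATTERNS.get(name)
        if pattern is None:
            # No regex for this type; a learned model would handle it instead.
            continue
        results[name] = pattern.findall(sentence)
    return results


if __name__ == "__main__":
    text = "Mail chakshu@samagragovernance.in or call 9833530536"
    print(extract_regex_entities(text, ["email", "phone_number", "crop_name"]))
```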

basedsaksham commented 8 months ago

Hey @GautamR-Samagra, I have written code that detects the required entities such as time, email, phone number, and number and unit, and also predicts the extracted time (in both Hindi and English) using regex. https://colab.research.google.com/drive/1DAg0xKBYMnXXcQzFwBK2aQppoyoddj1X?authuser=1#scrollTo=6qNzUpRqjnrS This is what I have done so far. I am working on integrating all of the things mentioned above in the issue.

Gautam-Rajeev commented 8 months ago

unable to open it

basedsaksham commented 8 months ago

please try now

adityathenerd commented 8 months ago

Hey @GautamR-Samagra, I worked on the NER notebook that you had given and tried to add crop_symptoms to it along with crop_name and crop_disease. https://colab.research.google.com/drive/1SbbM0UG18a65mrFnILMQdquLS5BB_0e5?usp=sharing This is what I have done so far. Let me know how to proceed.

Gautam-Rajeev commented 8 months ago

@adityathenerd let me know if you were able to fix issues with it. still seeing

Gautam-Rajeev commented 8 months ago

@adityathenerd and @basedsaksham, you have worked on separate aspects of it: @adityathenerd on the NER model using DistilBERT and @basedsaksham on the regex part.

We must integrate both parts into a new module, ner --> agri_ner, inside ai-tools.

Proposed folder structure

The structure should mirror the existing model setup, such as that for text classification, but with an extra file for each kind of NER we do.

Folder structure can look like this :

ai-tools/
└── ner
    └── agri_ner
        ├── Dockerfile
        ├── README.md
        ├── api.py
        ├── model.py
        ├── request.py
        ├── bert_ner.py
        ├── regex_parse_ner.py
        └── lookup_ner.py

request.py defines the request that can be given through the API, and the api file will serve as the entry point for the NER functionality. The request should have at least the following arguments:
- sentence: the input sentence for which the entities have to be recognized
- ner_entities: a list of entity types to be recognized. Example values could include ['crop_name', 'pest_name', 'email', 'numbers']. Additionally, an 'all' value should be supported to indicate that all available entity types should be recognized.

model.py should pull in the 'models' from the other files (bert_ner, regex_parse_ner and lookup_ner) and combine them to get all required entities for 'sentence'.
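
One possible shape for this, as a sketch only. The names NERRequest, EXTRACTORS and run_ner are hypothetical, and the two extractors are tiny stand-ins for what regex_parse_ner and lookup_ner would provide; a DistilBERT-based extractor from bert_ner could be registered in the same table.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class NERRequest:
    """Hypothetical request shape: a sentence plus the entity types wanted."""
    sentence: str
    ner_entities: List[str] = field(default_factory=lambda: ["all"])


def _extract_email(sentence: str) -> List[str]:
    # Stand-in for a regex_parse_ner extractor.
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", sentence)


def _extract_crop_name(sentence: str) -> List[str]:
    # Stand-in for a lookup_ner extractor; the crop list is made up.
    crops = ["paddy", "wheat", "maize"]
    return [c for c in crops if re.search(rf"\b{c}\b", sentence, re.IGNORECASE)]


# Registry mapping entity type to the extractor that handles it.
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {
    "email": _extract_email,
    "crop_name": _extract_crop_name,
}


def run_ner(request: NERRequest) -> Dict[str, List[str]]:
    """Run the extractors for the requested entity types ('all' means every type)."""
    if "all" in request.ner_entities:
        wanted = list(EXTRACTORS)
    else:
        wanted = request.ner_entities
    unknown = [name for name in wanted if name not in EXTRACTORS]
    if unknown:
        raise ValueError(f"Unsupported entity types: {unknown}")
    return {name: EXTRACTORS[name](request.sentence) for name in wanted}


if __name__ == "__main__":
    print(run_ner(NERRequest("Is paddy ok? Mail a@b.in", ["all"])))
```

Keeping a single registry keyed by entity type means the caller does not need to know whether a type is handled by regex, lookup or a trained model.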

Do collaborate with each other and make a PR to ai-tools on this.

regex NER here

adityathenerd commented 7 months ago

Hey @GautamR-Samagra, I found out what the problem was with this. The dataset didn't have enough pest-related tags, so the model was not able to predict those well. I am working on adding 7-8 more pest-related sentences to the dataset. It should work fine now. Will update by EOD.

adityathenerd commented 7 months ago

Model Link. Check now, @GautamR-Samagra.

Gautam-Rajeev commented 7 months ago

@Shubh-Goyal-07 can you link your PR here

Shubh-Goyal-07 commented 7 months ago

The PR for the same has been made here: https://github.com/Samagra-Development/ai-tools/pull/317

@GautamR-Samagra

Samagra-Development / ai-tools

Improving NER #294