Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Improving NER #294

Closed GautamR-Samagra closed 1 month ago

GautamR-Samagra commented 4 months ago

Goal :

We want to improve our NER model to include the entities that come out of this.

The overall goal is to extract any relevant entity from a question, and recognize which entity type it is, so that it can help us with search.

Current state :

The model is created based on this dataset using this code.

The next steps are to extract the data from the PDF/CSV, create the sentences in the required format (the same as the dataset above), and then train the model. The 30k queries provided in the other ticket can also be used for this.
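As a rough illustration of the "create the sentences in the required format" step, here is a minimal sketch. It assumes the dataset uses token-level BIO tags, which this thread does not confirm; the function name `to_bio` and the labels are hypothetical.

```python
from typing import Dict, List, Tuple


def to_bio(sentence: str, entities: Dict[str, str]) -> List[Tuple[str, str]]:
    """Tag a whitespace-tokenized sentence with BIO labels.

    `entities` maps an entity phrase to its label, e.g. {"paddy": "CROP"}.
    Assumption: the target dataset uses BIO tags; adjust if it does not.
    """
    tokens = sentence.split()
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    for phrase, label in entities.items():
        words = phrase.lower().split()
        if not words:
            continue  # skip empty phrases
        # Slide a window over the tokens and tag every exact phrase match.
        for start in range(len(tokens) - len(words) + 1):
            if lowered[start:start + len(words)] == words:
                tags[start] = f"B-{label}"
                for i in range(start + 1, start + len(words)):
                    tags[i] = f"I-{label}"
    return list(zip(tokens, tags))


print(to_bio("My paddy is withering", {"paddy": "CROP"}))
```

Punctuation attached to a token (for example "paddy?") is not handled here and would need a proper tokenizer.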

Some pre-decided entities are also here :

| Entity type | Description | Example | Supported languages (ISO 639-1 code) | Model type |
| --- | --- | --- | --- | --- |
| Time | Detect time from given text | tomorrow morning at 5am | 'en', 'hi', 'or' | Simple regex |
| Date | Detect date from given text | next monday, agle somvar | 'en', 'hi', 'or' | Regex + neighbor word |
| Number | Detect number and respective units in given text | 50 rs per person | 'en', 'hi', 'or' | Simple regex |
| Phone number | Detect phone number in given text | 9833530536 | 'en', 'hi', 'or' | Regex |
| Email | Detect email in text | chakshu@samagragovernance.in | 'en', 'hi', 'or' | Regex |
| Text | Detect custom entities in text string using full text search in Datastore | How much dosage of imidacloprid is required? | 'en', 'hi', 'or' | Search |
| Regex | Detect entities using custom regex patterns | My flight PNR is 4SGX3E | NA | Regex |
| Location | Detect district, village, block and connect back to geoIP API | Is this available in Kollam ? | 'en', 'hi', 'or' | DistilBERT + search on known places |
| Scheme names | Detect scheme names | Am I eligible for PM-Kisan | 'en', 'hi', 'or' | Search for custom name |
| Symptom | Of pest disease | My leaf is withering and yellowing | 'en', 'hi', 'or' | DistilBERT |
| Crop name | Detect the name of any crop | My paddy is withering | 'en', 'hi', 'or' | DistilBERT + search |
| Pest name | Detect the name of pest attacking a crop | How to deal with aphids | 'en', 'hi', 'or' | DistilBERT + search |

We need to have a common model that is able to detect all these entity types. We should be able to input a sentence and get back the entities detected for the sentence.
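To make the "simple regex" rows concrete, here is a minimal sketch for two of them, Time and Number. The patterns, the unit list, and the function name `detect_simple_entities` are assumptions for illustration and cover English input only; they are not the patterns used in this repo.

```python
import re
from typing import Dict, List

# Hypothetical patterns: clock times such as "5am" or "5:30 pm", and
# numbers followed by a small, illustrative set of units.
TIME_RE = re.compile(r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b", re.IGNORECASE)
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:rs|kg|ml|l|g)\b", re.IGNORECASE)

PATTERNS = (("time", TIME_RE), ("number", NUMBER_RE))


def detect_simple_entities(text: str) -> List[Dict[str, str]]:
    """Return one {"type", "text"} dict per regex match, in pattern order."""
    found = []
    for entity_type, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append({"type": entity_type, "text": match.group()})
    return found


print(detect_simple_entities("tomorrow morning at 5am"))
print(detect_simple_entities("50 rs per person"))
```

Relative expressions such as "tomorrow morning", and Hindi or Odia input, would need additional patterns or the "neighbor word" approach listed for Date.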

basedsaksham commented 3 months ago

@GautamR-Samagra Hi, can I please get access to the datasets? I'd like to contribute to this issue.

GautamR-Samagra commented 3 months ago

Hi @basedsaksham, the idea is to treat NER as a model that does multiple things under the hood:

  1. It can be a seq-seq NER model based on this dataset. I have written some code to train such models here; you can use that as a starting point.
  2. It could be a simple regex-based operation to get other entities out, e.g. an email can be recognized by an @ followed by a domain name. I want a repo that combines all of these.

As input, I just pass an argument that lists the entities I want to extract, and the model uses either regex or the seq-seq model to extract those entities.
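One way this could look is a registry that maps each entity name to an extractor, sketched below. All names (`REGISTRY`, `extract_entities`, `model_extractor`) and both patterns are hypothetical, and the model-backed extractors are stubs that return nothing, since the trained model is not part of this sketch.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], List[str]]

# Illustrative patterns only; the phone pattern assumes 10-digit numbers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)\d{10}(?!\d)")


def model_extractor(label: str) -> Extractor:
    """Build an extractor that would call the trained model (stubbed here)."""
    def run(text: str) -> List[str]:
        # Placeholder: a real implementation would run the NER model and
        # keep only the spans tagged with `label`.
        return []
    return run


REGISTRY: Dict[str, Extractor] = {
    "email": EMAIL_RE.findall,
    "phone_number": PHONE_RE.findall,
    "crop": model_extractor("crop"),
    "pest": model_extractor("pest"),
}


def extract_entities(text: str, entity_types: List[str]) -> Dict[str, List[str]]:
    """Run only the requested extractors and group the results by entity name."""
    unknown = [name for name in entity_types if name not in REGISTRY]
    if unknown:
        raise ValueError(f"Unsupported entity types: {unknown}")
    return {name: REGISTRY[name](text) for name in entity_types}


print(extract_entities("Mail chakshu@samagragovernance.in or call 9833530536",
                       ["email", "phone_number"]))
```

With this shape the caller does not need to know whether an entity is handled by regex or by the model, which matches the single-entry-point idea above.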

basedsaksham commented 3 months ago

Hey @GautamR-Samagra, I have written code that uses regex to detect the required entities (time, email, phone number, number and unit) and also predicts the extracted time, in both Hindi and English. https://colab.research.google.com/drive/1DAg0xKBYMnXXcQzFwBK2aQppoyoddj1X?authuser=1#scrollTo=6qNzUpRqjnrS This is what I have done so far. I am working on integrating everything mentioned in the issue.

GautamR-Samagra commented 3 months ago

unable to open it

basedsaksham commented 3 months ago

please try now

adityathenerd commented 3 months ago

Hey @GautamR-Samagra, I worked on the NER notebook that you shared and tried to add crop_symptoms to it, along with crop_name and crop_disease. https://colab.research.google.com/drive/1SbbM0UG18a65mrFnILMQdquLS5BB_0e5?usp=sharing This is what I have done so far. Let me know how to proceed.

GautamR-Samagra commented 3 months ago

@adityathenerd let me know if you were able to fix issues with it. still seeing image

GautamR-Samagra commented 3 months ago

@adityathenerd and @basedsaksham, you have worked on separate aspects of this: @adityathenerd on the NER model using DistilBERT, and @basedsaksham on the regex part.

We must integrate both parts into a new module, ner/agri_ner, inside ai-tools.

Proposed folder structure

The structure should mirror the existing model setup, such as that for text classification, but with an extra file for each kind of NER we do.

Folder structure can look like this :

ai-tools/
└── ner
    └── agri_ner
        ├── Dockerfile
        ├── README.md
        ├── api.py
        ├── model.py
        ├── request.py
        ├── bert_ner.py
        ├── regex_parse_ner.py
        └── lookup_ner.py
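As a sketch of what lookup_ner.py might contain, the snippet below matches known names (schemes, places) against the text. The function names and the sample name lists are hypothetical; the real lists would come from the project's own data.

```python
import re
from typing import Dict, Iterable, List, Tuple


def build_lookup(names_by_type: Dict[str, Iterable[str]]) -> List[Tuple[str, "re.Pattern[str]"]]:
    """Compile one case-insensitive pattern per entity type.

    Assumes every entity type has at least one non-empty name.
    """
    compiled = []
    for entity_type, names in names_by_type.items():
        # Longest names first, so a longer name wins over its own prefix.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)
        compiled.append((entity_type, pattern))
    return compiled


def lookup_entities(text: str, lookup: List[Tuple[str, "re.Pattern[str]"]]) -> List[Tuple[str, str]]:
    """Return (entity_type, matched_text) pairs for every known name found."""
    return [(entity_type, match.group())
            for entity_type, pattern in lookup
            for match in pattern.finditer(text)]


lookup = build_lookup({"scheme": ["PM-Kisan"], "location": ["Kollam"]})
print(lookup_entities("Am I eligible for PM-Kisan", lookup))
```

Exact matching like this misses misspellings and transliterations; a fuzzy or full-text search could be swapped in behind the same function signature.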

Do collaborate with each other and make a PR to ai-tools on this.

regex NER here

adityathenerd commented 3 months ago

Hey @GautamR-Samagra, I found out what the problem was. The dataset didn't have enough pest-related tags, so the model was not able to predict them well. I am working on adding 7-8 more pest-related sentences to the dataset. It should work fine now. Will update by EOD.
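A quick way to spot this kind of imbalance is to count how many entities of each type the training data contains. The sketch below assumes each example is a (tokens, tags) pair with BIO tags, which may not match the actual dataset layout; the function name and labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_entity_tags(examples: Iterable[Tuple[List[str], List[str]]]) -> Counter:
    """Count entities per type by counting B- tags (one per entity span)."""
    counts: Counter = Counter()
    for _tokens, tags in examples:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


examples = [
    (["My", "paddy", "is", "withering"], ["O", "B-CROP", "O", "B-SYMPTOM"]),
    (["How", "to", "deal", "with", "aphids"], ["O", "O", "O", "O", "B-PEST"]),
]
print(count_entity_tags(examples))
```

Entity types with very low counts are the ones the model is likely to predict poorly.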

adityathenerd commented 3 months ago

Model link. Check now, @GautamR-Samagra.

GautamR-Samagra commented 2 months ago

@Shubh-Goyal-07 can you link your PR here?

Shubh-Goyal-07 commented 2 months ago

The PR for the same has been made here: https://github.com/Samagra-Development/ai-tools/pull/317

@GautamR-Samagra