Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast

Test out BERTopic to get meaningful topic segmentations of a query dataset #291

Closed Gautam-Rajeev closed 6 months ago

Gautam-Rajeev commented 9 months ago

Goal:

Get an accurate list of topics (about 20 at most) for an agricultural dataset of roughly 20k unique queries using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.
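A minimal sketch of what this goal could look like in code, assuming the `bertopic` and `pandas` packages are installed and the dataset is available as a CSV. The helper names, the CSV path argument and the `min_topic_size` value are illustrative assumptions, not settings taken from this ticket.

```python
from typing import List


def clean_queries(raw: List[object]) -> List[str]:
    """Drop non-string/empty cells and case-insensitive duplicates, keeping order."""
    seen = set()
    docs = []
    for value in raw:
        if not isinstance(value, str):
            continue  # e.g. NaN cells produced when reading a CSV
        text = " ".join(value.split())  # collapse stray whitespace
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            docs.append(text)
    return docs


def fit_topics(docs: List[str], max_topics: int = 20):
    """Fit BERTopic and ask it to reduce the result to about `max_topics` topics.

    The import is local so this module loads without the optional dependency.
    """
    from bertopic import BERTopic

    # min_topic_size=50 is a guess for ~20k short queries and needs tuning.
    model = BERTopic(nr_topics=max_topics, min_topic_size=50)
    topics, _probs = model.fit_transform(docs)
    return model, topics


def run(csv_path: str):
    """Load the CSV (path supplied by the caller) and return the topic summary table."""
    import pandas as pd

    df = pd.read_csv(csv_path)
    docs = clean_queries(df["questioninEnglish"].tolist())
    model, _topics = fit_topics(docs)
    return model.get_topic_info()
```

Calling `run(...)` with the path to the shared CSV would return one row per topic. With BERTopic's default clustering, topic -1 collects the queries it treats as outliers, so that row is worth checking first.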

Description

Segregate the given dataset into topics using BERTopic. The quality of the clusters is difficult to measure, so for now it has to be inspected and verified manually. Any suggestions for measuring it better are welcome.

One can also use simple TF-IDF, Topic2vec or LDA if they form better clusters. Each query is a single-sentence question, not a paragraph.
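For the simple TF-IDF option, a dependency-free sketch might look like the following. It uses one common smoothed IDF variant and cosine similarity. The tokenizer and the example queries are illustrative assumptions, and a real run would more likely rely on a library vectorizer.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


queries = [
    "which pesticide for paddy stem borer",
    "pesticide dose for paddy pest",
    "how to apply for pm kisan scheme",
]
vecs = tfidf_vectors(queries)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because each query is a single short sentence, the vectors are very sparse, so two queries about the same topic that share no words score zero here. That is one reason an embedding-based approach may form better clusters.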

Implementation Details

It'll include the following :

Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. You can ask questions and propose solutions in the comments. The relevant points and the ticket will be assigned to the best PR raised.

Other links

Medium

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

Python, BERT, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Low

c4gt-community-support[bot] commented 9 months ago

Hi! Important Details - These following details are helpful for contributors to effectively identify and contribute to tickets.

Please update the ticket

vilol-04 commented 9 months ago

I would like to work on the issue @GautamR-Samagra

Gautam-Rajeev commented 9 months ago

@vilol-04 Thanks, I have given everyone access to the dataset. Do comment or raise a PR once you get significant results.

masterismail commented 9 months ago

Hi @GautamR-Samagra! The cookbook you linked is on Medium, and it is only available to paid Medium members.

Gautam-Rajeev commented 9 months ago

Oh sorry, their home documentation is also pretty instructive

aish7iitkgp commented 9 months ago

I would like to work on this issue @GautamR-Samagra

masterismail commented 9 months ago

Yeah! While keeping that handy, I'm currently running an analysis here and am struggling with the names of government schemes and the presence of Hinglish. Any recommendations for preprocessing?

TakshPanchal commented 9 months ago

Hey @GautamR-Samagra, I was doing EDA on the data here. We could try different models, but I think the embedding model has to be fine-tuned first. So I wondered: is there a bigger corpus of this type of text, where abbreviations are used in an Indian context?

Gautam-Rajeev commented 9 months ago

@masterismail @TakshPanchal I have tried to clean up the queries a bit, at least removing the Odia questions.

I have reshared the dataset here: predicted_values3.csv

For the short forms and the names of schemes, fertilizers, pesticides and so on, I will need the program team's help to get those word lists. I will update here once I have them.
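One way the Odia clean-up could be approximated is a script-based filter on the Unicode block used for Odia (U+0B00 to U+0B7F). This is only a sketch. The function names and the 0.3 threshold are assumptions, and it does not catch Odia or Hindi written in Latin letters.

```python
from typing import List

# Unicode block "Oriya", which covers the Odia script.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F


def odia_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Odia block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    odia = sum(1 for ch in chars if ODIA_START <= ord(ch) <= ODIA_END)
    return odia / len(chars)


def drop_odia(queries: List[str], max_ratio: float = 0.3) -> List[str]:
    """Keep queries whose share of Odia-script characters is at most `max_ratio`."""
    return [q for q in queries if odia_ratio(q) <= max_ratio]


sample = [
    "How to control stem borer in paddy?",
    "\u0b27\u0b3e\u0b28 \u0b1a\u0b3e\u0b37",  # a query written in Odia script
]
print(drop_odia(sample))
```

Counting all non-whitespace characters, rather than only letters, keeps Odia vowel signs in the tally, since those are combining marks and not alphabetic characters.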

emharsha1812 commented 9 months ago

Hello! @GautamR-Samagra If this issue is still open, I would like to work on it.

Gautam-Rajeev commented 9 months ago

We have some scheme names and pesticide names:
Schemes 1: Link
Schemes 2: Link

Crop-pesticide mapping:
Expert Committee Recommendations _2021-22.pdf

These are not well-structured names in a column as we would want, but such is work :) I'll try parsing them and share a table version in 2 days.

I tried clustering on my end here. The smaller clusters come out fairly well formed, but the bigger clusters mix scheme (PM-Kisan) and paddy pesticide queries, which is not good for us.

Update on my own clustering attempt: on the overall data, decent small clusters are formed, but the 'anomaly cluster' and the first 1-2 big clusters are bad. They mix scheme and paddy pest queries.

Also, it looks like all the 'Hinglish' and 'Odinglish' queries somehow ended up in a single cluster for me.

In the initial attempt, most clusters formed around crop names: for a given crop (say wheat), all its questions, such as cultivation and pest questions, were clustered together. What we actually want, however, is the set of question types being asked. Example types:

I want to find a finite list of question types like the above that covers 95% of queries. Maybe we need to do something else to get there. Any thoughts? @TakshPanchal @masterismail
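To make the 95% target checkable, one could compute how many of the largest clusters are needed to reach it. The sketch below assumes one integer cluster label per query and that -1 marks outliers (the convention HDBSCAN-style clustering uses). The function name is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def topics_for_coverage(
    labels: Sequence[int], target: float = 0.95, outlier_label: int = -1
) -> Tuple[List[int], float]:
    """Pick the largest clusters until they cover `target` of all queries.

    Outliers stay in the denominator because they are queries no topic explains.
    """
    total = len(labels)
    counts = Counter(label for label in labels if label != outlier_label)
    chosen: List[int] = []
    covered = 0
    for label, size in counts.most_common():
        if total and covered / total >= target:
            break
        chosen.append(label)
        covered += size
    coverage = covered / total if total else 0.0
    return chosen, coverage


# Toy labels: three clusters plus four outliers out of 100 queries.
labels = [0] * 60 + [1] * 30 + [2] * 6 + [-1] * 4
print(topics_for_coverage(labels))
```

If the returned coverage stays below the target even after all clusters are chosen, the outlier cluster is too large for any finite topic list to reach 95%.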

In my notebook, I also tried removing all crop names (I used a hard-coded list of common crop names and replaced each with 'crop') and reclustering to surface these question types. That gave somewhat better types, but fertilizer and pest names are still present, and the big ugly clusters are still being formed.
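The replacement step could be generalised to several word lists at once, roughly as sketched below. The term lists and placeholder names are made-up examples, not the actual lists from the notebook or the program team.

```python
import re
from typing import Callable, Dict, Iterable


def build_masker(term_lists: Dict[str, Iterable[str]]) -> Callable[[str], str]:
    """Return a function that replaces known domain terms with placeholders.

    `term_lists` maps a placeholder (e.g. "CROP") to the terms it should hide.
    """
    patterns = []
    for placeholder, terms in term_lists.items():
        # Longest terms first so multi-word names win over their prefixes.
        ordered = sorted({t.lower() for t in terms if t.strip()}, key=len, reverse=True)
        if not ordered:
            continue
        alternation = "|".join(re.escape(t) for t in ordered)
        # \b restricts matches to whole words; matching ignores case.
        patterns.append((re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE), placeholder))

    def mask(text: str) -> str:
        for pattern, placeholder in patterns:
            text = pattern.sub(placeholder, text)
        return text

    return mask


# Hypothetical word lists, for illustration only.
mask = build_masker({
    "CROP": ["paddy", "wheat"],
    "PESTICIDE": ["chlorpyrifos"],
})
print(mask("Chlorpyrifos dose for wheat"))
```

Whole-word matching means plurals and misspellings (for example "tomatoes" against a listed "tomato") pass through unmasked, so the word lists would need those variants added explicitly.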

kartikbhtt7 commented 8 months ago

Hello @GautamR-Samagra, can you assign this issue to me? I would like to give it a try. Also, where can I DM you?

Gautam-Rajeev commented 8 months ago

Discord ID: gautam28
Gmail: gautam@samagragovernance.in

Gautam-Rajeev commented 8 months ago

Here is a list of common pests and pesticides to remove before clustering. [Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…]()

Naveenpoliasetty commented 8 months ago

Hey @GautamR-Samagra, can I have a try?

1DevVrat1 commented 8 months ago

Hello @GautamR-Samagra Sir. I have worked on the above problem and created a Google Colab notebook containing a model that clusters the given queries into topics and writes the result to a new CSV file. Please let me know the next step I need to take to get the ticket.

kartikbhtt7 commented 8 months ago

The link to the list of common pests and pesticides ([Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx) redirects back to this issue instead of opening the table. I tried clustering with multiple vectorizer algorithms and also with 'recobo/agri-sentence-transformer', which seems to work better. I will try once more after replacing the pesticide names with a placeholder keyword. Could you share any ideas on how to collect the common pesticide names? I checked the PDF that was provided, but it lists a lot of pesticide names across 119 pages, and extracting them seemed hard. I have also DM'ed you on Discord; my ID is smokey (smokey_d_scraper).

Gautam-Rajeev commented 8 months ago

Reuploading the Excel file; the last one seems to be a broken link. Thanks @kartikbhtt7

[Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx

Sameer-Pal commented 8 months ago

@GautamR-Samagra is this issue still accepting PRs? Can I work on it?

aditisingh2912 commented 7 months ago

Hi, I want to take up this task on Topic Modelling @GautamR-Samagra

HasanZaigam commented 7 months ago

Hello @GautamR-Samagra, could you please assign me this issue? I'll work on it with the best approach and try to fix it as quickly as possible. Thank you.

Jatayu-u commented 6 months ago

Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 agricultural topics, and then form clusters accordingly.

My approach is to use a large language model such as gpt-3.5-turbo for few-shot multi-class classification. I will attempt this if you give me the go-ahead.

Thank you
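As a rough idea of the few-shot setup, the prompt could be assembled like this. It is only a sketch: the topic labels and example queries are invented placeholders, and the actual model call is left out on purpose.

```python
from typing import List, Sequence, Tuple


def build_few_shot_prompt(
    labels: Sequence[str], examples: Sequence[Tuple[str, str]], query: str
) -> str:
    """Assemble a plain-text few-shot classification prompt."""
    lines: List[str] = ["Classify the farmer query into exactly one of these topics:"]
    lines += [f"- {label}" for label in labels]
    lines.append("")
    for text, label in examples:
        if label not in labels:
            raise ValueError(f"example label {label!r} is not in the label list")
        lines += [f"Query: {text}", f"Topic: {label}", ""]
    lines += [f"Query: {query}", "Topic:"]
    return "\n".join(lines)


def parse_label(reply: str, labels: Sequence[str], fallback: str = "other") -> str:
    """Map a model reply back to a known label, or `fallback` if it matches none."""
    cleaned = reply.strip().lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return fallback


# Hypothetical labels and example, for illustration only.
topics = ["pest control", "scheme"]
shots = [("Which spray for stem borer?", "pest control")]
print(build_few_shot_prompt(topics, shots, "When will PM-Kisan money come?"))
```

Restricting replies to a fixed label list, with a fallback for anything else, keeps the output usable as cluster assignments even when the model answers off-list.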

Gautam-Rajeev commented 6 months ago

This is closed for now here.

Gautam-Rajeev commented 6 months ago

This issue has been closed by PR #316