Closed Gautam-Rajeev closed 6 months ago
Hi! Important details: the following details help contributors identify and contribute to tickets effectively.
Please update the ticket
I would like to work on the issue @GautamR-Samagra
@vilol-04 Thanks, have given access to all for the dataset. Do raise comments/PR when you are able to get significant results.
Hi @GautamR-Samagra! The cookbook you mentioned is on Medium, and it's only available to paid Medium members.
Oh sorry, their home documentation is also pretty instructive
I would like to work on this issue @GautamR-Samagra
Yeah! While keeping that handy, I'm currently analysing the data here and struggling with the names of government schemes and the presence of Hinglish. Any recommendations for preprocessing?
Hey @GautamR-Samagra, I was doing EDA on the data here. We could try different models, but I think the embedding model has to be fine-tuned first. So I wondered: is there a bigger corpus of this type of text where abbreviations are used in an Indian context?
@masterismail @TakshPanchal I have tried to clean up the queries a bit, removing the Odia questions at least.
I have reshared the dataset here: predicted_values3.csv
For the short forms and the names of schemes, fertilizers and pesticides, I will need the program team's help to get those word lists. I will update here once I get them.
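A cleanup like the Odia removal mentioned above could be approximated as follows. This is only a minimal sketch, assuming the Odia questions are written in Odia script (Unicode block U+0B00 to U+0B7F); it would not catch Odia written in Latin letters. The function names and the sample queries are hypothetical and not taken from the dataset.

```python
import re

# Odia script occupies the Unicode block U+0B00 to U+0B7F.
ODIA_CHARS = re.compile(r"[\u0B00-\u0B7F]")


def contains_odia(text: str) -> bool:
    """Return True if the text has at least one Odia-script character."""
    return bool(ODIA_CHARS.search(text))


def drop_odia_queries(queries):
    """Keep only the queries that contain no Odia-script characters."""
    return [q for q in queries if not contains_odia(q)]


# Hypothetical sample queries, for illustration only.
sample = ["How to control stem borer in paddy?", "ଧାନ ଚାଷ କିପରି କରିବେ?"]
english_only = drop_odia_queries(sample)
```

Mixed-script queries are dropped entirely here; whether that is desirable depends on how much usable English they contain.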
Hello! @GautamR-Samagra If this issue is still open, I would like to work on it.
We have some scheme names and pesticide names : Schemes 1: Link Schemes 2: Link
Crop-pesticide mapping :
Expert Committee Recommendations _2021-22.pdf
These are not well structured names in a column as we want, but such is work :) I'll try parsing and share a table version in 2 days.
I tried clustering on my end here. The smaller clusters come out fairly well formed, but the bigger clusters mix scheme (PM-Kisan) and paddy pesticide queries, which is not good for us.
Update on my own clustering attempt: on the overall data, decent small clusters are being formed, but the 'anomaly cluster' and the first 1-2 big clusters are bad. They mix scheme and paddy pest data, which is not good.
Also, it looks like all the 'Hinglish' and 'Odinglish' queries somehow ended up in a single cluster for me.
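To put numbers on the "big cluster" and "anomaly cluster" problem described above, a small size report can help. This is a rough sketch, assuming the outlier cluster is labelled -1 (the convention BERTopic uses for outliers); the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cluster_size_report(labels, outlier_label=-1):
    """Summarise how much of the data sits in outliers and in the largest cluster."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    # Remove the outlier cluster so it is not counted as a real topic.
    outliers = counts.pop(outlier_label, 0)
    if counts:
        largest_label, largest_size = counts.most_common(1)[0]
    else:
        largest_label, largest_size = None, 0
    return {
        "n_clusters": len(counts),
        "outlier_share": outliers / total,
        "largest_cluster": largest_label,
        "largest_share": largest_size / total,
    }


# Hypothetical cluster labels, one per query.
labels = [-1, -1, 0, 0, 0, 0, 1, 1, 2, 3]
report = cluster_size_report(labels)
```

A high `outlier_share` or `largest_share` is a hint that those clusters need a manual look, not proof that they are badly formed.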
In the initial attempt, most clusters formed around crop names: for a given crop (say wheat), all of its questions, such as cultivation and pest questions, were clustered together. The goal, however, is to cluster by the type of question being asked. Example types:
I want to find a finite list of such questions as above that cover 95% of queries. Maybe we need to do something else to get there. Any thoughts? @TakshPanchal @masterismail
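Once every query has a topic label, the 95% target above can be checked directly. The sketch below picks the largest topics until they cover the target share of queries; the function name and the sample labels are hypothetical, and it assumes outlier queries have either been removed or are counted as a topic of their own.

```python
from collections import Counter


def topics_needed_for_coverage(labels, target=0.95):
    """Return the largest topics, in order, that together cover `target` of the queries."""
    counts = Counter(labels)
    total = sum(counts.values())
    covered = 0
    chosen = []
    # Walk the topics from largest to smallest and stop once the target is reached.
    for label, size in counts.most_common():
        chosen.append(label)
        covered += size
        if covered / total >= target:
            break
    return chosen


# Hypothetical topic labels for 100 queries.
labels = ["pest"] * 50 + ["scheme"] * 30 + ["weather"] * 16 + ["other"] * 4
needed = topics_needed_for_coverage(labels)
```

If the returned list is much longer than about 20 topics, the clustering is probably too fine-grained for the goal stated in this ticket.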
In my notebook, I also tried removing all crop names (using a hard-coded list of common crop names and replacing each with 'crop') and then reclustering to get these types of questions. That gave somewhat better types, but fertilizer names and pest names are still there, and the issue of big ugly clusters being formed remains.
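The crop-name replacement described above can be sketched like this. It is not the notebook's actual code: the crop list is a short hypothetical stand-in for the hard-coded list, and `build_masker` is a made-up helper name. The same helper could be reused for pesticide or fertilizer names with a different placeholder.

```python
import re

# Hypothetical stand-in for the hard-coded list of common crop names.
CROP_NAMES = ["paddy", "wheat", "maize", "brinjal"]


def build_masker(terms, placeholder):
    """Build a function that replaces any whole-word term with the placeholder."""
    if not terms:
        raise ValueError("terms must not be empty")
    # Longest terms first, so multi-word names win over their shorter parts.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return lambda text: pattern.sub(placeholder, text)


mask_crops = build_masker(CROP_NAMES, "crop")
masked = mask_crops("How to control stem borer in Paddy and wheat?")
```

Whole-word matching means plurals and misspellings are left untouched, so the word list would need to include those variants explicitly.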
Hello @GautamR-Samagra, can you assign this issue to me? I would like to give it a try. Also, where can I DM you?
Discordid: gautam28 gmail: gautam@samagragovernance.in
Here is a list of common pests and pesticides to remove before clustering. Uploading [Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx…
Hey @GautamR-Samagra, can I have a try?
Hello @GautamR-Samagra Sir. I have worked on the above problem and created a Google Colab notebook containing a model that clusters the given queries into topics and returns a new CSV file. Please let me know the next step I need to take to get the ticket.
The link to the pesticide list redirects back to this same issue instead of opening any table of pesticides. I tried clustering with multiple vectorizer algorithms and also with 'recobo/agri-sentence-transformer', which seems to work better. I'll try once more after replacing the pesticide names with a placeholder keyword. Could you please share any ideas on how to collect the common pesticide names? I checked the PDF that was provided, but it has a lot of pesticide names (119 pages), and extracting them seemed quite hard. I have also DM'ed you on Discord; my ID is smokey (smokey_d_scraper).
Re-uploading the Excel file; the last one seems to be a broken link. Thanks @kartikbhtt7
[Tables]Expert Committee Recommendations _2021-22 (1).pdf (1).xlsx
@GautamR-Samagra is this issue still accepting PRs? Can I work on it?
Hi, I want to take up this task on Topic Modelling @GautamR-Samagra
Hello @GautamR-Samagra, could you please assign me this issue? I'll work on it with the best approach and try to fix it as quickly as possible. Thank you.
Hello @GautamR-Samagra, is this issue still open? I want to work on it. My understanding of the problem is that we have to classify each question into one of 20 different agricultural topics, and then we can form clusters accordingly.
My approach is to use a large language model like gpt-3.5-turbo for few-shot multi-class classification. I will try this if you give me the go-ahead.
Thank you
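For illustration, the few-shot classification idea above needs a prompt that lists the topics and a handful of labelled examples. The sketch below only builds that prompt text and makes no API call; the topic names, the example queries and the function name are all hypothetical.

```python
def build_few_shot_prompt(topics, examples, query):
    """Build a plain-text few-shot prompt asking for exactly one topic label."""
    lines = ["Classify the farmer query into exactly one of these topics:"]
    lines += [f"- {topic}" for topic in topics]
    lines.append("")
    # Each labelled example shows the model the expected output format.
    for text, label in examples:
        lines.append(f"Query: {text}")
        lines.append(f"Topic: {label}")
        lines.append("")
    lines.append(f"Query: {query}")
    lines.append("Topic:")
    return "\n".join(lines)


# Hypothetical topics and labelled examples.
topics = ["Pest and disease control", "Government schemes", "Weather"]
examples = [
    ("Which pesticide for stem borer in paddy?", "Pest and disease control"),
    ("When will the PM-Kisan installment come?", "Government schemes"),
]
prompt = build_few_shot_prompt(topics, examples, "Will it rain this week?")
```

Note that this approach needs the topic list up front, so it complements rather than replaces the clustering step that discovers the topics.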
This is closed for now here.
This issue has been closed by PR #316
Goal:
Get an accurate list of topics (around 20 topics max) for an agri dataset of queries (around 20k unique queries) using BERTopic. Only the 'questioninEnglish' column is relevant for the analysis.
Description
Be able to segregate the given dataset into topics using BERTopic. The veracity of the clusters is difficult to measure, and for now it will have to be observed and verified manually. Any suggestions to measure this better are welcome.
One can also use simple TF-IDF, Topic2vec or LDA if they form better clusters. The queries are single-sentence questions, not paragraphs.
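Since the clusters currently have to be verified by eye, drawing a small random sample from each topic can make that manual check quicker and repeatable. This is a minimal sketch with hypothetical names and sample data; it works with labels from any of the methods mentioned above.

```python
import random
from collections import defaultdict


def sample_per_topic(queries, labels, k=5, seed=0):
    """Return up to k randomly chosen queries per topic for manual review."""
    if len(queries) != len(labels):
        raise ValueError("queries and labels must have the same length")
    by_topic = defaultdict(list)
    for query, label in zip(queries, labels):
        by_topic[label].append(query)
    # A fixed seed keeps the review sample the same across runs.
    rng = random.Random(seed)
    return {
        label: rng.sample(items, min(k, len(items)))
        for label, items in by_topic.items()
    }


# Hypothetical queries and topic labels.
queries = ["q1", "q2", "q3", "q4", "q5", "q6"]
labels = [0, 0, 0, 1, 1, -1]
review = sample_per_topic(queries, labels, k=2)
```

Reviewers can then mark each sampled query as fitting its topic or not, which gives a rough per-topic precision figure.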
Implementation Details
It'll include the following :
Anyone is welcome to begin work on the ticket; it will not be assigned to anyone in particular initially. You can ask questions and propose solutions through comments. The relevant points and the ticket will be assigned to the best PR raised.
Other links
Medium
Product Name
AI Tools
Organization Name
SamagraX
Domain
NA
Tech Skills Needed
Python, BERT, ML
Category
Feature
Mentor(s)
@GautamR-Samagra
Complexity
Low