Open KarenJewell opened 2 years ago
Had a little look into this. Doing unsupervised topic modelling in R:
For easier integration, this may be best written as a Python script, perhaps as a step after merged_data.py.
Not sure the category will always be identified consistently, so I'll compare the manual categories against the topic model output to assess this.
No limitation to just Python. If R is the best language to use to implement it then feel free to use R 😊
Thanks for the steer on this @JackGilmore. Obviously, R is always the best language 😉 but we'll see. Still quite far off preparing a PR and working out exactly where it'll fit with the existing process. But an interesting rabbit hole, topic modelling for the win! 🥇
Just wanted to make sure I wasn't giving you a bum steer after I mentioned last weekend we'd abandoned my wonderful C# code to use Python instead 😉. Looks like some promising work so far. I'm looking forward to seeing more!
@KarenJewell @JackGilmore Suggest that, instead of topic modelling, we could use Hugging Face 'open source' AI to auto-categorize the title + description strings based on the default categories.
ODSCategories.json would still provide the broad categories required, but there would be no need to maintain a keyword list for each category.
Pros:
Cons:
I've tested out some pseudo code which works in principle:
```python
from hugchat import hugchat
from hugchat.login import Login


# The function name is illustrative; the original snippet only showed the body.
def categorise_dataset(str_title_description):
    email = "your@email.com"  # ....pass in secret variable from github
    passwd = "your_password"  # ...pass in secret variable from github
    sign = Login(email, passwd)
    cookies = sign.login()
    sign.saveCookies()  # Save cookies to usercookies/<email>.json

    # Create a ChatBot
    chatbot = hugchat.ChatBot(cookies=cookies.get_dict())

    # Create prompt (adjacent string literals are joined into one string)
    prompt = (
        "Using only the following categories: 'Food and Environment', 'Council and Government', "
        "'Elections / Politics', 'Planning and Development', 'Housing and Estates', "
        "'Parks / Recreation', 'Sport and Leisure', 'Education', 'Transportation', "
        "'Law and Licensing', 'Business and Economy', 'Arts / Culture / History', 'Tourism', "
        "'Budget / Finance', 'Health and Social Care', 'Public Safety', "
        "tell me very briefly only the categories that best match the following description: '"
        + str_title_description + "'."
    )

    # Return string from chatbot which should contain the categories we wish to identify,
    # even if the original string (title + description) didn't mention the categories:
    bot_categories = chatbot.chat(prompt, is_retry=True, retry_count=5)
    # match_categories: helper that extracts the known category names from the reply
    categories_result = match_categories(bot_categories)
    return categories_result
```
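The `match_categories` helper called in the snippet above isn't defined anywhere in this thread. A minimal sketch of what it might look like, assuming it only needs to scan the chatbot's reply for the known category names (the `CATEGORIES` list is copied from the prompt; the function body and the `_normalise` helper are assumptions, not existing project code):

```python
import re

# Category names as listed in the prompt; in practice these would
# presumably be loaded from ODSCategories.json.
CATEGORIES = [
    "Food and Environment", "Council and Government", "Elections / Politics",
    "Planning and Development", "Housing and Estates", "Parks / Recreation",
    "Sport and Leisure", "Education", "Transportation", "Law and Licensing",
    "Business and Economy", "Arts / Culture / History", "Tourism",
    "Budget / Finance", "Health and Social Care", "Public Safety",
]


def _normalise(text):
    """Lowercase, drop spaces around slashes, and collapse whitespace."""
    text = re.sub(r"\s*/\s*", "/", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def match_categories(bot_reply, categories=CATEGORIES):
    """Return the known categories mentioned in the reply, in list order."""
    # str() is applied in case the reply object is not a plain string
    reply = _normalise(str(bot_reply))
    return [c for c in categories if _normalise(c) in reply]
```

Plain substring matching like this ignores anything the chatbot says beyond the category names, so a chatty reply should still work, but it would also pick up a category the bot mentions only to rule it out.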
Ideally we would identify the dataset category using keywords in dataset title and description.
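For comparison with the approaches discussed above, a rough sketch of the keyword idea in Python. The keyword lists here are made up purely for illustration, and `categorise_by_keywords` is a hypothetical name, not something that exists in the repo:

```python
import re

# Hypothetical keyword lists, for illustration only; real lists would
# need to be maintained for every category in ODSCategories.json.
CATEGORY_KEYWORDS = {
    "Education": ["school", "pupil", "teacher"],
    "Transportation": ["bus", "road", "cycle", "parking"],
    "Parks / Recreation": ["park", "playground"],
}


def categorise_by_keywords(title, description, keywords=CATEGORY_KEYWORDS):
    """Return every category with at least one keyword in the title or description."""
    # Split the combined text into lowercase whole words
    words = set(re.findall(r"[a-z]+", f"{title} {description}".lower()))
    return [cat for cat, kws in keywords.items() if words & set(kws)]
```

This matches whole words only, so "parks" would not match the keyword "park" unless plurals are added to the lists or the words are stemmed first.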
Suggested reading: Topic modelling