OpenDataScotland / the_od_bods

Collating open data from across Scotland
MIT License

Auto-categorise datasets #172

Open KarenJewell opened 2 years ago

KarenJewell commented 2 years ago

Ideally we would identify the dataset category using keywords in dataset title and description.
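As a rough illustration of the keyword idea, here is a minimal sketch. The keyword lists and the function name `categorise_by_keywords` are made up for the example and are not taken from the project; the real lists would need to be agreed and maintained per category.

```python
import re

# Hypothetical keyword lists, for illustration only; the real lists would be
# maintained alongside the project's category definitions.
CATEGORY_KEYWORDS = {
    "Transportation": ["bus", "cycle", "parking", "road"],
    "Education": ["school", "pupil", "nursery"],
    "Health and Social Care": ["health", "care home", "hospital"],
}


def categorise_by_keywords(title, description):
    """Return the sorted category names whose keywords appear in the text."""
    text = f"{title} {description}".lower()
    matches = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Match whole words or phrases so "bus" does not fire on "business".
        if any(re.search(rf"\b{re.escape(k)}\b", text) for k in keywords):
            matches.append(category)
    return sorted(matches)


print(categorise_by_keywords("School bus routes", "Routes for pupils"))
```

Whole-word matching is deliberately strict here, so plurals such as "roads" would need their own entries or a stemming step.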

Suggested reading: Topic modelling

fozy81 commented 1 year ago

Had a little look into this. Doing unsupervised topic modelling in R:

image

For easier integration, it may be best written as a Python script, perhaps as a step after merged_data.py.

Not sure the category will always be identified consistently, so I'll compare the manual categories against the topic model output to assess this.
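One simple way to make that comparison concrete is to count how often the two labels agree and list the rows where they differ. This is only a sketch: the function names and the sample labels are invented for illustration, and it assumes one category per dataset.

```python
def agreement_rate(manual, predicted):
    """Fraction of datasets where the two category labels are identical."""
    if len(manual) != len(predicted):
        raise ValueError("label lists must be the same length")
    if not manual:
        return 0.0
    hits = sum(m == p for m, p in zip(manual, predicted))
    return hits / len(manual)


def disagreements(manual, predicted):
    """List (index, manual, predicted) for rows where the labels differ."""
    return [
        (i, m, p)
        for i, (m, p) in enumerate(zip(manual, predicted))
        if m != p
    ]


# Made-up labels, purely to show the shape of the inputs.
manual = ["Education", "Tourism", "Transportation", "Education"]
predicted = ["Education", "Tourism", "Education", "Education"]
print(agreement_rate(manual, predicted))
print(disagreements(manual, predicted))
```

The disagreement list is probably the more useful output, since it shows which categories the model tends to confuse.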

JackGilmore commented 1 year ago

No limitation to just Python. If R is the best language to use to implement it then feel free to use R 😊

fozy81 commented 1 year ago

Thanks for the steer on this @JackGilmore. Obviously, R is always the best language 😉 but we'll see. Still quite far off preparing a PR and working out exactly where it'll fit with the existing process. But interesting rabbit hole, topic modelling for the win! 🥇

JackGilmore commented 1 year ago

Just wanted to make sure I wasn't giving you a bum steer after I mentioned last weekend we'd abandoned my wonderful C# code to use Python instead 😉. Looks like some promising work so far. I'm looking forward to seeing more!

fozy81 commented 1 year ago

@KarenJewell @JackGilmore Suggest that instead of topic modelling, we could use Hugging Face 'open source' AI to auto-categorise title + description strings based on the default categories.

ODSCategories.json would still provide the broad categories required, but there would be no need to maintain a keyword list for each category.

Pros:

Cons:

I've tested out some pseudo code which works in principle:


```python
from hugchat import hugchat
from hugchat.login import Login

CATEGORIES = [
    "Food and Environment", "Council and Government", "Elections / Politics",
    "Planning and Development", "Housing and Estates", "Parks / Recreation",
    "Sport and Leisure", "Education", "Transportation", "Law and Licensing",
    "Business and Economy", "Arts / Culture / History", "Tourism",
    "Budget / Finance", "Health and Social Care", "Public Safety",
]


# The function name is a placeholder; the original snippet only showed the body.
def categorise(str_title_description):
    email = "your@email.com"  # pass in as a secret variable from GitHub
    passwd = "your_password"  # pass in as a secret variable from GitHub
    sign = Login(email, passwd)
    # Log in and save cookies to usercookies/<email>.json
    cookies = sign.login()
    sign.saveCookies()
    # Create a ChatBot
    chatbot = hugchat.ChatBot(cookies=cookies.get_dict())
    # Create prompt
    prompt = (
        "Using only the following categories: "
        + ", ".join(f"'{c}'" for c in CATEGORIES)
        + ", tell me very briefly only the categories that best match "
        + "the following description: '" + str_title_description + "'."
    )
    # Return string from chatbot which should contain the categories we wish
    # to identify, even if the original string (title + description) didn't
    # mention the categories:
    bot_categories = chatbot.chat(prompt, is_retry=True, retry_count=5)
    # match_categories is a helper that is not shown in this comment
    categories_result = match_categories(bot_categories)
    return categories_result
```
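The `match_categories` helper called in the snippet above is not shown in the thread. One possible version, offered only as a guess at its intent, scans the chatbot's reply for the known category names and returns the ones it finds. The category list is copied from the prompt; the matching logic is an assumption.

```python
CATEGORIES = [
    "Food and Environment", "Council and Government", "Elections / Politics",
    "Planning and Development", "Housing and Estates", "Parks / Recreation",
    "Sport and Leisure", "Education", "Transportation", "Law and Licensing",
    "Business and Economy", "Arts / Culture / History", "Tourism",
    "Budget / Finance", "Health and Social Care", "Public Safety",
]


def match_categories(bot_reply):
    """Return the known categories mentioned in the reply, in list order.

    Hypothetical implementation: a case-insensitive substring search, so any
    category the chatbot invents or misspells is silently dropped.
    """
    reply = str(bot_reply).lower()
    return [c for c in CATEGORIES if c.lower() in reply]


print(match_categories("The best matches are 'Transportation' and 'Public Safety'."))
```

Filtering against the fixed list means a chatty or off-script reply can never introduce a category that isn't in ODSCategories.json, at the cost of returning an empty list when nothing matches.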