OpenDataScotland / the_od_bods

Collating open data from across Scotland

Extract category keywords from dataset title and description #211

Closed KarenJewell closed 1 year ago

KarenJewell commented 1 year ago

Is your feature request related to a problem? Please describe.
Too many datasets are left as uncategorised. Current categorisation relies on category keywords provided by the publisher, so where the publisher supplies no keywords the dataset cannot be categorised any further in the context of the ODS catalogue. There is a similar ticket, #172, but it is a large one to tackle. This ticket is one step down, a subset of that work: extract keywords from the dataset title and description and use them for categorisation, with the categorisation itself still done by the existing keyword matching in merge_data.py (a rough sketch of the idea follows at the end of this issue).

Additional context
Completion of this ticket leaves #172 as an exploratory piece using unsupervised learning, but this is still a step up in performance until then.
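
A rough sketch of the intended flow, for illustration only: the category_keywords mapping and the extract_categories helper below are hypothetical stand-ins, and the real keyword lists and matching live in merge_data.py.

import re

# Hypothetical mapping; the real category keyword lists live in merge_data.py
category_keywords = {
    "Transport": ["bus", "cycle", "parking", "road"],
    "Environment": ["air quality", "recycling", "flood"],
}

def extract_categories(title, description):
    """Match category keywords against the dataset title and description."""
    text = f"{title} {description}".lower()
    matched = set()
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            # word-boundary match so "bus" does not match "business"
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                matched.add(category)
                break
    return sorted(matched) or ["Uncategorised"]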

nutcracker22 commented 1 year ago

Hi @KarenJewell, I just saw that you're working on this issue. 2-3 months ago I wrote a Job Requirements Analyser, and one of its functions does a good part of what you are describing above: taking a long string (req), removing stopwords, creating ngrams (one to three words), counting their frequency, sorting the list by frequency and returning the result. Just take care with the lines that use the 'qualifications' column of the pandas DataFrame. Maybe this helps; if not, I'm curious about your approach.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download('stopwords')  # run once if the stopword list is not installed

def nltk_processing(req):
    df = pd.DataFrame(req)
    df.columns = ['qualifications']
    stoplist = stopwords.words('english')
    # vectoriser that builds unigrams, bigrams and trigrams, ignoring stopwords
    c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(1, 3))
    # matrix of ngrams
    ngrams = c_vec.fit_transform(df['qualifications'])
    # count frequency of ngrams
    count_values = ngrams.toarray().sum(axis=0)
    # list of ngrams
    vocab = c_vec.vocabulary_
    # sort ngrams by frequency, most frequent first
    df_ngram = pd.DataFrame(sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)
                            ).rename(columns={0: 'frequency', 1: 'bigram/trigram'})

    return df_ngram
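
For context, a call on a couple of dataset titles might look like this (the strings are purely illustrative, not real catalogue entries):

# Example usage on a small list of dataset titles/descriptions
titles = [
    "Air quality monitoring sites in Glasgow",
    "Cycle parking locations across Edinburgh",
]
top_ngrams = nltk_processing(titles)
print(top_ngrams.head(10))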