OpenDataScotland / the_od_bods

Collating open data from across Scotland

Extract category keywords from dataset title and description #211

Closed KarenJewell closed 1 year ago

KarenJewell commented 1 year ago

Is your feature request related to a problem? Please describe.
Too many datasets are left as uncategorised. Current categorisation relies on category keywords provided by the publisher, so where the publisher supplies no keywords the dataset cannot be categorised any further in the context of the ODS catalogue. There is a similar ticket, #172, but it is a large one to tackle. This ticket is one step down, a subset of that work: extract keywords from the dataset title and description and use them for categorisation, with the categorisation itself still done by the existing keyword matching in merge_data.py (a rough sketch of the idea follows at the end of this issue).

Additional context
Completion of this ticket leaves #172 as an exploratory piece using unsupervised learning, but this is still a step up in performance until then.
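
A rough sketch of the intended flow, for illustration only: the category_keywords mapping and the extract_categories helper below are hypothetical stand-ins, and the real keyword lists and matching live in merge_data.py.

import re

# Hypothetical mapping; the real category keyword lists live in merge_data.py
category_keywords = {
    "Transport": ["bus", "cycle", "parking", "road"],
    "Environment": ["air quality", "recycling", "flood"],
}

def extract_categories(title, description):
    """Match category keywords against the dataset title and description."""
    text = f"{title} {description}".lower()
    matched = set()
    for category, keywords in category_keywords.items():
        for keyword in keywords:
            # word-boundary match so "bus" does not match "business"
            if re.search(rf"\b{re.escape(keyword)}\b", text):
                matched.add(category)
                break
    return sorted(matched) or ["Uncategorised"]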

nutcracker22 commented 1 year ago

Hi @KarenJewell, I just saw that you're working on this issue. 2-3 months ago I wrote a Job Requirements Analyser, and one of its functions does a good part of what you are describing above: taking a long string (req), removing stopwords, creating ngrams (one to three words), counting their frequency, sorting the list by frequency and returning the result. Just take care with the lines that use the 'qualifications' column of the pandas DataFrame. Maybe this helps; if not, I'm curious about your approach.

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# nltk.download('stopwords')  # run once if the stopword list is not installed

def nltk_processing(req):
    df = pd.DataFrame(req)
    df.columns = ['qualifications']
    stoplist = stopwords.words('english')
    # vectoriser that builds unigrams, bigrams and trigrams, ignoring stopwords
    c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(1, 3))
    # matrix of ngrams
    ngrams = c_vec.fit_transform(df['qualifications'])
    # count frequency of ngrams
    count_values = ngrams.toarray().sum(axis=0)
    # list of ngrams
    vocab = c_vec.vocabulary_
    # sort ngrams by frequency, most frequent first
    df_ngram = pd.DataFrame(sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)
                            ).rename(columns={0: 'frequency', 1: 'bigram/trigram'})

    return df_ngram
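
For context, a call on a couple of dataset titles might look like this (the strings are purely illustrative, not real catalogue entries):

# Example usage on a small list of dataset titles/descriptions
titles = [
    "Air quality monitoring sites in Glasgow",
    "Cycle parking locations across Edinburgh",
]
top_ngrams = nltk_processing(titles)
print(top_ngrams.head(10))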