Closed KarenJewell closed 1 year ago
Hi @KarenJewell, I just saw that you're working on this issue. Two or three months ago I wrote a Job Requirements Analyser, and one of its functions does a good part of what you describe above: it takes a long string (req), removes stopwords, creates n-grams (one to three words), counts their frequency, sorts the list by frequency and returns the result. Just watch out for the two lines that use the 'qualifications' column of the pandas DataFrame. Maybe this helps; if not, I'm curious about your approach.
import pandas as pd
import nltk
def nltk_processing(req):
df = pd.DataFrame(req)
df.columns = ['qualifications']
stoplist = stopwords.words('english')
print(stoplist)
c_vec = CountVectorizer(stop_words=stoplist, ngram_range=(1, 3))
# matrix of ngrams
ngrams = c_vec.fit_transform(df['qualifications'])
# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)
# list of ngrams
vocab = c_vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i], k) for k, i in vocab.items()], reverse=True)
).rename(columns={0: 'frequency', 1: 'bigram/trigram'})
return df_ngram
Is your feature request related to a problem? Please describe.
Too many datasets are set as uncategorised. This is because the current categorisation uses category keywords provided by the publisher. Where the publisher provides no keywords, we cannot categorise the dataset any further in the context of the ODS catalogue. There is a similar ticket, #172, but it is a large ticket to tackle. This ticket is one step down: a subset that only extracts keywords from the dataset title and description to use for categorisation, with categorisation still using the existing keyword matching system in merge_data.py.
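As a rough illustration of the idea, here is a minimal sketch that pulls the most frequent non-stopword tokens out of a title and description and matches them against per-category keyword sets. The stopword list, the `CATEGORY_KEYWORDS` mapping and both function names are hypothetical and only for illustration; the real keyword lists and matching logic live in merge_data.py and may well work differently.

```python
import re
from collections import Counter

# Hypothetical stopwords and category keywords, for illustration only;
# the real lists would come from the project itself.
STOPWORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}
CATEGORY_KEYWORDS = {
    "Transport": {"bus", "road", "traffic"},
    "Health": {"hospital", "health", "gp"},
}


def extract_keywords(title, description, top_n=10):
    """Return the most frequent non-stopword tokens from title and description."""
    text = f"{title} {description}".lower()
    tokens = [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def categorise(keywords):
    """Return every category whose keyword set overlaps the extracted keywords."""
    found = set(keywords)
    matches = [cat for cat, kws in CATEGORY_KEYWORDS.items() if kws & found]
    return matches or ["Uncategorised"]


keywords = extract_keywords(
    "Bus stops",
    "Locations of bus stops and road traffic counters in the city",
)
print(categorise(keywords))  # ['Transport']
```

A dataset with no publisher keywords would then still get a category whenever its title or description mentions one of the known keywords, and would fall back to "Uncategorised" otherwise.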
Describe the solution you'd like
Describe alternatives you've considered
Additional context
Completing this ticket leaves #172 as an exploratory piece using unsupervised learning, while still giving a step up in performance until then.