Closed KarenJewell closed 1 year ago
At a first glance on a mobile phone this looks good! I take it reducing the remaining uncategorised datasets will be down to fine-tuning the category keywords?
Yep exactly that. That's a whole other "healthcheck" section I've created to monitor that but want to combine it with the stuff you did for organisations: i.e. monitoring the catalogue quality
Great catch! Thanks @JackGilmore
I've made changes as suggested above @JackGilmore and @kymckay and it runs through to JKAN as expected. Have also caught that I'd accidentally removed cleaning from the previously relevant `OriginalTags` and `ManualTags`, so have restored it.
Ran this through again and works now. Thanks!
I imagine we'll iterate on this, but I just wanted to raise one of the things I picked up by randomly checking some dataset categories. The word "wards" is categorised as both "Council and Government" and "Health and Social Care", meaning that datasets that refer to electoral wards in their `title` or `description` are being incorrectly tagged with "Health and Social Care", e.g. most of the polling places and districts datasets. From what I could see in this case, most of these keyword matches came from the `description`, so maybe we want to switch off that keyword matching for now and focus on just matching based on `title`? Alternatively, we could do a first pass using just `title`, and if no categories were picked up then attempt on `description` too?
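That two-pass idea could be sketched roughly like this (a hedged illustration only: the keyword map and function names below are made up, not the actual `merge_data.py` code):

```python
# Illustrative two-pass categorisation: match on title first, and only
# fall back to the description when the title yields no category.
# KEYWORD_MAP is a made-up example, not the real ODSCategories.json.
KEYWORD_MAP = {
    "Council and Government": ["council tax", "electoral"],
    "Education": ["school"],
}

def match_categories(text):
    """Return the set of categories whose keywords appear as substrings."""
    text = text.lower()
    return {cat for cat, kws in KEYWORD_MAP.items()
            if any(kw in text for kw in kws)}

def categorise(title, description):
    cats = match_categories(title)            # first pass: title only
    if not cats:
        cats = match_categories(description)  # fallback: description
    return cats

print(categorise("Polling places and electoral wards", "By school catchment"))
```

Here the ambiguous description keyword never gets a chance to fire because the title already produced a category.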
I can also see that we don't appear to do anything with the `OriginalTags` column anymore. Is it maybe worth trying to map some of these tags to our own ODS tags too?
Not sure if you've included this in your health check, but it would maybe be useful for `merge_data.py` to produce a report for each dataset as it configures the tags, so it's easier to troubleshoot some of the erroneous tags, e.g. "There was a match for `school` in dataset `title` so category `Education` was applied".
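The sort of report line I mean could be generated with something like this (a hypothetical sketch; the function, field names, and keyword map are illustrative, not taken from `merge_data.py`):

```python
# Hypothetical per-dataset troubleshooting report: one line per keyword match,
# saying which field matched and which category was applied as a result.
def explain_matches(title, description, keyword_map):
    """Yield one human-readable troubleshooting line per keyword match."""
    fields = {"title": title.lower(), "description": description.lower()}
    for category, keywords in keyword_map.items():
        for kw in keywords:
            for field_name, text in fields.items():
                if kw in text:
                    yield (f'There was a match for "{kw}" in dataset '
                           f'{field_name} so category "{category}" was applied')

keyword_map = {"Education": ["school"], "Health and Social Care": ["wards"]}
for line in explain_matches("School estate", "Includes electoral wards", keyword_map):
    print(line)
```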
I recognize that some of these things aren't easy fixes and may require some work, so I'm happy for this PR to get merged as-is and we can pick this up and improve it at a later date.
Valid points!

On `title` vs `description`: keeping `description` is the answer. We use `description` because it is context rich; to remove it to fit this one situation is to remove the benefits it brings to other keywords and datasets. Instead, I think the appropriate approach is to remove "ward" as a keyword because you are right, it is too ambiguous to be a useful keyword. I have added a change for this.

On `OriginalTags`: I did consider adding `OriginalTags` to the `Title` and `Description` body of text that is used in the categorisation, but I've decided against this because publisher-provided tags are portal/environment specific. In the same way that ODS tags are not useful in publisher portals, their tags won't make sense in our broader category set. My concern is that keywords used by publishers may create artificial associations. For example, one publisher has a category "enterprise and energy", but our categories split that into "Business and Economy" and "Food and Environment", so every dataset with that tag would get the "Food and Environment" category even if the dataset has naught to do with energy (see screenshot example). So I'm not going to add `OriginalTags` just now, but we will monitor it.

> Not sure if you've included this in your health check but it would maybe be useful for merge_data.py to produce a report for each dataset as it configures the tags so it's easier to troubleshoot some of the erroneous tags e.g. "There was a match for school in dataset title so category Education was applied"
I've predicted your request: we already have this in the `ODSCategories_keywords` column. Healthcheck also monitors popular keywords as well as redundant ones across the catalogue. No change needed.
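For illustration, monitoring popular and redundant keywords boils down to counting how often each keyword fires across the catalogue, and flagging the ones that never do. A minimal sketch (the keyword map and catalogue rows below are made up, not the real healthcheck code):

```python
from collections import Counter

# Made-up keyword map and catalogue titles, for illustration only.
keyword_map = {"Education": ["school"], "Transport": ["bus", "tram"]}
catalogue = ["School roll data", "Bus timetables", "School estate", "Air quality"]

hits = Counter()
for text in catalogue:
    text = text.lower()
    for category, keywords in keyword_map.items():
        for kw in keywords:
            if kw in text:
                hits[kw] += 1

all_keywords = {kw for kws in keyword_map.values() for kw in kws}
redundant = all_keywords - set(hits)  # keywords that never matched anything

print(hits.most_common())  # popular keywords first
print(redundant)           # candidates for removal from the map
```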
Description

Development for issue #211.

- `ODSCategories.json`
- `merge_data.py`
- Changes to `merged_output.csv` columns used in the old categorisation process
- `ODSCategories_keywords` column added to `merged_output.csv` to record which keywords were used for categorisation
- Have included `merged_output.csv` and `merged_output_untidy.csv` in PR as file structure has changed

Motivation and Context
This replaces the categorisation functions in `merge_data.py` to use title+description text as the basis for categorisation instead of publisher-provided categories. This means more datasets can be categorised, as not all publishers provide categorisation, and hopefully the categorisation is more useful, as it is not dependent on the publisher's original portal or corpus-specific context.
The original plan for #211 was to tokenise and match words, but I've deviated from the plan as I realised phrase matching is more useful than word matching. N-grams may have worked, but I was concerned about processing inefficiency. I have chosen to go with a substring match instead, cycling through the keywords in the map to find a match in the dataset title+description text. This puts an incentive on keeping the categorisation keyword map as small as possible.
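One caveat with raw substring matching is that short keywords can fire inside longer words (e.g. "ward" matching inside "Forward"). If that becomes a problem, a word-boundary regex keeps the phrase matching but avoids the false positives, at some extra processing cost. A sketch of the idea, not the current implementation:

```python
import re

def phrase_match(keyword, text):
    """Match a keyword phrase only at word boundaries, so e.g. 'ward'
    does not fire inside 'Forward' or 'awkward'."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(phrase_match("ward", "Electoral ward boundaries"))            # True
print(phrase_match("electoral ward", "Electoral ward boundaries"))  # True: phrases still work
print(phrase_match("ward", "Forward plan of decisions"))            # False
```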
Decided not to cap the number of categories as the eventual distribution is appropriate; the curating of keywords helps keep this down. See in screenshot 2 that the highest number of categories assigned to an asset is 9 out of 16, but the vast majority of assets have 1-3 categories.
This change does add considerable processing time to `merge_data.py` (+12 seconds in tests).
How Has This Been Tested?
Ran with `main.sh`, manually checked the output, and created a healthcheck notebook. See the impact of the changes in the screenshots.
Note: two new modules have been imported, but these are standard library modules in Python, so there's no need to change `requirements.txt`. Good to check on another environment as well, just in case.
Screenshots (if appropriate):
Types of changes
Checklist: