Closed KarenJewell closed 1 year ago
At a first glance on a mobile phone this looks good! I take it reducing the remaining uncategorised datasets will be down to fine-tuning the category keywords?
Yep exactly that. That's a whole other "healthcheck" section I've created to monitor that but want to combine it with the stuff you did for organisations: i.e. monitoring the catalogue quality
Great catch! Thanks @JackGilmore
I've made changes as suggested above @JackGilmore and @kymckay and it runs through to JKAN as expected. Have also caught that I'd accidentally removed cleaning from the previously relevant `OriginalTags` and `ManualTags`, so have restored it.
Ran this through again and works now. Thanks!
I imagine we'll iterate on this, but I just wanted to raise one of the things I picked up by randomly checking some dataset categories. The word "wards" is categorised as both "Council and Government" and "Health and Social Care", meaning that datasets that refer to electoral wards in their `title` or `description` are being incorrectly tagged with "Health and Social Care", e.g. most of the polling places and districts datasets. From what I could see in this case, most of these keyword matches came from the `description`, so maybe we want to switch off that keyword matching for now and focus on just matching based on `title`? Alternatively, we could do a first pass using just `title`, and if no categories were picked up then attempt on `description` too?
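That two-pass idea could be sketched roughly like this (a hedged illustration only: the keyword map and function names below are made up, not the actual `merge_data.py` code):

```python
# Illustrative two-pass categorisation: match on title first, and only
# fall back to the description when the title yields no category.
# KEYWORD_MAP is a made-up example, not the real ODSCategories.json.
KEYWORD_MAP = {
    "Council and Government": ["council tax", "electoral"],
    "Education": ["school"],
}

def match_categories(text):
    """Return the set of categories whose keywords appear as substrings."""
    text = text.lower()
    return {cat for cat, kws in KEYWORD_MAP.items()
            if any(kw in text for kw in kws)}

def categorise(title, description):
    cats = match_categories(title)            # first pass: title only
    if not cats:
        cats = match_categories(description)  # fallback: description
    return cats

print(categorise("Polling places and electoral wards", "By school catchment"))
```

Here the ambiguous description keyword never gets a chance to fire because the title already produced a category.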
I can also see that we don't appear to do anything with the `OriginalTags` column anymore. Is it maybe worth trying to map some of these tags to our own ODS tags too?
Not sure if you've included this in your health check, but it would maybe be useful for `merge_data.py` to produce a report for each dataset as it configures the tags, so it's easier to troubleshoot some of the erroneous tags, e.g. "There was a match for `school` in dataset `title` so category `Education` was applied".
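The sort of report line I mean could be generated with something like this (a hypothetical sketch; the function, field names, and keyword map are illustrative, not taken from `merge_data.py`):

```python
# Hypothetical per-dataset troubleshooting report: one line per keyword match,
# saying which field matched and which category was applied as a result.
def explain_matches(title, description, keyword_map):
    """Yield one human-readable troubleshooting line per keyword match."""
    fields = {"title": title.lower(), "description": description.lower()}
    for category, keywords in keyword_map.items():
        for kw in keywords:
            for field_name, text in fields.items():
                if kw in text:
                    yield (f'There was a match for "{kw}" in dataset '
                           f'{field_name} so category "{category}" was applied')

keyword_map = {"Education": ["school"], "Health and Social Care": ["wards"]}
for line in explain_matches("School estate", "Includes electoral wards", keyword_map):
    print(line)
```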
I recognize that some of these things aren't easy fixes and may require some work, so I'm happy for this PR to get merged as-is and we can pick this up and improve it at a later date.
Valid points!

On `title` vs `description`: keeping `description` is the answer. We use `description` because it is context rich; to remove it to fit this one situation is to remove the benefits it brings to other keywords and datasets. Instead, I think the appropriate approach is to remove "ward" as a keyword because you are right, it is too ambiguous to be a useful keyword. I have added a change for this.

On `OriginalTags`: I did consider adding `OriginalTags` to the `Title` and `Description` body of text that is used in the categorisation, but I've decided against this because publisher-provided tags are portal/environment specific. In the same way that ODS tags are not useful in publisher portals, their tags won't make sense in our broader category set. My concern is that keywords used by publishers may create artificial associations. For example, one publisher has a category "enterprise and energy", but our categories split that into "Business and Economy" and "Food and Environment", so every dataset with that tag would get the "Food and Environment" category even if the dataset has naught to do with energy (see screenshot example). So I'm not going to add `OriginalTags` just now, but we will monitor it.

> Not sure if you've included this in your health check but it would maybe be useful for merge_data.py to produce a report for each dataset as it configures the tags so it's easier to troubleshoot some of the erroneous tags e.g. "There was a match for school in dataset title so category Education was applied"
I've predicted your request: we already have this in the `ODSCategories_keywords` column. Healthcheck also monitors popular keywords as well as redundant ones across the catalogue. No change needed.
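For illustration, monitoring popular and redundant keywords boils down to counting how often each keyword fires across the catalogue, and flagging the ones that never do. A minimal sketch (the keyword map and catalogue rows below are made up, not the real healthcheck code):

```python
from collections import Counter

# Made-up keyword map and catalogue titles, for illustration only.
keyword_map = {"Education": ["school"], "Transport": ["bus", "tram"]}
catalogue = ["School roll data", "Bus timetables", "School estate", "Air quality"]

hits = Counter()
for text in catalogue:
    text = text.lower()
    for category, keywords in keyword_map.items():
        for kw in keywords:
            if kw in text:
                hits[kw] += 1

all_keywords = {kw for kws in keyword_map.values() for kw in kws}
redundant = all_keywords - set(hits)  # keywords that never matched anything

print(hits.most_common())  # popular keywords first
print(redundant)           # candidates for removal from the map
```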
Description

Development for issue #211.

- `ODSCategories.json`
- `merge_data.py`
- Changes to `merged_output.csv` columns used in the old categorisation process
- `ODSCategories_keywords` column added to `merged_output.csv` to record which keywords were used for categorisation
- Have included `merged_output.csv` and `merged_output_untidy.csv` in PR as file structure has changed

Motivation and Context
This replaces the categorisation functions in `merge_data.py` to use title+description text as the basis for categorisation instead of publisher-provided categories. This means more datasets can be categorised, as not all publishers provide categorisation, and hopefully the categorisation is more useful, as it is not dependent on the publisher's original portal or corpus-specific context.
The original plan for #211 was to tokenise and match words, but I've deviated from the plan as I realised phrase matching is more useful than word matching. N-grams may have worked, but I was concerned about processing inefficiency. I have chosen to go with a substring match instead, cycling through the keywords in the map to find a match in the dataset title+description text. This puts an incentive on keeping the categorisation keyword map as small as possible.
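One caveat with raw substring matching is that short keywords can fire inside longer words (e.g. "ward" matching inside "Forward"). If that becomes a problem, a word-boundary regex keeps the phrase matching but avoids the false positives, at some extra processing cost. A sketch of the idea, not the current implementation:

```python
import re

def phrase_match(keyword, text):
    """Match a keyword phrase only at word boundaries, so e.g. 'ward'
    does not fire inside 'Forward' or 'awkward'."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(phrase_match("ward", "Electoral ward boundaries"))            # True
print(phrase_match("electoral ward", "Electoral ward boundaries"))  # True: phrases still work
print(phrase_match("ward", "Forward plan of decisions"))            # False
```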
Decided not to cap the number of categories as the eventual distribution is appropriate; the curating of keywords helps keep this down. See in screenshot 2 that the highest number of categories assigned to an asset is 9 out of 16, but the vast majority of assets have 1-3 categories.
This change does add considerable processing time to `merge_data.py` (+12 seconds in tests).
How Has This Been Tested?
Ran with `main.sh`, manually checked the output, and created a healthcheck notebook. See the impact of the changes in the screenshots.
Note: two new modules have been imported, but these are standard library modules in Python, so there's no need to change `requirements.txt`. Good to check on another environment as well, just in case.
Screenshots (if appropriate):
Types of changes
Checklist: