alphagov / govuk-content-metadata

GovNER: an encoder-based language model (RoBERTa) fine-tuned to perform Named Entity Recognition (NER) on GOV.UK content
MIT License

Fix training sample batch2 #21

Closed by exfalsoquodlibet 2 years ago

exfalsoquodlibet commented 2 years ago

Summary

Hacky code to fix the (incremental) training set. The sentences were accidentally segmented for step 2, when we annotated for the extra 8 categories: GPE, ORG PN, PERSON PN, POSTCODE, EMAIL, PHONE N, DATE, MONEY £

Because of that segmentation, a workaround is needed so that these annotations can be merged with the original set annotated for FORM.

This code provides that workaround, and also de-duplicates the set.
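For reviewers, the idea can be illustrated with a minimal sketch. This is not the code in this PR: it assumes each example is a dict with a `text` string and a list of `spans` carrying character offsets (`start`, `end`, `label`), and the helper names `merge_segmented` and `deduplicate` are hypothetical. It shifts sentence-level spans back onto the original text's offsets, then drops exact duplicates.

```python
from typing import Dict, List


def merge_segmented(original: Dict, sentences: List[Dict]) -> Dict:
    """Map spans annotated on segmented sentences back onto the original text.

    Assumes the sentences appear in the original text verbatim and in order.
    """
    merged = {"text": original["text"], "spans": list(original.get("spans", []))}
    cursor = 0
    for sent in sentences:
        # Locate the sentence in the original text, searching from the end
        # of the previous sentence so repeated sentences are not confused.
        offset = original["text"].find(sent["text"], cursor)
        if offset == -1:
            raise ValueError(f"Sentence not found in original text: {sent['text']!r}")
        for span in sent.get("spans", []):
            merged["spans"].append(
                {
                    "start": span["start"] + offset,
                    "end": span["end"] + offset,
                    "label": span["label"],
                }
            )
        cursor = offset + len(sent["text"])
    merged["spans"].sort(key=lambda s: (s["start"], s["end"]))
    return merged


def deduplicate(examples: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each (text, spans) combination."""
    seen = set()
    unique = []
    for ex in examples:
        key = (
            ex["text"],
            tuple((s["start"], s["end"], s["label"]) for s in ex["spans"]),
        )
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Illustrative data only (not taken from the real training set).
original = {
    "text": "Fill in form SA100. Send it to Cardiff by 31 January.",
    "spans": [{"start": 13, "end": 18, "label": "FORM"}],
}
sentences = [
    {"text": "Fill in form SA100.", "spans": []},
    {
        "text": "Send it to Cardiff by 31 January.",
        "spans": [
            {"start": 11, "end": 18, "label": "GPE"},
            {"start": 22, "end": 32, "label": "DATE"},
        ],
    },
]

merged = merge_segmented(original, sentences)
unique = deduplicate([merged, merged])
```

Searching from a moving cursor rather than from the start of the text is a deliberate choice in this sketch: it keeps the mapping correct when the same sentence occurs more than once in a document.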

Checklists

This pull/merge request meets the following requirements:

Comments have been added below around the incomplete checks.