alphagov / govuk-content-metadata

GovNER: an encoder-based language model (RoBERTa) fine-tuned to perform Named Entity Recognition (NER) on GOV.UK content
MIT License

Phase2 training pipe #71

Closed exfalsoquodlibet closed 1 year ago

exfalsoquodlibet commented 1 year ago

Summary

Add your summary here - keep it brief, to the point, and in plain English. For further information about pull requests, check out the GDS Way.

Checklists

This pull/merge request meets the following requirements:

Comments have been added below around the incomplete checks.

exfalsoquodlibet commented 1 year ago

Thanks for the review, @rory-hurley-gds. training_pipe/phase2_ner/src/preprocess.py converts annotated data from Prodigy to spaCy's binary format. It is taken from: https://github.com/explosion/projects/blob/v3/tutorials/ner_fashion_brands/scripts/preprocess.py
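
For readers unfamiliar with this conversion, here is a minimal sketch of what a Prodigy-to-spaCy preprocessing step generally looks like. It is not the contents of preprocess.py in this PR. It assumes the annotations are exported as JSONL records with "text", "spans" and "answer" fields, and that only accepted records should be kept. The function names (read_accepted, to_docbin), the sample sentence and the ORG/GPE labels are hypothetical and chosen only for illustration.

```python
import json


def read_accepted(lines):
    """Yield (text, [(start, end, label), ...]) for accepted records.

    Assumes each line is one JSON record with "text", "spans" and
    "answer" fields (an assumption about the export format).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Keep only examples the annotator accepted.
        if record.get("answer") != "accept":
            continue
        spans = [(s["start"], s["end"], s["label"]) for s in record.get("spans", [])]
        yield record["text"], spans


def to_docbin(examples, lang="en"):
    """Pack (text, spans) pairs into a spaCy DocBin. Requires spaCy v3."""
    import spacy
    from spacy.tokens import DocBin
    from spacy.util import filter_spans

    nlp = spacy.blank(lang)
    doc_bin = DocBin()
    for text, spans in examples:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            # char_span returns None when offsets do not align with token boundaries.
            if span is not None:
                ents.append(span)
        # doc.ents rejects overlapping spans, so drop overlaps first.
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    return doc_bin


# Hypothetical sample input: one accepted record and one rejected record.
sample = [
    '{"text": "Contact HMRC in London.", "spans": ['
    '{"start": 8, "end": 12, "label": "ORG"}, '
    '{"start": 16, "end": 22, "label": "GPE"}], "answer": "accept"}',
    '{"text": "Ignored example.", "spans": [], "answer": "reject"}',
]
examples = list(read_accepted(sample))
```

With spaCy installed, to_docbin(examples).to_disk("train.spacy") would write the binary file that spaCy's training command reads; the output filename here is only an example.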