alphagov / govuk-content-metadata

GovNER: an encoder-based language model (RoBERTa) fine-tuned to perform Named Entity Recognition (NER) on GOV.UK content
MIT License

added src/utils/stratify_train_test_split_entities.py #75

Closed exfalsoquodlibet closed 1 year ago

exfalsoquodlibet commented 1 year ago

Summary

Added a Python module that splits samples into train and dev sets, stratified by entity category. I'll add tests when I'm back.

To use it, run from the terminal:

python src/utils/stratify_train_test_split_entities.py "path/to/dataset_to_split.jsonl" "path/to/folder/where/to/save/outputs" 0.2
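
For readers unfamiliar with the idea, here is a minimal sketch of one way to stratify a split by entity category. It is not the module's actual implementation, which this PR description does not show. It assumes each sample is a dict with a "spans" list whose items carry a "label" key (a Prodigy-style layout, assumed here), and that the final argument in the command above (0.2) is the fraction held out for the dev set. The function name `stratified_split`, the `seed` parameter, and the "NO_ENTITY" bucket are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(samples, test_size=0.2, seed=0):
    """Return (train, dev), stratified by each sample's rarest entity label.

    Hypothetical helper: assumes samples look like
    {"text": ..., "spans": [{"label": "ORG", ...}, ...]}.
    """
    # Count how often each entity label occurs across the whole dataset.
    label_counts = Counter(
        span["label"] for s in samples for span in s.get("spans", [])
    )

    # Assign each sample to one stratum: its rarest label (ties broken by name),
    # so that infrequent categories are represented in both splits.
    strata = defaultdict(list)
    for s in samples:
        labels = {span["label"] for span in s.get("spans", [])}
        if labels:
            key = min(labels, key=lambda lab: (label_counts[lab], lab))
        else:
            key = "NO_ENTITY"  # samples without any annotated entity
        strata[key].append(s)

    # Shuffle within each stratum and hold out the requested fraction.
    rng = random.Random(seed)
    train, dev = [], []
    for key in sorted(strata):
        group = list(strata[key])
        rng.shuffle(group)
        n_dev = round(len(group) * test_size)
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev
```

Because a sample can contain several entity categories, exact stratification on every label at once is not generally possible; keying on the rarest label is one simple compromise, and the real module may well make a different choice.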

Checklists

This pull/merge request meets the following requirements:

Comments have been added below around the incomplete checks.