Living-with-machines / lwmdb

A django-based library for managing the Living with Machines newspapers metadata database schema
https://living-with-machines.github.io/lwmdb/
MIT License

Ensure `publication_code`, `issue_code` and `item_code` uniqueness #174

Open griff-rees opened 1 year ago

griff-rees commented 1 year ago

A recent check of publication uniqueness suggests there are 76 duplicated newspaper publication_code values (each shared with just 1 other record, so a count of 2).

These might be cases of multiple editions of an issue on the same day (following @kmcdono2 in #120), or actual duplicate records (meaning... just wrong). I think the majority of the publication_code cases are the latter (and thankfully quite a few have no related issues, and by extension no related items):

>>> from django.db.models import QuerySet
>>> from newspaper.models import Newspaper, Issue, Item
>>> from lwmdb.utils import similar_records

>>> newspaper_same_codes: QuerySet = similar_records(Newspaper.objects.all(), check_fields=('publication_code',))
>>> issue_same_codes: QuerySet = similar_records(Issue.objects.all(), check_fields=('issue_code',))
>>> item_same_codes: QuerySet = similar_records(Item.objects.all(), check_fields=('item_code',))
>>> len(newspaper_same_codes)
76
>>> len(issue_same_codes)
81520
>>> len(item_same_codes)
3670454
>>> all(record for record in newspaper_same_codes if record['id__count'] == 2)
True
>>> all(record for record in issue_same_codes if record['id__count'] == 2)
True
>>> all(record for record in item_same_codes if record['id__count'] == 2)
True

griff-rees commented 1 year ago

see #55 and #93

griff-rees commented 11 months ago

Updated the description to make it easier to split this into separate tasks.

griff-rees commented 11 months ago

#119 is also related.