Closed mialondon closed 1 year ago
Like I've said in other contexts, this follows the convention of alto2txt, i.e. the confusion lies a lot deeper than the lwmdb so rectifying this would mean restructuring the entire storage space system.
Perhaps we could re-frame this slightly within the spirit of radical collaboration... this is a long-standing, repeated request from the institution whose staff will have to spend the next 20 years correcting a misunderstanding that results from a naming decision made during a technical import process.
All newspapers except for JISC were digitised by the British Newspaper Archive, following the same imaging and post-capture processes. Within that group, the difference is the funding source - FMP, LwM or the BL directly through HMD. Elsewhere we've started saying 'newspapers digitised by the British Library' instead of 'HMD' and variations, as using an internal department/project name is a recipe for future confusion.
How can we help resolve this issue so that the source of the papers is clear to people using the data or reading resulting publications?
e.g. could we expand the names in the field, or rename the field as 'project code'?
Looping in @thobson88, @DavidBeavan and @griff-rees who would probably be the best ones to address this, as I see two venues:

`bna` designation from? just the folder structure? if so, we could likely just change it, but we'd need to be consistent in renaming across all the data repositories (in the widest sense, i.e. GitHub, local codebases and storages, Azure, Google Drive)

Thanks @mialondon and @kallewesterling. Some initial thoughts:
The `name` field in the `DataProvider` class is currently populated with

* `bna`
* `hmd`
* `jisc`
* `lwm`

I think of these as abbreviations (or even slugs) rather than names, but for the sake of ease with the current pipeline would adding text fields like the following help?

* `full_name`
* `description`
However: as implied by the conversation above, I think there's more to this, and I'm afraid I don't know the digitisation process well enough to advise on that. How these different categories should be interpreted in analysis (and the risks of how they could be misinterpreted) comes to mind.
@kallewesterling: your plot in the example Jupyter notebook is a classic example (current rendering, prior to concluding egress):
@griff-rees at the moment, the different names imply a greater difference than exists in reality. They all came from the same sausage factory, except for JISC which came from a different millennium (and isn't singled out here).
Thanks @mialondon. I thought that might be the case. Would having additional information at the database level like the following be helpful?

* `full_name`
* `description`
* `provider`

Here `provider` for `lwm` and `hmd` would be listed as British Newspaper Archive. We could then present that in the interface as

* British Newspaper Archive
* Living with Machines (provided by the British Newspaper Archive)

etc?
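A minimal sketch of how those fields could feed the interface label, written with plain Python dataclasses so it runs without a configured Django project. The class `DataProviderSketch`, the helper `display_label` and the example values are assumptions for illustration, not code from lwmdb.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataProviderSketch:
    """Hypothetical stand-in for the DataProvider model discussed above."""

    name: str                       # existing short code, e.g. "lwm"
    full_name: str                  # proposed human-readable name
    description: str = ""           # proposed free-text description
    provider: Optional[str] = None  # proposed: who supplied the scans


def display_label(dp: DataProviderSketch) -> str:
    """Build the interface label, naming the provider only when it adds information."""
    if dp.provider and dp.provider != dp.full_name:
        return f"{dp.full_name} (provided by the {dp.provider})"
    return dp.full_name


BNA = "British Newspaper Archive"
providers = [
    DataProviderSketch(name="bna", full_name=BNA, provider=BNA),
    DataProviderSketch(name="lwm", full_name="Living with Machines", provider=BNA),
]

for dp in providers:
    print(display_label(dp))
```

Keeping `provider` separate from `full_name` means the label can be changed later without touching the short codes the pipeline relies on.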
As a pragmatic solution, could we rename `bna` to `fmp`, giving:

* `fmp`
* `jisc`
* `hmd`
* `lwm`

then `bna` = `fmp` + `hmd` + `lwm`?
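One way to read that proposal is that `bna` stops being a stored value and becomes a derived grouping. A rough sketch, assuming the four codes above are the only ones in use; the names `BNA_DIGITISED`, `ALL_CODES` and `digitised_by_bna` are hypothetical.

```python
# Hypothetical grouping: the codes follow the rename proposed above.
BNA_DIGITISED = frozenset({"fmp", "hmd", "lwm"})
ALL_CODES = BNA_DIGITISED | {"jisc"}


def digitised_by_bna(code: str) -> bool:
    """Return True when a collection code belongs to the derived `bna` group."""
    if code not in ALL_CODES:
        raise ValueError(f"unknown collection code: {code!r}")
    return code in BNA_DIGITISED


print(digitised_by_bna("lwm"))
print(digitised_by_bna("jisc"))
```

Rejecting unknown codes, including the old `bna`, would surface any data that has not yet been relabelled.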
We have been here before and there are some pain points: we can't rely entirely on the intrinsic info from our providers, as in some cases it's not correct.
FWIW Dave's suggestion makes sense to me: all the newspapers are from the BL collection, so what matters is less who digitised them or where than under which digitisation programme. Hence 'HMD', 'LwM', 'JISC' and 'FMP' are meaningful flavours of 'BL newspapers', in part because they connote different access rights. BNA is a misleading label as it's just a sub-brand of FMP and the name of their (current) web portal.
Rights change over time, and over context, so that doesn't really hold.
As a pragmatic solution, could we rename `bna` to `fmp`, giving:

* `fmp`
* `jisc`
* `BL_hmd`
* `BL_lwm`

Or something?
Rights change over time, and over context, so that doesn't really hold.
Thanks @mialondon. Some additional fields to store changes over time:
```python
created_date = models.DateTimeField(auto_now_add=True)
modified_date = models.DateTimeField(auto_now=True)
```
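For anyone unfamiliar with those two options: in Django, `auto_now_add` sets the field once when the object is first created, and `auto_now` refreshes it every time the object is saved. The plain-Python imitation below is only a sketch of that behaviour; `TimestampedRecord` and its `save` method are hypothetical and not part of lwmdb or Django.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimestampedRecord:
    """Hypothetical plain-Python imitation of the two Django fields above."""

    name: str
    created_date: Optional[datetime] = field(default=None, init=False)
    modified_date: Optional[datetime] = field(default=None, init=False)

    def save(self) -> None:
        now = datetime.now(timezone.utc)
        if self.created_date is None:
            # Like auto_now_add: set once, on the first save only.
            self.created_date = now
        # Like auto_now: refreshed on every save.
        self.modified_date = now


record = TimestampedRecord(name="bl_hmd")
record.save()
first_created = record.created_date
record.save()
```

Note that these fields record when a row was created and last saved; they do not keep earlier values, so tracking how rights changed over time would need something more.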
As a pragmatic solution, could we rename `bna` to `fmp`, giving:

* `fmp`
* `jisc`
* `BL_hmd`
* `BL_lwm`

Or something?
Can we confirm and close this?
So for the future: for each source we will define
`BL_hmd`)

Data 'provider' descriptions:

* `fmp`: FindMyPast-funded digitised newspapers provided by the British Newspaper Archive
* `jisc`: JISC-funded digitised newspapers
* `bl_hmd`: British Library-funded digitised newspapers processed by the British Newspaper Archive
* `bl_lwm`: Living with Machines-funded digitised newspapers processed by the British Newspaper Archive
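As a rough sketch of what the relabelling could look like in code, the mapping below pairs each old code with its new one and carries the descriptions listed above. The names `RENAMES`, `DESCRIPTIONS` and `migrate_code` are hypothetical; the actual change may be implemented quite differently.

```python
# Old code -> new code, following the renames discussed in this thread.
RENAMES = {
    "bna": "fmp",
    "jisc": "jisc",
    "hmd": "bl_hmd",
    "lwm": "bl_lwm",
}

# Descriptions copied from the list above.
DESCRIPTIONS = {
    "fmp": "FindMyPast-funded digitised newspapers provided by the British Newspaper Archive",
    "jisc": "JISC-funded digitised newspapers",
    "bl_hmd": "British Library-funded digitised newspapers processed by the British Newspaper Archive",
    "bl_lwm": "Living with Machines-funded digitised newspapers processed by the British Newspaper Archive",
}


def migrate_code(old_code: str) -> str:
    """Translate an old collection code to its new equivalent."""
    try:
        return RENAMES[old_code]
    except KeyError:
        raise ValueError(f"no rename defined for {old_code!r}") from None


for old in RENAMES:
    new = migrate_code(old)
    print(f"{old} -> {new}: {DESCRIPTIONS[new]}")
```

The same mapping would need applying wherever the old codes appear, which, as noted earlier in the thread, goes beyond the database itself.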
@griff-rees hi! Can this be done before Wednesday?
Data 'provider' descriptions:

* `fmp`: FindMyPast-funded digitised newspapers provided by the British Newspaper Archive
* `jisc`: JISC-funded digitised newspapers
* `bl_hmd`: British Library-funded digitised newspapers processed by the British Newspaper Archive
* `bl_lwm`: Living with Machines-funded digitised newspapers processed by the British Newspaper Archive
@griff-rees a slight tweak to the text, here
https://github.com/Living-with-machines/lwmdb/pull/154 should resolve this.
Names like 'bna' risk being confusing as all newspapers are bna.