Data provider names

mialondon commented 1 year ago

Names like 'bna' risk being confusing as all newspapers are bna.

kallewesterling commented 1 year ago

Like I've said in other contexts, this follows the convention of alto2txt, i.e. the confusion lies a lot deeper than the lwmdb so rectifying this would mean restructuring the entire storage space system.

mialondon commented 1 year ago

Perhaps we could re-frame this slightly within the spirit of radical collaboration... this is a long-standing, repeated request from the institution whose staff will have to spend the next 20 years correcting a misunderstanding that results from a naming decision made during a technical import process.

All newspapers except for JISC were digitised by the British Newspaper Archive, following the same imaging and post-capture processes. The only difference among them is the funding source: FMP, LwM or the BL directly through HMD. Elsewhere we've started saying 'newspapers digitised by the British Library' instead of 'HMD' and variations, as using an internal department/project name is a recipe for future confusion.

How can we help resolve this issue so that the source of the papers is clear to people using the data or reading resulting publications?

e.g. could we expand the names in the field, or rename the field as 'project code'?

kallewesterling commented 1 year ago

Looping in @thobson88, @DavidBeavan and @griff-rees who would probably be the best ones to address this, as I see two avenues:

* alto2txt: is it dependent on this info? Where does the package pick up the bna designation from? Just the folder structure? If so, we could likely just change it, but we'd need to be consistent in renaming across all the data repositories (in the widest sense, i.e. GitHub, local codebases and storage, Azure, Google Drive).
* lwmdb: this could potentially also be resolved, as @mialondon suggests above, by renaming or adding fields in the schema, but how would this affect the timeline + MVP planning that was done last week, @griff-rees?

griff-rees commented 1 year ago

Thanks @mialondon and @kallewesterling. Some initial thoughts:

The name field in the DataProvider class is currently populated with

bna
hmd
jisc
lwm

I think of these as abbreviations (or even slugs) rather than names, but for the sake of ease with the current pipeline would adding text fields like the following help?

full_name
description

However, as implied by the conversation above, I think there's more to this, and I'm afraid I don't know the digitisation process well enough to advise on that. What comes to mind is how these different categories should be interpreted in analysis, and the risk of how they could be misinterpreted.

@kallewesterling: your plot in the example Jupyter notebook is a classic example (current rendering, prior to concluding egress):

mialondon commented 1 year ago

@griff-rees at the moment, the different names imply a greater difference than exists in reality. They all came from the same sausage factory, except for JISC, which came from a different millennium (and isn't singled out here).

griff-rees commented 1 year ago

Thanks @mialondon. I thought that might be the case. Would having additional information at the database level like the following be helpful:

full_name
description
provider

? Here provider for lwm and hmd would be listed as British Newspaper Archive. We could then present that in the interface as

British Newspaper Archive
Living with Machines (provided by the British Newspaper Archive)

etc?
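
A minimal sketch of how those extra fields could drive the interface labels, written in plain Python rather than as the actual Django model. Only the field names `full_name`, `description` and `provider` come from the suggestion above; the `ProviderRecord` class and the `display_label` helper are hypothetical names used for illustration and are not part of lwmdb.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProviderRecord:
    """Hypothetical stand-in for a DataProvider row (not the lwmdb model)."""

    name: str  # short code or slug, e.g. "lwm"
    full_name: str  # human-readable name
    description: str = ""
    provider: Optional[str] = None  # organisation that supplied the scans, if distinct


def display_label(record: ProviderRecord) -> str:
    """Build the label an interface could show for a record."""
    if record.provider and record.provider != record.full_name:
        return f"{record.full_name} (provided by the {record.provider})"
    return record.full_name


bna = ProviderRecord(name="bna", full_name="British Newspaper Archive")
lwm = ProviderRecord(
    name="lwm",
    full_name="Living with Machines",
    provider="British Newspaper Archive",
)

print(display_label(bna))
print(display_label(lwm))
```

Keeping the short code in `name` and deriving the label from the other fields would leave the existing pipeline untouched while still showing readers where the scans came from.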

DavidBeavan commented 1 year ago

As a pragmatic solution, could we rename bna to fmp giving:

fmp
jisc
hmd
lwm

then bna = fmp + hmd + lwm?
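
One way to read that last line, sketched in Python: `bna` would stop being a stored code and become a derived grouping of the three other codes. The constant and function names below are hypothetical, and the grouping is an interpretation of the suggestion rather than anything implemented in lwmdb.

```python
# The four codes proposed in this comment.
PROVIDER_CODES = frozenset({"fmp", "jisc", "hmd", "lwm"})

# "bna = fmp + hmd + lwm": an assumed derived group, not a stored code.
BNA_GROUP = frozenset({"fmp", "hmd", "lwm"})


def in_bna_group(code: str) -> bool:
    """Return True if the code belongs to the derived bna grouping."""
    if code not in PROVIDER_CODES:
        raise ValueError(f"unknown provider code: {code!r}")
    return code in BNA_GROUP


print(sorted(BNA_GROUP))
print(in_bna_group("jisc"))
```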

DavidBeavan commented 1 year ago

We have been here before and there are some pain points: we can't rely entirely on the intrinsic info from our providers, as in some cases it's not correct.

dcsw2 commented 1 year ago

FWIW Dave's suggestion makes sense to me: all the newspapers are from the BL collection, so what matters is less who digitised them, or where, than under which digitisation programme. Hence 'HMD', 'LwM', 'JISC' and 'FMP' are meaningful flavours of 'BL newspapers', in part because they connote different access rights. BNA is a misleading label as it's just a sub-brand of FMP and the name of their (current) web portal.

mialondon commented 1 year ago

Rights change over time, and over context, so that doesn't really hold.

mialondon commented 1 year ago

As a pragmatic solution, could we rename bna to fmp giving:
* `fmp`
* `jisc`
* `BL_hmd`
* `BL_lwm`

Or something?

griff-rees commented 1 year ago

> Rights change over time, and over context, so that doesn't really hold.

Thanks @mialondon. Some additional fields to store changes over time:

created_date = models.DateTimeField(auto_now_add=True)
modified_date = models.DateTimeField(auto_now=True)

mialondon commented 1 year ago

> As a pragmatic solution, could we rename bna to fmp giving:
>
> * `fmp`
> * `jisc`
> * `BL_hmd`
> * `BL_lwm`
>
> Or something?

Can we confirm and close this?

griff-rees commented 1 year ago

So for the future: for each source we will define

name: (like BL_hmd)
description

mialondon commented 1 year ago

Data 'provider' descriptions:

* fmp: FindMyPast-funded digitised newspapers provided by the British Newspaper Archive
* jisc: JISC-funded digitised newspapers
* bl_hmd: British Library-funded digitised newspapers processed by the British Newspaper Archive
* bl_lwm: Living with Machines-funded digitised newspapers processed by the British Newspaper Archive
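
A minimal Python sketch of how the old codes could be mapped to the agreed names and descriptions. The old-to-new mapping is inferred from the suggestions earlier in this thread, and `RENAMES`, `DESCRIPTIONS` and `rename_provider` are hypothetical names; this is not necessarily how lwmdb implements the change.

```python
from typing import Dict, Tuple

# Old codes (following the alto2txt convention) mapped to the names agreed here.
# This mapping is an assumption pieced together from the thread.
RENAMES: Dict[str, str] = {
    "bna": "fmp",
    "jisc": "jisc",
    "hmd": "bl_hmd",
    "lwm": "bl_lwm",
}

# Descriptions copied from the comment above.
DESCRIPTIONS: Dict[str, str] = {
    "fmp": "FindMyPast-funded digitised newspapers provided by the British Newspaper Archive",
    "jisc": "JISC-funded digitised newspapers",
    "bl_hmd": "British Library-funded digitised newspapers processed by the British Newspaper Archive",
    "bl_lwm": "Living with Machines-funded digitised newspapers processed by the British Newspaper Archive",
}


def rename_provider(old_code: str) -> Tuple[str, str]:
    """Return the new name and its description for an old provider code."""
    new_name = RENAMES[old_code]  # raises KeyError for an unknown code
    return new_name, DESCRIPTIONS[new_name]


for old in ("bna", "jisc", "hmd", "lwm"):
    name, description = rename_provider(old)
    print(f"{old} -> {name}: {description}")
```

Keeping the rename table separate from the descriptions means the wording can be tweaked later without touching the codes themselves.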

mialondon commented 1 year ago

@griff-rees hi! Can this be done before Wednesday?

mialondon commented 1 year ago

Data 'provider' descriptions:

* fmp: FindMyPast-funded digitised newspapers provided by the British Newspaper Archive
* jisc: JISC-funded digitised newspapers
* bl_hmd: British Library-funded digitised newspapers processed by the British Newspaper Archive
* bl_lwm: Living with Machines-funded digitised newspapers processed by the British Newspaper Archive

@griff-rees a slight tweak to the text, here

griff-rees commented 1 year ago

https://github.com/Living-with-machines/lwmdb/pull/154 should resolve this.

Living-with-machines / lwmdb

Data provider names #92