Publisher name canonicalisation

richard-jones commented 9 years ago

Look at the list of publishers, identify those whose names are variants of the same publisher, and put a process in place to normalise them.

This is NOT a solution to the name variants problem, it is a stop-gap for the purposes of the demonstrator project.

ostephens commented 9 years ago

If someone created a dump of all publisher names across the aggregation, I'd be happy to take that and create a mapping to canonical forms, if that would help.

richard-jones commented 9 years ago

Ok, that would be cool. I have a big pile of data to import into the system soonish, and when I've done that I'll do a report on all of the unique publisher names in the system and pass it over!

ostephens commented 9 years ago

Great. If you could include ISSNs in the dump (so one line per ISSN/publisher name variation combo) that would be ideal, as it would help ensure I normalise the names in the same way for the same title.
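As a rough illustration of why the ISSN column helps, here is a minimal sketch that groups publisher name variants by ISSN and flags any ISSN with more than one spelling. The column names (`issn`, `publisher`) and the sample rows are made up for the example; the real dump may look different.

```python
import csv
import io
from collections import defaultdict


def variants_by_issn(rows):
    """Collect the set of publisher name strings seen for each ISSN."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["issn"]].add(row["publisher"])
    return grouped


# Hypothetical dump: one line per ISSN / publisher name combination.
# The ISSNs and names are dummy values, not real data.
sample_dump = """issn,publisher
1234-5678,Example Press
1234-5678,Example Press Ltd
8765-4321,Sample Society
"""

grouped = variants_by_issn(csv.DictReader(io.StringIO(sample_dump)))

# ISSNs with more than one name variant need a single canonical form.
conflicts = {issn: sorted(names) for issn, names in grouped.items() if len(names) > 1}
print(conflicts)
```

Any ISSN that shows up in `conflicts` is a title where the variants should all map to the same canonical name.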

ghost commented 9 years ago

FYI I'm doing some normalisation before passing the files on to Richard, but only within an individual sheet rather than between sheets.

ostephens commented 9 years ago

Thanks Stuart. Do you have a list of the 'preferred' names you use? Or do you just normalise to the most common one within a sheet?

ghost commented 9 years ago

Normalise to the most common one within a sheet, and fix spelling mistakes!
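For anyone wanting to script the "most common one within a sheet" rule, a minimal sketch might look like the following. The grouping key (lowercase, punctuation stripped) is an assumption made for the example, not a description of how the sheets are actually cleaned, and it does not attempt to fix spelling mistakes.

```python
import re
from collections import Counter, defaultdict


def crude_key(name):
    """Lowercase and drop punctuation so near-identical strings share a key.

    This key is an assumption for the sketch; real variants may need more.
    """
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def normalise_sheet(names):
    """Replace each name with the most common spelling in its group."""
    groups = defaultdict(Counter)
    for name in names:
        groups[crude_key(name)][name] += 1
    preferred = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [preferred[crude_key(name)] for name in names]


# Dummy sheet contents for illustration only.
sheet = ["Example Press", "Example press", "Example Press", "Sample Society."]
print(normalise_sheet(sheet))
```

Because this only works within one list, it matches the "within an individual sheet rather than between sheets" caveat: two sheets can still settle on different preferred forms.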

ostephens commented 9 years ago

I've extracted the full list of publishers from all data currently on the aggregation - that's 8662 records.

There are 393 unique publisher strings in this file. Having done a little bit of work on these and examined what we have by eye, I'd give a very rough estimate that there might be closer to 100 publishers represented (I may be way out, but I'm sure it is closer to 200 than 400).

I can clean these up by hand without a huge amount of effort but while this would be a quick and easy win for the current aggregation I'd like to look at slightly more systematic options for getting normalised forms of publisher name. So I propose:

1) I clean up by hand, and we push these values back into the aggregation for our demonstrator
2) I investigate how we might systematise clean up of publisher name and report back

My first thought for approaching (2) is:

a) As most records don't have ISSNs but do have DOIs, use the DOIs to get the ISSNs via CrossRef
b) If this works, then use the ISSNs to look up publisher names from another source (probably either GOKb or Sherpa)

We can then compare what we get to my hand crafted cleanup and see if we've achieved a roughly similar outcome
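Step (a) could be sketched roughly as below. This assumes the public CrossRef REST API at `https://api.crossref.org/works/<DOI>` and that the response carries `ISSN` and `publisher` fields under `message`; the function names and the sample record are made up for illustration, and the GOKb/Sherpa lookup in step (b) is not shown.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the CrossRef REST API works route.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref(doi):
    """Fetch the CrossRef record for a DOI (makes a network call)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def extract_issns(crossref_record):
    """Return (list of ISSNs, publisher name or None) from a parsed record."""
    message = crossref_record.get("message", {})
    return message.get("ISSN", []), message.get("publisher")


# Canned record with dummy values, shaped like the assumed response,
# so the parsing step can be tried without hitting the network.
sample_record = {
    "status": "ok",
    "message": {
        "DOI": "10.9999/example",
        "ISSN": ["1234-5678"],
        "publisher": "Example Press",
    },
}

issns, crossref_publisher = extract_issns(sample_record)
print(issns, crossref_publisher)
```

In real use, `extract_issns(fetch_crossref(doi))` would be run per record, and records where CrossRef returns no ISSN would need to fall back to the hand-crafted mapping.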

I'd originally wondered if we could just look up the publisher strings against an authority file with lots of variant publisher names in it (GOKb has published such a file). However, looking at the data we have, I'm already pretty convinced this isn't going to work: the publisher name strings in our data are just too varied (sometimes incorporating both publisher and imprint, sometimes using the name of the related Learned Society, sometimes using the journal name instead of the publisher).

Any other ideas or comments on this approach very welcome.

richard-jones commented 9 years ago

If you want to do the name normalisation in some mapping file (possibly expressed in json), then I can add this to the functionality of the aggregation - both to normalise during ingest, and also to run over the dataset periodically. We could then just add to that mapping file over time, and keep/get the data in shape as we go along.

Either of these forms would be fine/easy:

A list of names, and the canonical names they map to (i.e. key/value pairs)
A list of canonical names, and the non-canonical forms that they may have (i.e. key/list set)

Otherwise, if you want to maintain a spreadsheet, I can write something to load it into the system for use during processing.
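To make the two mapping forms concrete, here is a minimal sketch that starts from the second form (canonical name to list of variants), inverts it into the first form (variant to canonical name), and applies it to a single string. The publisher names and function names are invented for the example; this is not the aggregation's actual code.

```python
import json

# Second form: canonical name -> list of non-canonical variants.
# The names here are dummy values for illustration.
canonical_to_variants = json.loads("""
{
  "Example Press": ["Example Press Ltd", "Example Pr."],
  "Sample Society": ["The Sample Society"]
}
""")


def invert(mapping):
    """Build the first form: every known name -> its canonical name."""
    lookup = {}
    for canonical, variants in mapping.items():
        lookup[canonical] = canonical
        for variant in variants:
            lookup[variant] = canonical
    return lookup


def normalise(name, lookup):
    """Return the canonical form, or the trimmed input if it is unknown."""
    cleaned = name.strip()
    return lookup.get(cleaned, cleaned)


variant_to_canonical = invert(canonical_to_variants)
print(normalise("Example Pr.", variant_to_canonical))
print(normalise("Unknown House", variant_to_canonical))
```

The key/list form is probably easier to maintain by hand, while the key/value form is what a lookup during ingest would use, so keeping one and deriving the other avoids the two drifting apart.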

ostephens commented 9 years ago

Thanks Richard,

After I've done the clean up by hand I can give this back to you as a mapping between canonical/non-canonical forms. I hope I'll be able to do this by the end of the week.

JiscMonitor / allapc

Publisher name canonicalisation #7