Open richard-jones opened 9 years ago
If someone created a dump of all publisher names across the aggregation, I'd be happy to take that and create a mapping to canonical forms, if that would help.
Ok, that would be cool. I have a big pile of data to import into the system soonish, and when I've done that I'll do a report on all of the unique publisher names in the system and pass it over!
Great. If you could include ISSNs in the dump (so one line per ISSN/publisher name variation combo), that would be ideal, as it would help ensure I normalise the names in the same way for the same title.
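A dump in that shape could be produced along these lines. This is only a sketch: the record layout (a `publisher` string and an `issn` list per record) and the function name `dump_issn_publisher_pairs` are assumptions for illustration, not the aggregation's actual data model, and the sample values are made up.

```python
import csv
import io


def dump_issn_publisher_pairs(records, out):
    """Write one CSV row per distinct (ISSN, publisher string) pair."""
    pairs = set()
    for rec in records:
        publisher = (rec.get("publisher") or "").strip()
        if not publisher:
            continue
        # Records without an ISSN still get a row, with an empty ISSN column.
        for issn in rec.get("issn") or [""]:
            pairs.add((issn, publisher))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["issn", "publisher"])
    writer.writerows(sorted(pairs))
    return len(pairs)


# Made-up sample records (hypothetical field names and values).
records = [
    {"issn": ["0000-0001"], "publisher": "Example Press"},
    {"issn": ["0000-0001"], "publisher": "Example Press Ltd"},
    {"issn": ["0000-0001"], "publisher": "Example Press"},
    {"issn": [], "publisher": "Other Books"},
]
buf = io.StringIO()
count = dump_issn_publisher_pairs(records, buf)
print(buf.getvalue())
```

Exact duplicates collapse into a single row, so each remaining line is one spelling of a publisher name seen against one ISSN.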
FYI I'm doing some normalisation before passing the files on to Richard, but only within an individual sheet rather than between sheets.
Thanks Stuart. Do you have a list of the 'preferred' names you use? Or do you just normalise to the most common one within a sheet?
Normalise to the most common one within a sheet, and fix spelling mistakes!
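For illustration, "most common one within a sheet" could be automated roughly as follows. This is a sketch under assumptions: the matching key (lowercase, letters and digits only) is a guess at what counts as the same name, `match_key` and `most_common_forms` are hypothetical helpers, and the sample names are made up. It would not catch genuine spelling mistakes, which still need fixing by hand.

```python
import re
from collections import Counter, defaultdict


def match_key(name):
    """Crude key for spotting variants: lowercase, letters and digits only."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def most_common_forms(names):
    """Map each publisher string to the most frequent string sharing its key."""
    groups = defaultdict(Counter)
    for name in names:
        groups[match_key(name)][name] += 1
    return {
        name: counts.most_common(1)[0][0]
        for counts in groups.values()
        for name in counts
    }


# Made-up publisher column from a single sheet.
sheet = [
    "Example Press",
    "Example Press",
    "example press",
    "Example-Press",
    "Other Books",
]
mapping = most_common_forms(sheet)
```

Here `mapping` sends all three spellings of "Example Press" to the form that occurs most often in the sheet and leaves "Other Books" unchanged.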
I've extracted the full list of publishers from all data currently on the aggregation - that's 8662 records.
There are 393 unique publisher strings in this file. Having done a little bit of work on these and examined what we have by eye, I'd give a very rough estimate that there might be closer to 100 publishers represented (I may be way out, but I'm sure it is closer to 200 than 400).
I can clean these up by hand without a huge amount of effort, but while this would be a quick and easy win for the current aggregation, I'd like to look at slightly more systematic options for getting normalised forms of publisher name. So I propose:
1) I clean up by hand, and we push these values back into the aggregation for our demonstrator
2) I investigate how we might systematise clean up of publisher name and report back
My first thought for approaching (2) is:
a) As most records don't have ISSNs but do have DOIs, use the DOIs to get the ISSNs via CrossRef
b) If this works, then use the ISSNs to look up publisher names from another source (probably either GOKb or Sherpa)
We can then compare what we get to my hand crafted cleanup and see if we've achieved a roughly similar outcome
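Steps (a) and (b) might be sketched like this. The Crossref REST API returns a work record at `https://api.crossref.org/works/{doi}` with an `ISSN` list under `message`. Everything else here is an assumption: the function names are hypothetical, and `issn_to_publisher` is a stand-in dictionary for whatever ISSN-to-publisher data is eventually taken from GOKb or Sherpa, since no lookup against either has been worked out yet.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref(doi):
    """Fetch the Crossref work record for a DOI (makes a network call)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def issns_for_doi(doi, fetch=fetch_crossref):
    """Step (a): return the ISSNs Crossref lists for the DOI, if any."""
    message = fetch(doi).get("message", {})
    return message.get("ISSN", [])


def publisher_for_doi(doi, issn_to_publisher, fetch=fetch_crossref):
    """Step (b): look up a publisher name via the first ISSN found in the table."""
    for issn in issns_for_doi(doi, fetch):
        if issn in issn_to_publisher:
            return issn_to_publisher[issn]
    return None


# Offline demonstration with a stubbed response and made-up identifiers.
def fake_fetch(doi):
    return {"message": {"ISSN": ["0000-0001"]}}


table = {"0000-0001": "Example Press"}  # hypothetical ISSN-to-publisher data
result = publisher_for_doi("10.0000/example", table, fetch=fake_fetch)
```

Passing `fetch` as a parameter keeps the lookup testable without network access; a real run would leave it at the default and should be throttled to stay polite to the Crossref API.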
I'd originally wondered if we could just look up the publisher strings against an authority file with lots of variant publisher names in it (GOKb has published such a file). However, looking at the data we have, I'm already pretty convinced this isn't going to work: the publisher name strings in our data are just too varied (sometimes incorporating both publisher and imprint, sometimes using the name of the related Learned Society, sometimes using the journal name instead of the publisher).
Any other ideas or comments on this approach very welcome.
If you want to do the name normalisation in some mapping file (possibly expressed in JSON), then I can add this to the functionality of the aggregation, both to normalise during ingest and to run over the dataset periodically. We could then just add to that mapping file over time, and keep/get the data in shape as we go along.
Either of these forms would be fine/easy:
Otherwise, if you want to maintain a spreadsheet, I can write something to load it into the system for use during processing.
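As a rough illustration of how such a mapping file might be applied, both at ingest and in a periodic pass over existing records, here is a minimal sketch. The variant-to-canonical JSON layout is an assumption and may differ from the forms proposed in the thread; the function names, the `publisher` field, and the sample names are likewise hypothetical.

```python
import json

# Assumed layout: each variant spelling maps to its canonical form.
MAPPING_JSON = """
{
  "Example Press Ltd": "Example Press",
  "Example Press (UK)": "Example Press"
}
"""


def load_mapping(text):
    """Parse the variant-to-canonical mapping from JSON text."""
    return json.loads(text)


def normalise_publisher(name, mapping):
    """Collapse stray whitespace, then apply the mapping if the name is listed."""
    cleaned = " ".join(name.split())
    return mapping.get(cleaned, cleaned)


def normalise_records(records, mapping):
    """Normalise the publisher field in place; return how many records changed."""
    changed = 0
    for rec in records:
        old = rec.get("publisher")
        if not old:
            continue
        new = normalise_publisher(old, mapping)
        if new != old:
            rec["publisher"] = new
            changed += 1
    return changed


mapping = load_mapping(MAPPING_JSON)
records = [
    {"publisher": "Example Press Ltd"},
    {"publisher": "Example  Press (UK)"},
    {"publisher": "Other Books"},
]
changed = normalise_records(records, mapping)
```

Names missing from the mapping pass through unchanged, so the file can grow over time without blocking ingest.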
Thanks Richard,
After I've done the clean up by hand I can give this back to you as a mapping between canonical/non-canonical forms. I hope I'll be able to do this by the end of the week.
Look at the list of publishers and identify those which have matching names, and have a process which can normalise them.
This is NOT a solution to the name variants problem, it is a stop-gap for the purposes of the demonstrator project.