Standardize publisher names

dhimmel commented 7 years ago

Use a slug reduction to eliminate casing or punctuation variants of the same publisher.

Add additional publisher name patches as per https://github.com/greenelab/scihub-manuscript/issues/24
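
As a rough illustration of the slug idea, here is a minimal sketch. It assumes a slug means "lowercase ASCII letters and digits joined by hyphens". The helper name `publisher_slug` and the sample variants are hypothetical and are not taken from the repository's code.

```python
import re
import unicodedata


def publisher_slug(name: str) -> str:
    """Reduce a publisher name to a lowercase alphanumeric slug (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks
    normalized = unicodedata.normalize("NFKD", name)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


# Made-up casing and punctuation variants of one publisher
variants = ["Wiley-Blackwell", "Wiley Blackwell", "WILEY BLACKWELL."]
slugs = {publisher_slug(v) for v in variants}
```

Names that share a slug can then be treated as one publisher, with a single preferred spelling chosen per slug.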

dhimmel commented 7 years ago

For convenience, I've converted publishers.tsv to a spreadsheet in https://github.com/dhimmel/scopus/pull/3/commits/4eab1316814e2f4b2a089a89dff92e1d62a944ed: see publishers.xlsx. This contains the list of publishers. Now would be an ideal time to fix any mistakes, which include:

- multiple rows (names) for a single publisher
- misspellings or incorrect names of a publisher

@tamunro would love your eye here. Most of the Sci-Hub analyses will have to be rerun with this new dataset, so now is the time to identify any issues. Would like to get this wrapped up in the next day or two.

tamunro commented 7 years ago

I had a quick run over it, clustering with OpenRefine to give "Clustered publishers". I added an "edited" column (Y or N), currently filtered to the ones I changed: publishers-clustered.xlsx. Here's an excerpt of the redundant ones, sorted: publishers-redundant.xlsx.

Some of these involve subjective decisions, as you can see. Obviously, there will be many other redundancies, but it seems like a very low fraction. I removed "Springer" where it had been inserted.
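
For anyone who wants to reproduce this kind of clustering outside OpenRefine, here is a minimal sketch of key-collision clustering in the spirit of OpenRefine's "fingerprint" method. It is an approximation under my own assumptions, not OpenRefine's code, and the thread does not say which clustering method was actually used. The function names and sample names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(name: str) -> str:
    """Order-insensitive key: ASCII-fold, lowercase, strip punctuation, sort unique tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", ascii_name.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(names):
    """Group names whose fingerprints collide; keep only groups with more than one member."""
    groups = defaultdict(list)
    for name in names:
        groups[fingerprint(name)].append(name)
    return [members for members in groups.values() if len(members) > 1]


# Made-up input to show the behaviour
names = [
    "Oxford University Press",
    "Oxford University Press.",
    "University Press, Oxford",
    "Elsevier",
]
clusters = cluster(names)
```

Each cluster still needs a human decision about which spelling to keep, which is where the subjective calls come in.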

dhimmel commented 7 years ago

Thanks @tamunro! This is tremendously helpful. Based on this contribution and your previous feedback, we'll make sure to include you as a coauthor on the next version of the manuscript (if this is something you are interested in).

I'll get to work incorporating these patches.

dhimmel commented 7 years ago

Scopus also seem to have done a find and replace that gives us "Human Springer Nature Review", "The International Society for the Study of Religion, Springer Nature & Culture", "Museum d'Histoire Springer Naturelle de Geneve" and "Fundacao O Boticario de Protecao a Springer Natureza" (!)

Ah, I understand now. They replaced Nature with Springer Nature, which munged all publishers containing Nature that were not JUST Nature. Looks like it's corrected online for 15500154703, which correctly lists "Human Nature Review" as the publisher, not "Human Springer Nature Review". @tamunro did you report the issues in https://github.com/greenelab/scihub-manuscript/issues/24 to Scopus? Perhaps this is how it got fixed!
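
A small sketch of the presumed failure mode, assuming Scopus ran something like a plain substring replacement (we don't know what they actually did). The `rename_exact` helper is hypothetical and just shows the whole-name match that would have avoided the problem.

```python
names = [
    "Nature",
    "Human Nature Review",
    "Museum d'Histoire Naturelle de Geneve",
]

# Naive substring replacement corrupts every name that merely contains "Nature"
naive = [name.replace("Nature", "Springer Nature") for name in names]


def rename_exact(name, old="Nature", new="Springer Nature"):
    """Rename only when the entire publisher name equals `old` (hypothetical helper)."""
    return new if name == old else name


safe = [rename_exact(name) for name in names]
```

Here `naive` reproduces the munged names quoted above, while `safe` changes only the publisher that is exactly "Nature".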

tamunro commented 7 years ago

Thanks very much for the authorship offer! That's very generous. Don't finalize it yet - I left a more exhaustive clustering running overnight, and there are some more I'll fix today.

I did try to report the errors to Scopus, but their online help is dire, and I never got a reply. The only time I've heard from them is when I reported missing content on ScienceDirect, and my report got sent to Scopus by mistake. So I pointed out the mistake, and they sent that to Scopus by mistake too. Then I gave up.

dhimmel commented 7 years ago

> Don't finalize it yet - I left a more exhaustive clustering running overnight, and there are some more I'll fix today.

Okay, I did make downstream updates in https://github.com/greenelab/scihub/commit/f8531bee9701e299a8d88d2a4ae654cbf3700239, but will rerun things to incorporate additional publisher corrections. Note that most differences in https://github.com/greenelab/scihub/commit/f8531bee9701e299a8d88d2a4ae654cbf3700239 are from other Scopus improvements (like better ISSN mappings) and not just from the publisher patches.

tamunro commented 7 years ago

Here's a greatly expanded version: publishers-clustered 2017-11-3.xlsx

It turned out to be a shockingly dirty dataset for publishers with few journals. The cleaning could go on forever. So it's probably best to hedge any conclusions about those ones. Also, Scopus might want to know about this. It might be possible to haggle something out of them in return. Do any of you need Scopus access or use of the API?

dhimmel commented 7 years ago

@tamunro added these additional patches in https://github.com/dhimmel/scopus/commit/50171219c2b2261bd2529f8ca354e4ba25725626. Thanks a lot!

> It turned out to be a shockingly dirty dataset for publishers with few journals.

Yeah, it's a mess, but I think our patches will have fixed most of the really atrocious errors.

> Scopus might want to know about this

Sure... it'd be great if they fixed these issues upstream. Unfortunately I don't know their GitHub handle, but you should feel free to inform them. title-attributes.tsv contains a main_publisher column with the corrected publisher names.
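
For anyone consuming the corrected names, a minimal sketch of tallying titles per main_publisher from a tab-separated file. Only the main_publisher column name comes from this thread. The `scopus_id` column name and the "Example Press" rows are made up for illustration, and the real title-attributes.tsv may be laid out differently.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt standing in for title-attributes.tsv
sample_tsv = (
    "scopus_id\tmain_publisher\n"
    "15500154703\tHuman Nature Review\n"
    "00000000001\tExample Press\n"
    "00000000002\tExample Press\n"
)


def count_titles_by_publisher(handle):
    """Count rows per corrected publisher name in a tab-separated file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["main_publisher"] for row in reader)


counts = count_titles_by_publisher(io.StringIO(sample_tsv))
```

With the real file, the `io.StringIO` object would be replaced by an opened file handle.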

> It might be possible to haggle something out of them in return

Nothing is wanted in return, and this repo is in the public domain. They should use it to improve their database's quality.

tamunro commented 7 years ago

Given how dirty the Scopus names are, another possibility that occurred to me would be to take them from Crossref. Crossref's names are much more accurate, each record being created by the publisher itself. Their go-live list gives publishers and their DOI prefixes. There's still some redundancy (e.g. different prefixes for publisher divisions), but it's very uniform and I could fix that very quickly. These are the registrants of a given DOI, rather than the current publisher of a serial. Cabanac did this, so it would allow direct comparison with his results.
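
To make the prefix idea concrete, here is a minimal sketch of mapping a DOI to its registrant through the DOI prefix. The mapping entries are illustrative placeholders rather than rows copied from Crossref's list, the DOIs are made up, and both function names are hypothetical.

```python
def doi_prefix(doi: str) -> str:
    """Return the part of a DOI before the first slash, e.g. '10.1016'."""
    return doi.strip().lower().split("/", 1)[0]


# Illustrative placeholder table; a real one would be built from Crossref's list.
# Two prefixes pointing at one name mimics divisions of the same publisher.
prefix_to_registrant = {
    "10.1016": "Elsevier",
    "10.1007": "Springer Nature",
    "10.1038": "Springer Nature",
}


def registrant_for(doi: str):
    """Look up the registrant for a DOI, or None when the prefix is unknown."""
    return prefix_to_registrant.get(doi_prefix(doi))


example = registrant_for("10.1016/j.example.2017.01.001")
```

Merging the redundant prefixes then amounts to pointing several keys at the same registrant name.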

Alternatively, their title list has 55,000 serials with the publisher names and ISSNs, but not all the DOI prefixes. I presume these are the current publishers, not the registrants. Either way, from searching the lists, they're clearly vastly higher-quality than Scopus's.

dhimmel commented 7 years ago

> Given how dirty the Scopus names are, another possibility that occurred to me would be to take them from crossref.

If we were to redo things, I'd probably switch all journal metadata to Crossref and entirely forgo Scopus. In the past, this was not possible due to https://github.com/CrossRef/rest-api-doc/issues/179. I think we want to hold off on re-architecting the Sci-Hub analysis as much as possible, given that the project is nearing completion.

The point about registrant versus current publisher is interesting. For the Sci-Hub coverage analyses, I'm not sure which one is more appropriate. Probably current publisher. Since there's been lots of consolidation, if we took the registrant, there'd be lots of publishers listed that have since been subsumed by a bigger one.

How did you find out about ftp://ftp.crossref.org/titlelist/titleFile.csv? I'm curious whether Crossref's FTP site hosts other useful files.

tamunro commented 7 years ago

I just found it on their website, googling for something. I know nothing about their FTP site, I'm afraid. Maybe one of the developers could point you at some hidden extras.
