Ask contractor for how bssw.io author names are chosen

bartlettroscoe commented 2 years ago

CC: @bernhold, @rinkug, @markcmiller86

Parent Issue: #1042

Description

We need to understand how the bssw.io backend determines names that are displayed on the site given the names of authors typed into the article vs. the author's GitHub account name.

For example, the article:

https://raw.githubusercontent.com/betterscientificsoftware/bssw.io/master/Articles/Blog/WhyWeNeedStrategiesForWorkingRemotely.md (posted as https://bssw.io/blog_posts/working-remotely-the-exascale-computing-project-ecp-panel-series)

gives the author as [David Rogers](https://github.com/frobnitzem "David Rogers GitHub Profile")

and he has no other articles listed, yet on his contributors page:

https://bssw.io/items?author=rogers

the name is displayed as "David M. Rogers".

If you go to the linked GitHub page (which is the link "GitHub" on that page):

https://github.com/frobnitzem

you can see the name is displayed as "David M. Rogers".

I searched the entire bssw.io github repo with:

$ find Articles/ CuratedContent/ Site/ Events/ -type f -exec grep -li "m[.] rogers" {} \;

and there are no references to "m. rogers" (case insensitive).

Based on that example, it would appear that the bssw.io site generator is getting the name "David M. Rogers" from the GitHub account frobnitzem.

But then there is the bssw.io author "Mark Miller":

https://bssw.io/items?author=miller

that bssw.io names as "Mark Miller" but his GitHub name at:

https://github.com/markcmiller86

is "Mark (he/his) C. Miller".

So where did bssw.io get the name "Mark Miller" from?

Another strange thing is that there is another author, "David H. Rogers", and bssw.io gives his author page:

https://bssw.io/items?author=rogers-679c5f04-7384-41fb-8340-e3e491898d12

with the GitHub user page:

https://github.com/dhrogers

which has the name "David H. Rogers".

So for the bssw.io authors "David M. Rogers" and "David H. Rogers", it looks like bssw.io is getting the name from their linked GitHub accounts. But for the bssw.io author "Mark Miller", it looks like bssw.io is getting the name from somewhere else.

markcmiller86 commented 2 years ago

There is definitely something odd here because I did a global update yesterday to change all instances of Mark Miller to Mark C. Miller and that all worked and, furthermore, they all display that way on the preview site.

I merged that PR yesterday. However, on the live site, I see only Mark Miller.

bernhold commented 2 years ago

@markcmiller86 Just like the preview site, the production site has to be explicitly rebuilt to update it after changes to the main branch in the gh repo. The last rebuild was 8 May. I've triggered another. You should now find that you are possessed of a middle initial. There are 30 articles under Mark C. Miller and a search for "Mark Miller" came up empty.

markcmiller86 commented 2 years ago

the production site has to be explicitly rebuilt to update it

Ah, I guess I either wasn't aware of that or had forgotten. Ok, thanks.

One funny thing I did run into...there was a "Mark Miller" in preview that my sed replace of all content on master didn't catch. That was my name in a pending PR...the spotlight article David Rogers started. I had to go explicitly update that branch to ensure a new "Mark Miller" wouldn't wind up finding its way into master and once again causing ambiguity in which to use.

markcmiller86 commented 2 years ago

to ensure a new "Mark Miller" wouldn't wind up finding its way into master and once again causing ambiguity in which to use.

That makes me wonder...I think random selection among the ambiguous choices should be replaced with majority rule...use the case most frequently encountered and if there is a tie, then randomly pick from the tie. Otherwise, any time just one bad case finds its way into master it potentially corrupts all existing instances.
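A minimal sketch of that majority-rule idea, assuming the generator has already collected every name variant it saw for one author. The helper `pick_display_name` is hypothetical and is not taken from the bssw.io backend.

```python
import random
from collections import Counter

def pick_display_name(variants, rng=random):
    """Return the most frequent name variant; break ties at random."""
    if not variants:
        raise ValueError("no name variants given")
    counts = Counter(variants)
    top = max(counts.values())
    # Every variant that shares the highest count is a tie candidate.
    tied = sorted(name for name, n in counts.items() if n == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)

# One stray "Mark Miller" no longer outvotes 30 consistent bylines.
names = ["Mark C. Miller"] * 30 + ["Mark Miller"]
print(pick_display_name(names))
```

With this rule a single bad byline only matters when the counts are exactly tied.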

bernhold commented 2 years ago

The articles should always have the author name as it was written in the article. The Contributors page is the only place the conflict comes up.

To me, it is not worth the effort (programming or otherwise) to track all of the variations of every name. What might be useful is to institute the same rule for the gh ids as we have for names without gh ids -- every variation is listed and lists its articles. In this way, someone can review the Contributors page and look for author names that should probably be harmonized. That should probably be something we do during the preview process or at least on a schedule (like checking links).
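A rough sketch of what such a review pass could look like, assuming bylines are available as `(gh_id, name, article)` tuples. That tuple layout, the function name, and the article file names are illustrative assumptions only.

```python
from collections import defaultdict

def find_name_variants(bylines):
    """Map each GitHub ID to its name variants when more than one is in use.

    `bylines` is an iterable of (gh_id, name, article) tuples.
    """
    by_id = defaultdict(lambda: defaultdict(list))
    for gh_id, name, article in bylines:
        by_id[gh_id][name].append(article)
    # Keep only the IDs that appear under more than one spelling.
    return {
        gh_id: dict(names)
        for gh_id, names in by_id.items()
        if len(names) > 1
    }

bylines = [
    ("markcmiller86", "Mark C. Miller", "article-a.md"),  # hypothetical files
    ("markcmiller86", "Mark Miller", "article-b.md"),
    ("dhrogers", "David H. Rogers", "article-c.md"),
]
for gh_id, names in find_name_variants(bylines).items():
    print(gh_id, names)
```

Only IDs with more than one spelling are reported, so the output doubles as a to-do list for harmonizing names during preview.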

markcmiller86 commented 2 years ago

The Contributors page is the only place the conflict comes up.

I guess I hadn't understood that and given that, I agree...coding work isn't worth the trouble.

That said, in other contexts I have some experience with, the list of authors is handled as an explicit database on the site and acts as the one true source for any author. That fixes many of these issues but it does demand that each new author gets added to the authors database and that as authors' names change with time, it gets handled correctly.

The idea that our author data is simply free text that is scattered all over our content and which we gather up into a pseudo-database when we gen the contributors page seems a bit informal and fragile...at least for information as precious as authoring.

bernhold commented 2 years ago

I agree. And that's why when we started, we used GitHub as the one true source. But that has problems, as we've discovered. Not everyone has a gh account, and people put things in the name fields of their gh profiles that aren't necessarily what they'd want to put on another website or publication.

These days, the best possible answer is to get everyone's ORCIDs. How many people know their ORCIDs? And how many people even have them? (ORNL made us get them some years back, but it is not clear to me that they're requiring that of new employees.)

markcmiller86 commented 2 years ago

How many people know their ORCIDs?

I only got an ORCID as a result of my publishing the article with Mary Ann Leung in CiSE. Otherwise, I wouldn't have one and didn't as recently as a year ago.

I don't use LinkedIn or Facebook either. My home institution does provide something of a profile page which might be fine apart from GitHub. I dunno if most institutions do that?

Why do we need to require all authors use the same id profile resource? Why can't they all just use a URL...some can be GitHub, some can be ORCID, some can be LinkedIn...we don't care just as long as they give us a URL and how they want their name displayed.

bartlettroscoe commented 2 years ago

That said, in other contexts I have some experience with, the list of authors is handled as an explicit database on the site and acts as the one true source for any author.

Yes, that is what we should do. And we need a GitHub Actions check that makes sure that every author listed in an article matches their entry in the table. They can use anything they want for a unique identifier (e.g. GitHub page, ORCID, LinkedIn, Facebook, etc.) but that identifier (or URL) would need to be used with their name in all of their articles. Then we will see in their PR if there was a problem. This is how you would do things if you were thinking about bssw.io as a software project and using modern Agile SQA practices.

bernhold commented 2 years ago

Why do we need to require all authors use the same id profile resource? Why can't they all just use a URL...some can be GitHub, some can be ORCID, some can be LinkedIn...we don't care just as long as they give us a URL and how they want their name displayed.

Yes, we've discussed that, but haven't gotten as far as commissioning the work. And you still can't realistically count on that URL to be stable and persistent. If Mark gives us his LLNL profile page, it is presumably tied to his employment with LLNL, for example. Sure, once he gets his ORNL profile :-) he could change all of his articles over... Plus there's always going to be some resister who doesn't have, doesn't want to give, or ignores all requests for a "profile page of their choice".

bernhold commented 2 years ago

Yes, that is what we should do. And we need a GitHub Actions check that makes sure that every author listed in an article matches their entry in the table. They can use anything they want for a unique identifier (e.g. GitHub page, ORCID, LinkedIn, Facebook, etc.) but that identifier (or URL) would need to be used with their name in all of their articles. Then we will see in their PR if there was a problem. This is how you would do things if you were thinking about bssw.io as a software project and using modern Agile SQA practices.

I like the idea in principle. But it would be a lot of work to implement. And the problems it addresses are, so far, pretty rare.

bartlettroscoe commented 2 years ago

And you still can't realistically count on that URL to be stable and persistent.

That is the author's business. If they don't want to provide a persistent URL to plug themselves, then why should we care?

Plus there's always going to be some resister who doesn't have, doesn't want to give, or ignores all requests for a "profile page of their choice".

How realistic is that? I think we would be weeding out almost no one.

But it would be a lot of work to implement.

Really? Why? Given access to what the code currently does, I think I could implement that logic pretty fast. The code can already parse the author name and URL out of the article's *.md file. You just need to assert that author URL against one in the DB of authors and throw an error if it does not match. And if you find the URL, you get the author's other info from that line in the table.
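As a sketch of that check, assuming author links use the markdown form quoted at the top of this issue and that the authors table maps profile URL to registered name. The table contents, the regular expression, and `check_byline` are all illustrative and not the site's code.

```python
import re

# Matches [Name](URL) with an optional "title" after the URL.
AUTHOR_LINK = re.compile(r'\[([^\]]+)\]\((\S+?)(?:\s+"[^"]*")?\)')

# Hypothetical authors table: profile URL -> registered display name.
AUTHORS = {
    "https://github.com/frobnitzem": "David M. Rogers",
    "https://github.com/dhrogers": "David H. Rogers",
}

def check_byline(byline, authors=AUTHORS):
    """Return the registered names for every author link in `byline`.

    Raises ValueError when a URL is missing from the table or the name
    typed in the article differs from the registered one.
    """
    found = []
    for name, url in AUTHOR_LINK.findall(byline):
        if url not in authors:
            raise ValueError(f"unknown author URL: {url}")
        if authors[url] != name:
            raise ValueError(
                f"{name!r} does not match registered name {authors[url]!r} for {url}"
            )
        found.append(authors[url])
    return found

byline = '[David Rogers](https://github.com/frobnitzem "David Rogers GitHub Profile")'
try:
    check_byline(byline)
except ValueError as err:
    print("author check failed:", err)
```

Run from a CI job on each PR, a failure like the one above would flag the mismatch before the article is merged.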

For that matter, the way the bssw.io site references the author at:

https://bssw.io/items/contributors

needs some work. For example, "David M. Rogers" gets the nice identifier rogers and page:

https://bssw.io/items?author=rogers

but poor old "David H. Rogers" gets the nasty identifier rogers-660c2efa-eb81-472c-9a0e-522b357680ac and page:

https://bssw.io/items?author=rogers-660c2efa-eb81-472c-9a0e-522b357680ac

Why does "David M. Rogers" get the nice rogers but "David H. Rogers" gets the nasty rogers-660c2efa-eb81-472c-9a0e-522b357680ac? Where is the logic in that? That does not seem fair to "H.".

Why not let the author pick their own bssw.io identifier? For me, I would choose my GitHub user ID bartlettroscoe since that is how I am known almost everywhere (and my Google ID, bartlett.roscoe, is very close). But the bssw.io generator chooses 'bartlett':

https://bssw.io/items?author=bartlett

Why can't I choose bartlettroscoe? The bssw.io author ID should be a field in the authors table that the author can choose (or we can choose for them if this is their first article).

And the problems it addresses are, so far, pretty rare.

The issue is making it explicit how an author's name shows up on the bssw.io contributors page. It has nothing to do with how rare some use case is. (The very fact that 3 Ph.D.s with a combined 50+ years of computing experience are trying to figure out how this works should highlight the problem.) Explicit is better than implicit (I think that one goes way back to the mythical man month, not sure where I first read that one).

bartlettroscoe commented 2 years ago

It seems to me that anything you do to address the author name and ID problem would be an order of magnitude cheaper to implement than anything you would do to try to improve the search feature we are discussing in #1346.

markcmiller86 commented 2 years ago

And you still can't realistically count on that URL to be stable and persistent.

Maybe I was missing a key issue here. I saw the URL (for a profile page) as an optional thing an author may or may not want to provide and/or update with time. But, I had assumed we would internally create and use our own author-ids in some kind of a .yml file (database)...

| bssw-author-id | Display Name | Profile URL |
| --- | --- | --- |
| #mcm86 | Mark C. Miller | https://github.com/markcmiller86 |
| #bartlettroscoe | Ross Bartlett | |
| #dbernhold | D. Bernholdt | |
| #gonsie | gonsie | https://www.linkedin.com/in/gonsie |
| #bobsingleman | Robert F. Singleman | |

The only required column is the bssw-author-id. And, if an author here wishes to be anonymous (should we even allow that) or otherwise known only by that id, that is what we see as their name in anything they publish. Otherwise, we see their Display Name assuming they provide that. If they provide a URL, have their name hyper link to that URL.
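Those fallback rules could be sketched roughly as follows. The `Author` record and `render_author` are hypothetical names, the table is held in memory rather than read from a .yml file, and `#anon` is a made-up id for the anonymous case.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Author:
    author_id: str                      # the only required field
    display_name: Optional[str] = None
    profile_url: Optional[str] = None

def render_author(author):
    """Render an author as markdown: name (or id), linked if a URL is given."""
    text = author.display_name or author.author_id
    if author.profile_url:
        return f"[{text}]({author.profile_url})"
    return text

AUTHORS = {a.author_id: a for a in [
    Author("#mcm86", "Mark C. Miller", "https://github.com/markcmiller86"),
    Author("#dbernhold", "D. Bernholdt"),
    Author("#anon"),  # hypothetical author known only by id
]}

for author in AUTHORS.values():
    print(render_author(author))
```

An entry with only an id renders as that id, one with a name but no URL renders as plain text, and one with both renders as a link.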

All markdown content then does not ever use free-text for author names and instead uses the bssw-author-id. That's true when formally listing authors for an article or when simply referring to them, by name, within the body of an article.

This is what I meant, at least, when I suggested having our own one-true-source database.

All that said, regarding what problems this truly solves and whether it's worthwhile to put in the effort...I think author names are important enough that having some rigor and formality with which we handle them is an important service any organization serious about publishing original works would want to provide.

Does our current free-text-for-all approach manifest any issues...I kinda think it does...

- Error checking names is not even possible, let alone implemented
- Finding all the works by a given author is not reliable (due to possible ambiguity in naming or typos in names)
- Does disambiguating authors with similar names (the David Rogers of the bssw.io world) work?
- The list of contributors may randomly select the name it displays for a given author with multiple different instances

But, maybe we're not at that level of need yet either.

bartlettroscoe commented 2 years ago

Okay, to implement the above approach, you would have to edit every existing *.md file and replace URL with <bssw-author-id>.
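A one-time migration along those lines might look roughly like this, reading "replace URL" as swapping the whole author link for the id, in line with the proposal above. The id `#dmrogers`, the mapping, and `links_to_ids` are made up for illustration.

```python
import re

# Matches [Name](URL) with an optional "title" after the URL.
AUTHOR_LINK = re.compile(r'\[([^\]]+)\]\((\S+?)(?:\s+"[^"]*")?\)')

def links_to_ids(text, url_to_id):
    """Replace each link whose URL is in `url_to_id` with its author id.

    Links to any other URL are left untouched.
    """
    def swap(match):
        return url_to_id.get(match.group(2), match.group(0))
    return AUTHOR_LINK.sub(swap, text)

# Hypothetical mapping from profile URL to bssw-author-id.
url_to_id = {"https://github.com/frobnitzem": "#dmrogers"}
old = '[David Rogers](https://github.com/frobnitzem "David Rogers GitHub Profile")'
print(links_to_ids(old, url_to_id))
```

Because unknown URLs pass through unchanged, the rewrite can be run repeatedly while the authors table is still being filled in.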

markcmiller86 commented 2 years ago

Okay, to implement the above approach, you would have to edit every existing *.md file and replace URL with <bssw-author-id>.

Yes. But, if we ever wanted to get away from free-text for that, we'd have to bite that bullet some time.

bernhold commented 2 years ago

All that said, regarding what problems this truly solves and whether it's worthwhile to put in the effort...I think author names are important enough that having some rigor and formality with which we handle them is an important service any organization serious about publishing original works would want to provide.

What the author puts in the article is how the name appears when rendered. This wasn't always the case, but it is now. This is no different from a journal. If I submit one article to CiSE with my name as "David E. Bernholdt" they'll publish it that way. If I submit another with my name as "D.E. Bernholdt", they'll publish it that way. It is up to me, as the author, to ensure my name is listed the way I prefer. CiSE isn't going to force me to change one. They may ask me for my ORCID but I don't see them directly listed in the published articles. If CiSE had an author index, it may or may not go through additional effort to connect the two variants of my name. This is the equivalent of our Contributors page. It's the only place where name variations might be evident on our site. And I'll say again, that I do not believe it is a big enough problem to warrant adding a lot of complexity. I realize that others may have different opinions.

markcmiller86 commented 2 years ago

Maybe the answer here is to forgo the Contributors page as it is currently designed and simply list all the unique author names we discover in free-text in our published content. No pictures, no links to profiles, no coalescing of same individual with different free-text for name. The only thing it does is link from the name to the list of articles that match that free-text name. This wouldn't address typos in names editorial board members introduce but would otherwise delegate the whole name question to the authors.

Over time, I could imagine an occasional request from an author that happened to use different free-text for their name to want that fixed and then make the request to EB members to have it done for them. But, I suspect that would be rather rare too. And, we could also tell them we'd accept a PR from them to fix but won't fix ourselves.

bartlettroscoe commented 2 years ago

Maybe the answer here is to forgo the Contributors page as it is currently designed and simply list all the unique author names we discover in free-text in our published content.

Because of the problems with search, the Contributors page is about the most useful page on the site, at least for myself. But the Google query site:bssw.io/items bartlett does just about as well. And actually, Google is even better because I can find all my articles for the fiscal year with this Google query with limiting date ranges.

It is nice having a single list of all the authors, but I think I agree that if we are resource limited (which we are), less is more. (It is better to have 1/2 the features that are really solid than to have more features that have issues.)

bernhold commented 2 years ago

I keep repeating: the problems here are minor. I think you're proposing to throw the baby out with the bathwater.

bernhold commented 2 years ago

I have an explanation of why Contributor names appear as they do currently. It boils down to the following: 1) Apparently, our request to cease using GH profile names did not get conveyed to Parallactic, so when I thought they'd implemented it, they actually had not. This is why we get "David M. Rogers". 2) There is a file Site/About.md which lists the EB and Kasia, including names and GH profile links. The name provided here for Mark overrides the one pulled from the GH profile.

I've asked for the following changes: 1) All author names should be treated the same (regardless of whether a GH ID is provided). Each variant will get its own entry in the Contributors page. It will be our job to catch these and fix them, either by making the name consistent or using the mapping capability in Site/Contributors.md (I've asked for that to be generalized so that it can be used with GH IDs or author names in the first column.) This happens rarely, so I do not see it as a problem. 2) Site/About.md should not impact names anywhere else on the site. We already have the Site/Contributors.md mechanism in place to map things, and we don't need another obscure location coming in to confuse things.

We do not yet have an ETA for these changes.

rinkug commented 2 years ago

We are closing this issue and will track this in parent issue #1042

betterscientificsoftware / bssw.io

Ask contractor for how bssw.io author names are chosen #1344