IQSS / dataverse

Open source research data repository software
http://dataverse.org

Allow splitting of the "Name" field (Author, Contact, Producer…) into "First Name" and "Sur- / Family Name" #6492

Closed BPeuch closed 2 years ago

BPeuch commented 4 years ago

Version: 4.18.1

Hello everybody,

As you know, various metadata elements in Dataverse have a "Name" field: Author, Contact, Producer, Contributor, and Distributor. The watermark ("FamilyName, GivenName or OrganizationName") suggests that the name of either individuals or organizations can be encoded here, which is handy.

However, we fear that letting depositors encode e.g. an author's name in a single "Name" field will generate heterogeneity, as various users could come up with the following forms for the name of a single individual:

  • Jon Irenicus
  • J. Irenicus
  • J. G. Irenicus
  • Irenicus, Jon
  • Irenicus, J.
  • Irenicus J

I think splitting of "Name" into "First Name" and "Sur- / Family Name" would help make the metadata cleaner in the long run.

Granted, however you cut up the metadata fields, you're still going to end up with thousands of John Smiths at the end of the day, so other tools like authority files have to be maintained to keep things perfectly unambiguous, and maybe it's asking a bit too much to have something like that in Dataverse right now.

Still, splitting the "Name" field feels to me like a big step towards tidier metadata.

BPeuch commented 4 years ago

I was informed that there are more general threads for suggestions regarding changes to the standard Dataverse metadata. They're the following:

djbrooke commented 4 years ago

@BPeuch - I don't necessarily disagree with this, but I'd be concerned about the impact on an installation like Harvard Dataverse (or any other longstanding installations) where this would need a huge curation effort post-change to split the names into the new fields. I'm also interested in @jggautier's thoughts about interoperability here.

jggautier commented 4 years ago

The idea that two fields for first name and last name will encourage more homogeneity is a really interesting one. For example, instead of the variations you mentioned, each entered in a single text box...

  1. Jon Irenicus
  2. J. Irenicus
  3. J. G. Irenicus
  4. Irenicus, Jon
  5. Irenicus, J.
  6. Irenicus J

... would two text boxes for first name and last name encourage people to more often enter a name the same way? Dataverse does have watermarks that ask people to enter last name, first name, but people don't always follow that. Considering that name and the 6 variations, I would guess that having two text boxes would produce a little less variation, since the intention of two required text boxes would be harder to overlook or ignore than a watermark (or help/tooltip text):

  1. Irenicus, J. G.
  2. Irenicus, Jon
  3. Irenicus, J.
  4. Irenicus, J

Like you wrote, it won't fully resolve the problem.

There are cases when these fields need to be a single text box, like organization names or to accommodate cultural differences in name order, so I think Dataverse would need to give the depositor a choice of either two text boxes, for last name and first name, or a single name text box. And these could be two metadata fields. Then maybe the metadata of existing datasets wouldn't need to be split. Those existing values would just be added to the single author name metadata field. (And depositors/curators can return to datasets later if they want to switch to the last name, first name field.)

Another reason for being able to more reliably distinguish between first and last names is for better citation generation. Different citation styles want names formatted in different ways. (@adam3smith I think could provide a lot more insight about this :) @BPeuch, is that why you included issue https://github.com/IQSS/dataverse/issues/2297?

More broadly speaking, disambiguating names is the goal of efforts like ORCID, which Dataverse has and I hope will continue improving support for. The problem is humans don't search by ORCID ID (or ScopusID, etc.). I rarely see IDs like this in Harvard Dataverse's search logs. So it's still helpful if the actual names are standardized. That can happen during deposit, by doing something like using authority files to suggest preferred name spellings (can the ORCID database be used as an authority file?), and during curation of existing datasets. For example, @scolapasta suggested a tool or workflow where existing name metadata in a Dataverse repository is normalized against a database of standard names and entity IDs, like an authority file. You'd run the tool/workflow, get back a list of possible variations of the same name, check to make sure they're the same person (maybe by checking affiliation and ORCID ID), confirm that they're the same name and update the metadata with the preferred name. I've read that some repositories have integrations with OpenRefine to do this.
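The first step of the workflow described above (getting back a list of possible variations of the same name) could be sketched roughly as follows. This is only an illustration, not an existing Dataverse tool: `name_key` and `group_variants` are hypothetical helpers, and the heuristics (a comma means "Family, Given"; a trailing single letter is an initial) are assumptions that will misfire on organization names and on naming conventions that do not follow this pattern.

```python
from collections import defaultdict


def name_key(raw: str) -> tuple[str, str]:
    """Reduce a free-text personal name to (family name, first initial).

    Hypothetical heuristic, for illustration only:
    - "Family, Given" when a comma is present;
    - "Family J" when the last token is a single letter;
    - otherwise the last token is taken as the family name.
    """
    raw = raw.strip()
    if "," in raw:
        family, _, given = raw.partition(",")
    else:
        tokens = raw.split()
        if not tokens:
            return ("", "")
        if len(tokens) > 1 and len(tokens[-1].strip(".")) == 1:
            # e.g. "Irenicus J": trailing initial, first token is the family name
            family, given = tokens[0], " ".join(tokens[1:])
        else:
            # e.g. "Jon Irenicus" or "J. G. Irenicus"
            family, given = tokens[-1], " ".join(tokens[:-1])
    initial = given.strip()[:1].lower()
    return (family.strip().lower(), initial)


def group_variants(names: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group names that share a key; each group is a candidate for review."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return dict(groups)


variants = [
    "Jon Irenicus",
    "J. Irenicus",
    "J. G. Irenicus",
    "Irenicus, Jon",
    "Irenicus, J.",
    "Irenicus J",
]
groups = group_variants(variants + ["Smith, John"])
```

Each group is only a list of candidates. As noted above, a curator would still need to confirm, for instance by checking affiliation and ORCID ID, that the entries really refer to the same person before updating the metadata.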

Lastly, for metadata exports that want names as one string, like DDI Codebook and Dublin Core, the values entered in the last name and first name boxes can be concatenated when the export file is created. Other exports, like DataCite, accept family and given names. And this might even help improve the accuracy of the nameType algorithm that Dataverse's OpenAIRE export uses (I think the original algorithm looked for names that were split into first name/last name to determine if an author name is a person or organization).
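To make the concatenation idea concrete, here is a minimal sketch, assuming a depositor fills in either the two split boxes or the single name box. The `AuthorName` class and its method names are hypothetical and do not reflect Dataverse's actual exporter code. The dictionary keys borrow DataCite's element names (`creatorName`, `familyName`, `givenName`), but the dictionary itself is just an illustration, not a complete DataCite record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorName:
    """Hypothetical model: either family/given names or one single name."""
    family_name: Optional[str] = None
    given_name: Optional[str] = None
    single_name: Optional[str] = None  # organizations, or names that do not split

    def as_single_string(self) -> str:
        """For exports that want one string (e.g. DDI Codebook, Dublin Core)."""
        if self.family_name and self.given_name:
            return f"{self.family_name}, {self.given_name}"
        return self.single_name or self.family_name or ""

    def as_split_fields(self) -> dict:
        """For exports that accept family and given names (e.g. DataCite)."""
        fields = {"creatorName": self.as_single_string()}
        if self.family_name and self.given_name:
            fields["familyName"] = self.family_name
            fields["givenName"] = self.given_name
        return fields


person = AuthorName(family_name="Irenicus", given_name="Jon")
organization = AuthorName(single_name="State Archives of Belgium")
```

Existing datasets would only populate `single_name`, so they would keep exporting exactly as before, while newly split names would carry the extra family and given name fields where the export format supports them.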

Are there other interoperability issues you had in mind @djbrooke? (I'll keep thinking, too)

djbrooke commented 4 years ago

Thanks @jggautier !

BPeuch commented 4 years ago

Thank you very much, both of you, for your feedback. Thank you for your keen insights, @jggautier.

@djbrooke Indeed, the challenge of ex post curation work sounds daunting when looking at mature Dataverse repositories like Harvard's. That's one of the reasons why I was thinking of making this an optional feature. Of course, this then raises the issue of interoperability, which @jggautier also mentioned, but indeed concatenation is an easy enough mechanism to anticipate problems in this respect.

Maybe I'm also influenced by my experience at my university's library catalogue department, where a lot of QA is performed by student workers. One of the reasons for this is that my university (like many others, I'm sure) persuaded its researchers and professors to reference their publications in the institutional repository by making it the only place juries will check to assess a researcher's scientific contributions (so they can't just send their CV; it will simply be ignored). It was therefore crucial to make sure that researchers' names were properly recorded in the system and also that, among the 5 or 6 people named Martin Dubois, publication records were linked to the right person.

And indeed @jggautier, the ORCID does help disambiguate in this respect. I'm not too concerned by the fact that regular users do not rely on it because, as I see it, it's a data curator's tool; not so much one for researchers.

I'm afraid you're all too right when it comes to the persistence of heterogeneous data inputs. That is also why we at the State Archives of Belgium are hoping for more UI possibilities, like being able to add an explanatory line (see issue https://github.com/IQSS/dataverse/issues/6476) above a field to very clearly specify to users how we want them to format their content. At the end of the day, however, we know that we will never get 100% correct data and that ex post curation / data cleaning will have to be done sooner or later.

@scolapasta's idea for a Dataverse authority file sounds very interesting. I looked for it in the list of issues but couldn't find it unfortunately.

BPeuch commented 2 years ago

Looking back on this issue, the more I think about it, the more I think linked data + authority files are the way to go. Thousands of platform administrators managing thousands of lists of names and IDs that designate the same entities (individuals, publications...) which happen to navigate and be designated on all those different platforms — that's clearly not the way to go. That is, at least, definitely not the most efficient way to prevent ambiguity.

I think that I used to think in very classical documentary terms back when I suggested this, but now I believe such metadata mincing would just mean more work (more fields to fill out) for few gains at the end of the day.

Though I see the issue is mentioned elsewhere on GitHub, we can also see it has received little support from other Dataverse users, so I suggest we close it. 🔒

pdurbin commented 2 years ago

@BPeuch if you're feeling like you want to close this, please go ahead. 🔒