citation-file-format / citation-file-format

The Citation File Format lets you provide citation metadata for software or datasets in plaintext files that are easy to read by both humans and machines.
http://citation-file-format.github.io
Creative Commons Attribution 4.0 International

The future of required keys in `person` #329

Closed sdruskat closed 6 months ago

sdruskat commented 3 years ago

First thoughts, one of these should be required again perhaps:

sdruskat commented 3 years ago

This would also break "API" and would have to go in a new MAJOR version.

sdruskat commented 3 years ago

As discussed yesterday, we could look into implementing this with one schema per option, branched via `oneOf`, where each branch accepts the remaining keys via `anyOf`.

sdruskat commented 3 years ago

If we decide to do this, this would go in version 2.0.0.

jspaaks commented 3 years ago

It looks like we could use an anyOf on top of required: https://stackoverflow.com/a/31841897
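A minimal sketch of the "`anyOf` on top of `required`" pattern, written as a Python dict rather than a schema file. The candidate keys listed here are illustrative assumptions, not the set the maintainers settled on, and `satisfies_any_of_required` is a hypothetical helper that evaluates only this small subset of JSON Schema; a real project would use a full validator.

```python
# Hypothetical schema fragment for a person object.
# The keys in each branch are illustrative assumptions only.
PERSON_SCHEMA = {
    "type": "object",
    "anyOf": [
        {"required": ["family-names"]},
        {"required": ["given-names"]},
        {"required": ["alias"]},
        {"required": ["orcid"]},
    ],
}


def satisfies_any_of_required(instance, schema):
    """Check only the `anyOf` + `required` part of the schema above.

    The instance passes if at least one branch has all of its
    required keys present.
    """
    return any(
        all(key in instance for key in branch.get("required", []))
        for branch in schema["anyOf"]
    )


# A person with only an alias passes; an empty person does not.
alias_only_ok = satisfies_any_of_required({"alias": "anonymous"}, PERSON_SCHEMA)
empty_ok = satisfies_any_of_required({}, PERSON_SCHEMA)
```

With this shape no single key is mandatory, but a completely empty person object is rejected.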

AdityaSavara commented 2 years ago

I think "anyOf" is probably what is needed. I think less restriction on this is better in terms of widespread usability of cff.

I am not sure what the implications of this thread are, so the following comment might be off topic: My personal view is that it is too restrictive to even require authors. If somebody only wants their software cited, without their name, they should have the option of presenting it to the world that way. If they just want to put "anonymous" or something that should be up to the author to decide. So along that line, it seems that person should be less restrictive.

There is also the issue of cultural norms and incomplete information being available. So I am a little wary of required definitions, since authors' names may not fit into whatever fields we prescribe.

Note that this is different from how the CFF entry is parsed. If an external format, say BibFictional.txt, is going to be generated and requires a value, a "default" value can certainly be put into the BibFictional.txt file, but that is different from the CFF requiring it.

acli commented 2 years ago

I agree with @AdityaSavara about cultural norms. In fact the W3C said the same thing exactly a decade ago. (Not that they didn’t miss anything. They missed Hong Kong names, a system that apparently confused a good number of South American immigration officials; Taiwan seems to officially use the same system.) But I assume if a name can’t be conveniently broken down we can just use the name field.

I understand CFF needs to fit in with existing citation systems, but CFF isn’t even currently truly compatible with APA and Chicago.

AdityaSavara commented 2 years ago

[EDIT: This comment should be skipped due to my having non-comprehensive knowledge of Bibliographic standards. This comment detracted rather than added.]

There are several cultures where the family name is the "first" name. For example, there is a developer named "Phan An", and the family name is "Phan". However, in English they would publish as "Phan An", so I think that citation-file-format should probably switch to "Firstname" "Lastname" rather than family-name and given-name. If such a switch is not made in the future, people from those cultures will need to put their "Family-name" and "Given-name" in the wrong fields in order to match conventional bibliographies.

Wikipedia lists some of these cultures at the below link under "Order of Names". It includes Vietnam and South India (those were the cases I was familiar with): https://en.wikipedia.org/wiki/Surname#Order_of_names

Note: I don't know what the best practices are for this. I assume that middle names are expected to be added into the Firstname field, but I think it would be better to have a separate field of Middlenames. Also, I think in some cultures there may be a space in the lastname, so it is good to have a separate Middlenames field.

acli commented 2 years ago

@AdityaSavara “Phan An” looks Vietnamese. In purely East Asian style names (the classic “CJKV” languages – Chinese/Japanese/Korean/Vietnamese) the family name is first, but in Hong Kong (and apparently Taiwan now) when you have both a Chinese and an English name, the family name ends up in the middle. I suppose this isn’t an issue when you’re citing papers (researchers know to move their family names to the last position), but this might become problematic when you’re, say, citing websites or citing code.

(I believe “Phan An” would still be cited as “Phan, A.” in APA or “Phan A” in Vancouver styles; I don’t think CJKV names get any special treatment in these styles.)

The problem with using first-name + last-name instead of given-names + family-names is exactly the Hong Kong names I described: You can’t handle names where the family name is in the middle. If you keep native name order but have to split it up as first/last anyway you end up with the mixed-up South American versions I mentioned in passing. It also gets really confusing for us Asian Canadians (or Asian Americans etc.) who are used to calling our family names “last names” and suddenly we have to sometimes call them “first names” (by which we always mean our given names, even if it’s last).

acli commented 2 years ago

I suppose what I was trying to say in my long comment is that if we really want to make CFF compatible with all possible citation formats, what we really need is a way (field?) to describe how to reconstruct a full name from the given-names and family-names fields. (For example, if I want my full name in native order as used in my birthplace, I might specify GFG for “get a \S+ from given-names, a \S+ from family-names, then a \S+ from given-names, then join them together by spaces”; or I might specify GFGG for people who don’t use the hyphen in the CJKV part of their names.)

This is just an idea though.
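The pattern idea above can be sketched roughly as follows. This is only an illustration under assumptions: `reconstruct_full_name` is a hypothetical helper, no such pattern field exists in CFF, and the name used is a placeholder rather than anyone in this thread.

```python
def reconstruct_full_name(given_names, family_names, pattern):
    """Rebuild a display name from a pattern such as "GFG".

    Each "G" consumes the next whitespace-separated token (a \\S+ run)
    from given_names, each "F" the next token from family_names.
    The consumed tokens are joined with single spaces.
    """
    pools = {
        "G": iter(given_names.split()),
        "F": iter(family_names.split()),
    }
    parts = []
    for code in pattern:
        if code not in pools:
            raise ValueError(f"unknown pattern code: {code!r}")
        try:
            parts.append(next(pools[code]))
        except StopIteration:
            raise ValueError(f"pattern asks for more {code!r} tokens than available")
    return " ".join(parts)


# Placeholder name: family name in the middle, hyphenated second given name.
hyphenated = reconstruct_full_name("David Tai-Man", "Chan", "GFG")
# Same placeholder without the hyphen needs one more "G".
unhyphenated = reconstruct_full_name("David Tai Man", "Chan", "GFGG")
```

Here the first call yields "David Chan Tai-Man" and the second "David Chan Tai Man", while the stored given-names and family-names stay usable for styles that only need the family name and initials.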

acli commented 2 years ago

ETA: I understand in French formal style (and say normal Hungarian order) they also do family names first, so first-name/last-name meaning literally written first and written last doesn’t work even for European languages. I think we can kill the first-name/last-name idea now.

AdityaSavara commented 2 years ago

[EDIT: This comment should be skipped due to my having non-comprehensive knowledge of Bibliographic standards. This comment detracted rather than added.]

Ok, I think I see your point. Initially, I didn't see anything you've written that makes this a problem: first-name middle-name last-name

However, I now understand your point is that later when this needs to get constructed into a bibliographic citation, there will be problems. There are, it turns out, multiple publishing conventions for bibliographies.

My view is now... (1) It would be ideal if CFF could provide the correct general solution. I think it is good that this discussion is here. (2) This has definitely been discussed by other communities. (3) We don't seem to know the right answer, yet.

I imagined that there must be discussion of this for Vietnamese authors, and I came across this thread: https://sites.google.com/a/uw.edu/vietnamstudiesgroup/discussion-networking/vsg-discussion-list-archives/vsg-discussion-2009/indexing

There must be other discussions about this, existing, in the world. It seems to me that a comprehensive solution would be something like this, with all fields as optional (or maybe at least one field required)

  • first-name
  • middle-name
  • last-name
  • full-name
  • publishing-name-preference
  • other-identifier (like "email: anonymousFrog@gmail.com" or "Prisoner Number: ___"; maybe this is already covered well enough by alias)

My personal view, after seeing everything written here thus far, would be that this is enough for now: first-name, middle-name, last-name, alias, orcid.

acli commented 2 years ago

@AdityaSavara There are a few assumptions here, for example,

Also, the real issue with Vietnamese (and Chinese, and Korean, and Hungarian) names is whether to put a comma after family-names, not whether family-names should be analyzed as first-name or as last-name. The comma is the important bit because it is what indicates inversion.

Maybe I should show a concrete example. I was wary of using my name as an example, but since I already divulged my ORCID in a pull request, my full name is already out there so I might as well use it:

(Bold indicates family-names; this is a French convention. In my birthplace it’s more usual to use either all caps (also used in French) or underlining.)

(If I ever had a married name things get even more complicated because my two family-names could also switch positions.)

As you can see, if I were still there and I published some code (not paper) or somehow wrote a book, I could have used the second (native) form of my name. Under the current CFF schema my full name would be analyzed thus, in both cases:

which is perfectly correct (and would allow us to generate, e.g., a correct APA style reference); the problem is that there’s no way to reconstruct my full name’s native form (e.g., for a Chicago style traditional bibliography) because there is a qualitative difference in the two given-names and this qualitative difference is not being captured.

Getting back to Vietnamese names, if we captured this qualitative difference maybe then we could reconstruct Vietnamese names properly too.

If we switch to a first-name + middle-name + last-name model, how do you propose to analyze my name?

(BTW, I’m not saying the current schema is perfect. The most glaring problem (mentioned in the W3C article, which by the way was written by @r12a) is that in some cultures, family-names might not be a perfect fit.)

AdityaSavara commented 2 years ago

[EDIT: This comment should be skipped due to my having non-comprehensive knowledge of Bibliographic standards. This comment detracted rather than added.]

I don't think I am making most of those assumptions. I am not making assumptions about the meanings of the positions. I think the only assumptions I am making are that (1) ordering is most important in bibliographic references, (2) first-name, middle-name, last-name is sufficiently accommodating for ordering, and is close to culture-neutral, and (3) in the long term an optional field could be added for the semantics, like a field called "name-semantics" or something similar.

If I am correct that in almost all cases only the ordering matters for bibliographic references, then we don't have to worry as much about the semantics. I suspect that in the future abbreviated names will probably just be written out completely, like "Aditya Savara", to reduce the amount of cultural centricity. ORCID is also helpful and is already a field for CFF.

Please note my distinction: I think that the semantics matter for library catalogues, and maybe for how a person prints the works cited of a document, but I don't think they inherently matter for bibliographic entries in software like EndNote.

Ultimately, I think that right now each author has to decide what ordering they want, and then they're stuck with that choice. As far as I know, this is the state of bibliographies today. [with a few exceptions]

I consider the semantics to be a form of meta-data that could be entered in an optional field in some future version of the schema. This optional field could express meanings (and possibly parsing rules) for first-name , middle-name, and last-name. Note that doing so probably requires its own schema or schemas. That's another reason to delay doing so for a later version. This issue of how to add such metadata will hopefully be standardized by some other group by then. It may already be standardized without us knowing about it. In fact, that link that I provided points out that there already is a separate standard for Vietnamese libraries. Probably most cultures have some standard, so in the future those standards about what semantics can be gained from the ordering can be pointed to in the optional field.

r12a commented 2 years ago

If the first-name|middle-name|last-name series is merely intended to indicate which name comes first, second, and third in a row when displayed, why split the names up at all? Would it not be simpler to just allow the person to write their name, as they prefer it to be written?

That would also address situations where people's names don't fit neatly into a 3-part sequence. For example, how would you expect people to fill in the following names:

  • Aditya Pratap Singh Chauhan (givenName-fathersName-surname-casteName)
  • Abu Karim Muhammad al-Jamil ibn Nidal ibn Abdulaziz al-Filistini (Father of Karim, Muhammad (given name), The beautiful, Son of Nidal, Son of Abdulaziz, the Palestinian)
  • 東海林賢蔵 (should never be separated by spaces or commas, etc.)
  • María José Carreño Quiñones
  • José Eduardo Santos Tavares Melo Silva
  • etc.

What's the benefit you hope to accrue in splitting the name up?

If it's to sort on a particular type of name, another strategy could be to ask separately for that item, e.g. "Please list the part of your name we should use to sort it in a list of references:"

jodischneider commented 2 years ago

Names are often inverted in research paper bibliographies, with the sort based on the "last name". Citation manuals of style could be useful in understanding this if desired. Commonly used styles include Chicago, APA, Vancouver, AMA, etc. The Purdue OWL summary https://owl.purdue.edu/owl/research_and_citation/using_research/documents/20191212CitationChart.pdf could be a useful starting point. This does become complex for multipart names, and humans generally take their best guess.

AdityaSavara commented 2 years ago

[EDIT: This comment should be skipped due to my having non-comprehensive knowledge of Bibliographic standards. This comment detracted rather than added.]

The reason for splitting up is that current bibliographic conventions require having "split up" names. So CFF's usefulness would be greatly diminished without splitting up. However, as far as I know, almost all conventions split up on name order rather than name meaning. [EDIT: The statement after "as far as I know" should not be considered as true; I am not knowledgeable on this topic.]

Note: I recognize that the current convention of splitting up and sorting by name order is not culture-neutral, and that many conventions infer name meaning from order. But CFF can't solve that issue in the short term. CFF is not library cataloguing software. That 'infer name meaning from order' issue needs to be solved by the major publishers or librarian associations. That is why I suggest name order now, and later a name-semantics field can be added for metadata, so that CFF can be part of the long-term solution, too.

r12a commented 2 years ago

Thanks for the clarifications.

acli commented 2 years ago

@AdityaSavara It’s untrue that “almost all conventions split up on name order rather than name meaning.” Let me quote directly from CMOS 17, p. 951, 16.77 (“Indexing Chinese names”; okay, I know this is the indexing chapter, but that’s where the bibliography chapter told us to look):

Since the family name precedes the given name in Chinese usage, names are not inverted in the index, and no comma is used. [Emphasis added]

Persons of Chinese ancestry or origin who have adopted the Western practice of giving the family name last are indexed with inversion and a comma [Emphasis added]

They don’t say it (this is my major complaint with the Chicago manual) but we can infer from these that meaning and order are both important. When family-names is not first (e.g., English names), the name is inverted, and a comma is added to flag that inversion.

If we privilege order without considering meaning, we won’t know whether to add a comma to signify that inversion. The result isn’t going to be in correct traditional bibliography format.
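A minimal sketch of the inference above, assuming a hypothetical `family_first` flag that records whether the person writes the family name first. CFF has no such field; the point is only that the comma decision needs both the meaning of each part and the order the person uses.

```python
def chicago_style_entry(family_names, given_names, family_first):
    """Format a name for a traditional (Chicago-like) index entry.

    family_first is a hypothetical flag, not a CFF key: True means the
    person's own usage already puts the family name first.
    """
    if family_first:
        # Already family-name-first: no inversion, so no comma.
        return f"{family_names} {given_names}"
    # Given-name-first usage: invert, and flag the inversion with a comma.
    return f"{family_names}, {given_names}"


native_order = chicago_style_entry("Zhang", "Yi-Chen", family_first=True)
western_order = chicago_style_entry("Zhang", "Yi-Chen", family_first=False)
```

The same two stored values give "Zhang Yi-Chen" in one case and "Zhang, Yi-Chen" in the other, which is why order alone, or meaning alone, is not enough for this style.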

acli commented 2 years ago

@AdityaSavara Let me also quote directly from APA 7, p. 304, 9.45 (“Order of Surname and Given Name”):

Naming practices for the order of given name and surname vary by culture; in some cultures, the given name appears before the surname, whereas in others, the surname appears first. [...] For example, an author may publish as “Zhang Yi-Chen” in China but as “Yi-Chen Zhang” in the United States; in either case, according to APA Style, this author would be listed as “Zhang, Y.-C.,” in the reference list. [Emphasis added]

In other words, in APA style, only meaning is important, order is ignored, but the person writing the bibliography is responsible for extracting meaning from the name.
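As a rough illustration of why the family-names / given-names split is enough for this style, here is a sketch that builds the reference-list form from the two parts. The helper names are hypothetical, and the initialing rule is simplified to "first letter of each hyphen- or space-separated part"; the real APA rules have more cases than this.

```python
def apa_initials(given_names):
    """Reduce given names to initials, keeping hyphens between parts."""
    words = []
    for word in given_names.split():
        pieces = [piece for piece in word.split("-") if piece]
        words.append("-".join(piece[0].upper() + "." for piece in pieces))
    return " ".join(words)


def apa_reference_name(family_names, given_names):
    """Family name first, then initials, regardless of native order."""
    return f"{family_names}, {apa_initials(given_names)}"


# Both "Zhang Yi-Chen" and "Yi-Chen Zhang" map to the same stored parts,
# so both produce the same reference-list form.
apa_name = apa_reference_name("Zhang", "Yi-Chen")
```

For the quoted example this gives "Zhang, Y.-C.", whichever order the author published under.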

acli commented 2 years ago

Again, this is what they say in MLA 8, p. 102, 2.1 (“Names of Authors”, in the bibliography chapter):

The author’s name should be presented last name first in the works-cited list [Emphasis in original]

I don’t want to repeat myself but meaning is very much important in many major bibliographic styles.

acli commented 2 years ago

This is not a style I’m familiar with but I’ll also quote directly from the official NLM guidelines:

Enter surname (family or last name) first for each author/editor [Emphasis added]

Convert given (first) names and middle names to initials for a maximum of two initials following each surname [Emphasis added]

Again, meaning is what matters and order is ignored. If we’re splitting names, we must split on family-names and given-names (what CFF is already doing) because many traditional styles really do require us to extract family-names from the full name.
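The two quoted NLM rules can be sketched in a few lines, again starting from separate family and given names. This is a simplified illustration with a hypothetical helper name; it splits given names on whitespace only and does not attempt the guideline's handling of hyphens, particles, or suffixes.

```python
def nlm_author_name(family_names, given_names, max_initials=2):
    """Surname first, then at most two initials, with no punctuation."""
    initials = "".join(word[0].upper() for word in given_names.split())
    return f"{family_names} {initials[:max_initials]}".strip()


short_name = nlm_author_name("Phan", "An")
capped_name = nlm_author_name("Smith", "Mary Ann Louise")
```

The second call shows the two-initial cap: three given names still yield only "MA" after the surname.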

AdityaSavara commented 2 years ago

@acli I agree that most of the sources you found say what you are saying.

I apologize that my experiences, followed by the sources I then saw in the last few days, either misled me or were misinterpreted by me. I anyway view it as desirable to have semantic meanings, so I view my being incorrect as a positive situation. Thank you for correcting me.

I also apologize for wasting other people's time, in addition to Ambrose's, as I was presenting information that I thought was correct, and that I had even spent time trying to look up in the last few days.

AdityaSavara commented 2 years ago

I withdraw my suggestion of "firstname | lastname". I like the existing "family-names | given-names | alias". Thank you, Ambrose.

Back to the original topic:

I think that none of these fields should be required for a person. We already see other fairly distinct fields for a person in this link, like orcid, website, and email. It seems to me that it should be okay if a CFF has incomplete person definitions. Fewer restrictions are good for cultural inclusivity, and also somewhat reduce the chances of people making schema-breaking records.

kevinmatthes commented 1 year ago

I would like to suggest adding a new required key to the person entity in order to address the difficulties arising from the different conventions outlined in this issue so far: `interpretation`.

This key is intended to be passed to the parsing tools to recommend a certain strategy on how the configured name data should be reconstructed. This would resolve the qualitative differences @acli explained. This reconstruction would then be completely outsourced to the parsing software. The CFF standard could therefore provide suggestions and / or minimum requirements on which conventions should be honoured by the reconstruction algorithms.

Regarding the possible values for this field, I would suggest supporting, firstly, `none` and `legacy`. The first one is intended for the case that, for instance, `name: The Community` is specified, as well as for pure alias-based entries; the order of the information should be kept during the reconstruction. `legacy` refers to the current solution. Secondly, the values could be named after the regions where each scheme is typically applied. The region identifiers should therefore be named as generally as possible. This would benefit users who are new to the CFF standard as well as those who do not frequently work with these conventions, since the information would be provided in a very easily understandable way.

The field should be mandatory in order to prevent, for example, a BibTeX converter splitting `name: The Community` into `author = {{C}ommunity, {T}he}`. When passing `interpretation: none`, the BibTeX converter should rather render `author = {{The Community}}`.
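A converter honouring the proposed key might look roughly like the sketch below. Everything here is an assumption for illustration: the `interpretation` key is only a proposal in this thread, `bibtex_author_field` is a hypothetical function, and only the two values `none` and `legacy` are handled.

```python
def bibtex_author_field(person):
    """Render a BibTeX author field from a CFF-like person mapping.

    `interpretation` is the proposed (not yet existing) key; it is
    treated as mandatory here, as the proposal suggests.
    """
    interpretation = person["interpretation"]
    if interpretation == "none":
        # Double braces make BibTeX treat the value as one literal unit,
        # so it is not split into family and given parts.
        literal = person.get("name") or person.get("alias", "")
        return "author = {{" + literal + "}}"
    if interpretation == "legacy":
        # Current behaviour: family names first, then given names.
        parts = (person.get("family-names", ""), person.get("given-names", ""))
        return "author = {" + ", ".join(p for p in parts if p) + "}"
    raise ValueError(f"unsupported interpretation: {interpretation!r}")


community = bibtex_author_field(
    {"name": "The Community", "interpretation": "none"}
)
individual = bibtex_author_field(
    {"family-names": "Zhang", "given-names": "Yi-Chen", "interpretation": "legacy"}
)
```

Region-based values, if they were ever added, would slot in as further branches, which is where the reconstruction rules discussed earlier in the thread would have to live.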

jspaaks commented 7 months ago

It seems that the original issue (namely that of which existing fields in a Person are minimally required for identification / attribution) has been fixed in the development branch / what will become CFF 1.3.0 (see PR #462). We chose to see this as a bugfix rather than a major release, because we did not intend to have empty Person objects in CFF 1.2.0.

However, much of the discussion above is about how to store different parts of someone's name in CFF in order to

  1. support accuracy in converting to other formats
  2. support abbreviation of certain parts of someone's name
  3. support sorting of someone's name in a list of names, e.g. a list of references in a journal paper
  4. support reordering parts of someone's name

I suggest we move that part of the discussion to a dedicated issue. I have created #513 to that end.

sdruskat commented 6 months ago

Closing this as resolved by merging #462, and moved to #513 for further discussion of outstanding issues.