acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Author name URLs #623

Open mjpost opened 4 years ago

mjpost commented 4 years ago

I can't find where I commented on this, but now that ACL 2020 is collecting links to ORCid, Semantic Scholar, and Anthology pages, I'm reminded that we don't have stable author page names. For example, if another Matt Post comes along, we have to fork the current pages.

I like Semantic Scholar's approach, where for example my page is:

https://www.semanticscholar.org/author/Matt-Post/38842528

I don't know how the integer is selected, but we could use a similar system, say starting at 1, and moving up from there. When there is no ambiguity, the base page would redirect to /1, e.g.,

https://www.aclweb.org/anthology/people/matt-post -> https://www.aclweb.org/anthology/people/matt-post/1

When there is ambiguity, the base page would then be used to hold variants with no assigned ID.
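
A minimal Python sketch of the redirect rule described above. The function name `resolve_author_url` and the `ids_by_slug` mapping are hypothetical illustrations, not part of the Anthology code base.

```python
BASE = "https://www.aclweb.org/anthology/people"


def resolve_author_url(slug, ids_by_slug):
    """Decide what a request to the base page of `slug` should do.

    `ids_by_slug` (hypothetical) maps a name slug to the list of integer
    IDs assigned to the distinct people who share that name.
    Returns an (action, url) pair.
    """
    ids = ids_by_slug.get(slug, [])
    if len(ids) == 1:
        # No ambiguity: the base page redirects to the single person page.
        return ("redirect", f"{BASE}/{slug}/{ids[0]}")
    # Ambiguity (or no assigned ID at all): serve the base page itself,
    # which holds the variants that have no assigned ID.
    return ("serve", f"{BASE}/{slug}")
```

With a single Matt Post the base page redirects to `/1`; once a second one is registered, the base page is served directly.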

davidweichiang commented 4 years ago

Since ACL 2020 intends to share information with future conferences, it may be desirable to commit to making all current author pages stable (so you will always be matt-post and future Matt Posts will have to get a suffix).

To handle ambiguous names, it may help to distinguish between names and people.

So, if there's only one Matt Post, then https://www.aclweb.org/anthology/names/matt-post redirects to https://www.aclweb.org/anthology/people/matt-post. But if a new Matt Post arrives on the scene, he is somehow assigned a new id, say, matt-post-nd. Then https://www.aclweb.org/anthology/names/matt-post becomes a disambiguation page that contains:

davidweichiang commented 4 years ago

I also wonder whether, rather than creating a new system of IDs on top of ORCID, START, DBLP, Semantic Scholar, and Google Scholar, we should adopt one of those existing systems of IDs.

nschneid commented 4 years ago

In principle I like the idea of adopting ORCID since uniquely identifying people is its entire purpose. START usernames are not as clean (I've seen people with multiple START IDs), and the others are automatically mined and therefore subject to error. But what about authors who don't have an ORCID? I suspect in any event we'll need a mixture of external and internal IDs.

akoehn commented 4 years ago

This is not an easy problem. We should definitely not roll our own ID schema, and relying on Semantic Scholar seems too brittle (how long will they / their IDs be around?). ORCID seems to be the best option because it is the only ID explicitly made for this job, and I am very much in favor of future conferences collecting ORCIDs for submissions. However, it is not easy to 1) find the ORCIDs for already existing papers and 2) deal with people without an ORCID.

The proposal by @davidweichiang seems sensible to me (existing author keeps URL on clashes). The /names/ URL would then be linked to from the /people/ pages of people whose name is shared by multiple authors, similar to disambiguation pages on Wikipedia?

And whoever has access to people organizing conferences: please lobby for ORCID, it will make our lives easier in the long run :-)

knmnyn commented 4 years ago

Hi all:

I'd also support ORCID. I had brought this up to TACL and CL before, and then I understood that MIT Press was pursuing this anyway, so the editors on both CL and TACL stopped worrying about it.

I agree with Arne that we should not create our own. This is exactly why ORCID was created, in the same guise as DOIs, and it will survive the demise of any one party (the verdict is not so clear with Semantic Scholar, IMHO). I also agree with Nathan that we definitely need at least an internal system.

I think we should use ORCID as the primary vehicle (and redirect folks to those IDs where possible) but also retain our own author URLs for cases where there are multiple namesakes (matt-post, matt-post-2). If and when an author mints an ORCID and reveals it to us, we permanently forward the existing namesake page to the ORCID (so matt-post gets redirected to 0000-0002-1297-6794) and we don't re-use matt-post again; the next matt-post is matt-post-3.
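
A rough Python sketch of this namesake scheme, assuming an in-memory registry. The class `SlugRegistry` and its methods are hypothetical names made up for illustration; a real implementation would need persistent storage.

```python
class SlugRegistry:
    """Issue namesake slugs (matt-post, matt-post-2, ...) and never reuse them."""

    def __init__(self):
        self._issued = {}   # base slug -> number of slugs ever issued for it
        self._forward = {}  # issued slug -> ORCID it permanently forwards to

    def issue(self, base):
        """Return the next unused slug for `base`."""
        n = self._issued.get(base, 0) + 1
        self._issued[base] = n
        return base if n == 1 else f"{base}-{n}"

    def attach_orcid(self, slug, orcid):
        """Permanently forward an issued slug to the author's ORCID."""
        self._forward[slug] = orcid

    def resolve(self, slug):
        """Return the ORCID a slug forwards to, or the slug itself."""
        return self._forward.get(slug, slug)
```

Because the counter only ever increases, a slug that has been forwarded to an ORCID is never handed out again.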

Cheers,

Min


mjpost commented 4 years ago

It seems ORCID is the way to go, when we have it. It's too bad that the ACL email that went out recently collected pretty much everything except ORCIDs.

knmnyn commented 4 years ago

It’s not too late per se. I think we could encourage Rich Gerber at START to add a field to the global profile to collect ORCID. It’d just not be mandatory at this point.

mjpost commented 4 years ago

It would be good to have that in START, but without it being mandatory, I don't think anyone will fill it out. Though I bet we can triangulate them with all the other information we're getting.

knmnyn commented 4 years ago

Very true. If one of us writes automatic triangulation software, we could have it validate the result by sending an email to the START user.

Cheers,

Min


mjpost commented 4 years ago

A good idea. Though I suspect that many people haven't even registered for an ORCID. I'm fairly trendy with these things and only recently did so myself.

mbollmann commented 4 years ago

Semantic Scholar is the main source of information in the ACL form, and that one asks for ORCID on sign-up, so that could be a place to start.

akoehn commented 4 years ago

@mbollmann I don't understand your suggestion (as in: I do not even know whether you made a suggestion).

In my opinion, we (that is probably @mjpost) should lobby for adding (ideally: required) ORCID fields into future submission processes. Without that step, there will always be additional manual work to perform matching and I don't think that is sustainable in the long run. When ORCIDs are not introduced by conferences, there is little point in introducing them here.

The point of ORCIDs is that they are clean; performing error-prone matching all the time on our end defeats its purpose.

mbollmann commented 4 years ago

I was just pitching the idea that we could seed an initial ORCID database for the Anthology via Semantic Scholar. This does not include manual or error-prone matching because:

Now, I don't know how many people will have both claimed their SS page and added their ORCID, but it could be a start. Of course asking for them directly as part of the submission process should be the way to go in the future, I totally agree with you here, @akoehn.

akoehn commented 4 years ago

Yes, this is true. But: in our database, the ORCID is not a property of the author (as the whole point is that we do not have author entities in our data) but of each individual paper. We would therefore only obtain ORCIDs for ACL 2020 papers, and that with a lot of trouble: we would need to obtain the START username for every author, obtain the START user -> Semantic Scholar mapping, and then query Semantic Scholar.

We would then need to back-propagate this information for every accepted paper. This seems to be quite a bit of work given that we probably will only be able to map a small subset of ACL 2020 submissions that way (not everyone has an ORCID, not everyone has a Semantic Scholar page, not every page is claimed, etc.).

nschneid commented 4 years ago

If the community wants to adopt ORCID then probably the best way is to make it a required part of the START global profile, in time for the ACL 2020 camera-ready deadline (IMO it would be too sudden a change to require it for the submission deadline).

"in our database, the ORCID is not a property of the author (as the whole point is that we do not have author entities in our data) but of each individual paper"

We effectively have (imperfect) author entities through a combination of the author name strings in paper entries and the name_variants database. I assume we'd need to (a) propagate ORCIDs backwards or (b) go with a hybrid strategy that clusters by ORCID where available and continues to use the name_variants system for compatibility with legacy data (or future data from non-ACL/non-START events).
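
A minimal Python sketch of hybrid strategy (b), assuming each authorship is a `(paper_id, name, orcid)` record and that the name_variants data can be viewed as a plain dict from variant spelling to canonical name. Both the record shape and the function name `cluster_authorships` are assumptions for illustration; the paper IDs in the usage are made up.

```python
from collections import defaultdict


def cluster_authorships(authorships, canonical_name):
    """Group authorship records into author clusters.

    authorships: iterable of (paper_id, name, orcid_or_None) tuples.
    canonical_name: dict mapping a name variant to its canonical form
        (a simplified stand-in for the name_variants database).
    """
    clusters = defaultdict(list)
    for paper_id, name, orcid in authorships:
        if orcid:
            # An ORCID, where available, identifies the person directly.
            key = ("orcid", orcid)
        else:
            # Legacy data: fall back to name-based clustering.
            key = ("name", canonical_name.get(name, name))
        clusters[key].append(paper_id)
    return dict(clusters)
```

This sketch keeps ORCID clusters and name clusters separate; merging a name cluster into an ORCID cluster is the back-propagation step and would still need review.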

If we want to be conservative about propagating ORCIDs backward, I suppose it might be possible to obtain START usernames on papers at least for recent major conferences, since START usernames are closer to unique identifiers than the name strings are (though some authors have multiple START profiles). Then these could be mapped to ORCIDs with growing coverage as more people update their global profiles for ACL 2020 and future venues.

We could also email authors on an ad hoc basis to confirm that the Anthology isn't conflating them with other authors. This would allow cleaner back-propagation of ORCIDs.

dowobeha commented 4 years ago

I concur that ORCID is the way to go. I would be in favor of making ORCID mandatory in START.

nschneid commented 4 years ago

Since ACL is a time for planning, I want to revisit this thread. Can we push for mandatory ORCIDs in START, maybe in time for EMNLP camera-ready? (@mjpost, would this require discussion among the ACL Exec?)

Note that START in general (at least for workshops; I don't know about EMNLP) allows listing unregistered users as authors. So I think the policy should be that camera-ready submissions have ORCIDs for ALL authors, and if it is a registered user it would be loaded automatically from the START global profile.
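
If ORCIDs are collected for every author, a cheap sanity check at intake is the check digit that every ORCID iD carries (ISO 7064 MOD 11-2). A small Python sketch follows; the function name `is_valid_orcid` is hypothetical, and the check only catches typos, not a well-formed ORCID that belongs to someone else.

```python
import re

# Four groups of four characters; the last character may be the check digit "X".
_ORCID_RE = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_orcid(orcid):
    """Return True if `orcid` is well-formed and its check digit matches."""
    if not _ORCID_RE.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        # ISO 7064 MOD 11-2: add each digit, then double the running total.
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[-1] == expected
```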

mjpost commented 4 years ago

Good idea. A few thoughts:

I like how author pages are guessable. One idea is to use a single guessable name ID page, e.g., anthology/people/matt-post/. This could serve as a collection place for undisambiguated names, and could also redirect to unambiguous names with IDs, e.g., anthology/people/matt-post/$ORCID. I’m not sure, though, how we would disambiguate people who don’t have ORCIDs or whose ORCIDs we can’t get.

nschneid commented 4 years ago

The simplest step forward might be to say that ORCIDs are attached as an extra field to papers, not Anthology author records directly, though of course any paper with an ORCID would allow us to unambiguously match against existing authors with the same ORCID on other papers (or to infer it's a new author if all existing authors by that name have papers with other ORCIDs).

The id attribute on an author name would continue to be used to disambiguate the Anthology author. Whether id is explicit or not, it would be an error for an Anthology author to have papers under multiple distinct ORCIDs.

Then we could allow manual disambiguation of past authorship by adding the ORCID for the paper. (Maybe there should be a UI for authors to do this themselves: manually verify their past papers. But if not it can be done directly in XML.) Thus any explicit ORCID in the XML would be trustworthy. Papers for which we don't have ORCIDs would continue to be assigned to semiautomatic author pages under the current system. Perhaps the verified/unverified distinction should be exposed to the user.
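
The consistency rule above (an Anthology author must not have papers under multiple distinct ORCIDs) could be checked along these lines. This is a Python sketch under assumptions: the `(author_id, paper_id, orcid)` record shape and the function name `find_orcid_conflicts` are hypothetical, and the IDs in the usage are placeholders rather than real data.

```python
from collections import defaultdict


def find_orcid_conflicts(authorships):
    """Return {author_id: [orcids]} for authors with more than one ORCID.

    authorships: iterable of (author_id, paper_id, orcid_or_None) tuples,
    where author_id is the explicit or implicit Anthology author id.
    """
    orcids = defaultdict(set)
    for author_id, _paper_id, orcid in authorships:
        if orcid is not None:
            # Papers without an ORCID stay under the current system
            # and cannot cause a conflict.
            orcids[author_id].add(orcid)
    return {a: sorted(o) for a, o in orcids.items() if len(o) > 1}
```

An empty result means every explicit ORCID in the data is consistent with the existing author assignment.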

danielgildea commented 3 years ago

How about this:

1) We start including ORCIDs in the id attribute in the XML, when we get them from the venues.
2) Internally, our author ID is either the ORCID (if known) or the slugification of the name.
3) If no id attribute is present in the XML, and the author's name slugifies to the same thing as some other author field with an ORCID, they are considered to be the same person. This would be a change from the current setup, where an error is generated if you use the same name with and without an id attribute.
4) However, if the same slugification appears with more than one id attribute (Yang Liu), then you have to specify the id wherever the name appears (as you do now).

This way we can gradually add ORCIDs for people already in the database, for the vast majority of cases where the name is unambiguous. There will be a few cases where, as ORCIDs come in, we realize that existing names refer to more than one person. At that point, we will have to retroactively disambiguate by hand.

As far as author URLs, I would say stick with anthology/people/matt-post/ except when ambiguous, in which case anthology/people/matt-post-$ORCID/.
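
The four rules above can be sketched in Python as follows. Everything here is an illustration under assumptions: `slugify` is a simplified stand-in, not the Anthology's actual slugification, and `resolve_author_id` and the `ids_by_slug` mapping are hypothetical names.

```python
import re
import unicodedata


def slugify(name):
    """Simplified slug: strip accents, lowercase, join words with hyphens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def resolve_author_id(name, explicit_id, ids_by_slug):
    """Return the internal author ID for one author field.

    explicit_id: value of the id attribute, or None if absent.
    ids_by_slug: dict mapping a slug to the set of explicit ids (ORCIDs)
        seen anywhere in the data for that slug.
    """
    if explicit_id is not None:
        return explicit_id                # rules 1 and 2: a known id wins
    slug = slugify(name)
    known = ids_by_slug.get(slug, set())
    if len(known) == 1:
        return next(iter(known))          # rule 3: same person as the id holder
    if len(known) > 1:
        # Rule 4: an ambiguous name must carry an explicit id everywhere.
        raise ValueError(f"ambiguous name {name!r}: an explicit id is required")
    return slug                           # rule 2: fall back to the slug
```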

nschneid commented 3 years ago

Would this mean overloading the id field to be sometimes ORCID (in new data), sometimes current ID for different papers from the same individual? I worry that this would be confusing for users of the data, who would expect different explicit id values to refer to different individuals. Might be better to have a separate orcid field.

danielgildea commented 3 years ago

I was imagining that we would replace the current IDs with ORCIDs when we find out the ORCIDs.

nschneid commented 3 years ago

Would these be manually reviewed? Just want to be sure new sources of noise are distinguished from authoritative pieces of metadata.

danielgildea commented 3 years ago

Yes.

mjpost commented 3 years ago

I like this, but what about the minor change of using people/matt-post/$ORCID/ as the author URL instead? This lets us easily identify all authors with a single SLUG and create a disambiguation landing page, and also follows conventions used by other services, e.g., my page on Semantic Scholar.

One other thing this addresses: for authors we disambiguate manually, we can keep their ID that we choose for them. Should we ever get an ORCID for them, we can easily create a link to that as their canonical author page, so as to create link permanence.
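
A short Python sketch of this URL layout, assuming authors are available as `(slug, author_id)` pairs where the ID is either an ORCID or a manually chosen ID. The helper names `author_url` and `disambiguation_index` are hypothetical, as is the second ID in the usage.

```python
from collections import defaultdict


def author_url(slug, author_id):
    """Build an author page path of the form people/<slug>/<id>/."""
    return f"anthology/people/{slug}/{author_id}/"


def disambiguation_index(authors):
    """Map each slug to the author pages that share it.

    authors: iterable of (slug, author_id) pairs. The list for a slug is
    what a landing page at people/<slug>/ could display.
    """
    pages = defaultdict(list)
    for slug, author_id in authors:
        pages[slug].append(author_url(slug, author_id))
    return {slug: sorted(urls) for slug, urls in pages.items()}
```

Since every page for a name lives under the same slug prefix, collecting the entries for a landing page is a single grouping step.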