acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

citations missing in Google Scholar #434

Closed bekou closed 4 years ago

bekou commented 5 years ago

Hi,

I am new to the NLP community, so I don't know whether this is a known issue. I think Google Scholar is missing a few citations to my work. I don't know whether the cause is that the PDF documents are not parsed correctly, but the problem is there. For instance, check this document here, which lists only 4 citations. Below are some documents (most of them using the ACL style file) that are not included as citations:

1) https://www.aclweb.org/anthology/papers/N/N19/N19-5001/
2) https://arxiv.org/pdf/1905.07458v1.pdf
3) https://arxiv.org/abs/1906.07544
4) https://arxiv.org/abs/1905.05044
5) https://www.aclweb.org/anthology/papers/N/N19/N19-1081/

Specifically, from the last document (i.e., the 5th, of which I am also a co-author), none of the citations to my work have been added. Is this normal?

I also have this issue with other documents, but most of the missed citations come from ACL-style documents.

Best, Giannis

mbollmann commented 5 years ago

The ACL papers you linked are NAACL 2019 papers which have only been online for less than a month, and it looks like Google Scholar hasn't even indexed them yet. The only solution there is to wait, AFAIK.

mjpost commented 5 years ago

It had indeed not been crawled. I asked Google to crawl it via the search console. This should probably be part of the ingestion checklist...

bekou commented 5 years ago

@mbollmann @mjpost Thanks for your prompt replies. However, although the 5th document was published at NAACL, it has been on arXiv since March. Are you aware of any known issues where ACL style files are not parsed correctly by Scholar?

Another example might be the 1st document, which, as far as I can see, exists on Scholar: 1 and 2 receive the citation properly, while my document from EMNLP 2018 does not. Are you aware of this type of issue?

Best, Giannis

mbollmann commented 5 years ago

I don't see how ACL style files could possibly factor into this. The only relevant factor in this case should be the output they produce, and I don't see how our references format is that special or different from others that it could cause issues.

bekou commented 5 years ago

I totally agree with that. To me the documents look totally fine, but since the issue exists (I can find several examples of it), I was just wondering whether this is a known ACL-style issue or a Scholar issue.

annargrs commented 4 years ago

@mbollmann @mjpost As of May 2020, the issue doesn't seem to be resolved.

This does look fishy...

akoehn commented 4 years ago

> This does look fishy...

But unfortunately, it seems to be a problem with the way Google Scholar indexes papers and links citations, not with the Anthology. I say unfortunately because otherwise there could be a way for us to fix it.

annargrs commented 4 years ago

I've just posted this on Twitter, and Fernando Pereira reported it to the GS team. Hopefully they can do something about it.

Here's the thread, just in case there are any updates or other people report the same issue: https://twitter.com/annargrs/status/1262050827600084993?s=20

mjpost commented 4 years ago

Thanks for drawing this to people's attention, @annargrs. Maybe Fernando's attention can help fix this.

I wonder if an SEO effort might be helpful, for example, lots of academics adding deep Anthology links from their web pages. In general, though, I think it's going to be hard to outrank the arXiv.

mjpost commented 4 years ago

FYI, I looked up @annargrs's paper in the Google Search Console, and it reports that the paper is not in the index:

[image: Google Search Console report]

That appears to be because we declare the version without the slash as canonical. Looking at that page, I see that it's not in the siteindex:

[image: Google Search Console report for the canonical URL]

So maybe the issue is partly due to us being inconsistent in what we call the canonical page.
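The kind of inconsistency described above can be checked mechanically. The following is a minimal sketch, not the Anthology's actual tooling: it assumes a page declares its canonical URL with a `<link rel="canonical">` tag and that the site publishes a standard sitemaps.org XML sitemap. The helper names (`canonical_url`, `sitemap_urls`, `check_page`) and the `example.org` URLs are hypothetical and used only for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels:
            self.canonical = attr.get("href")


def canonical_url(html):
    """Return the canonical URL declared in an HTML page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def sitemap_urls(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def check_page(html, sitemap_xml):
    """Classify how a page's canonical URL relates to the sitemap."""
    canonical = canonical_url(html)
    urls = sitemap_urls(sitemap_xml)
    if canonical is None:
        return "no canonical declared"
    if canonical in urls:
        return "consistent"
    # Same page up to a trailing slash: the mismatch suspected in this thread.
    if canonical.rstrip("/") in {u.rstrip("/") for u in urls}:
        return "trailing-slash mismatch"
    return "canonical not in sitemap"


# Hypothetical example: canonical without the slash, sitemap entry with it.
page = '<html><head><link rel="canonical" href="https://example.org/P19-1001"></head></html>'
sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/P19-1001/</loc></url></urlset>"
)
print(check_page(page, sitemap))  # prints "trailing-slash mismatch"
```

Running a check like this over all generated pages would only flag mismatches between the two declarations; whether such a mismatch actually affects Google's indexing is a separate question.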

emjotde commented 4 years ago

Hi, coming from the old Twitter thread. This year there is something odd about how NLP publications are collecting citations on Google Scholar, not just ACL ones. I am using this issue to document this, even if it is not just ACL-related. It used to be that GS had more citations for me on average, although some papers were under-counted. Now Semantic Scholar is quickly running away with citation counts. They do seem to be proper citations. I am easily disambiguated because my name is likely unique.

A few examples from my author pages (GS: https://scholar.google.com/citations?user=Uh_GH14AAAAJ&hl=en) and (SS: https://www.semanticscholar.org/author/Marcin-Junczys-Dowmunt/1733933?sort=total-citations). Numbers are GS vs. SS. I list the ones with the largest gaps, mostly ACL, but all of them are now under-counted on GS compared to SS.

mjpost commented 4 years ago

We’re getting some progress on Twitter thanks to @annargrs’ tweet and a response from Fernando. Anyone have any idea what this could mean, though?

https://mobile.twitter.com/earnmyturns/status/1271139856266096643

emjotde commented 4 years ago

PDF documents?

nschneid commented 4 years ago

> PDF documents?

Yeah I assume to track citations they have to parse PDF bibliographies. Of course this varies depending on the BibTeX as well as the somewhat-venue-specific stylesheet.

mjpost commented 4 years ago

Oh, of course. Hmm, I didn't realize that had changed in recent years, but I also assumed they would have had a more robust parser for it. I bet it's a huge headache.

akoehn commented 4 years ago

Looks like this is fixed: my citations jumped noticeably and others noticed the same: https://twitter.com/sebgehr/status/1274304855797125120

We can probably (edit: close) this issue.

emjotde commented 4 years ago

Whoa, 30% jump. Nice.

drvenabili commented 4 years ago

> Looks like this is fixed: my citations jumped noticeably and others noticed the same: https://twitter.com/sebgehr/status/1274304855797125120
>
> We can probably (edit: close) this issue.

Came here to say the same thing. Cool!

emjotde commented 4 years ago

I'm kinda curious now. Do we know what happened around mid-2018, which is when I think this started?

mjpost commented 4 years ago

I really don't know. Maybe something to do with the hyperref package? But the thing is, the variance within individual citation styles seems to me to be greater than the variance across years. Not everyone uses the official styles, or gets their BibTeX from the same place, etc. I paged through a few examples from ACL 2017 vs. ACL 2019 and didn't really notice any patterns.