Open jeisner opened 1 year ago
The search is -- unfortunately -- not under our control, but under Google's. The only way for us to fix these issues would be to move to a different search engine, for which we do not have the (personnel) resources.
Well, we do have control over how we configure the Google Custom Search, and the stark discrepancy in results between different sorting options could be a sign that something is misconfigured.
I took a look at the settings but can't see any obvious problem. The "sort by year" option should sort by the <meta name="citation_publication_date"> tag, which all our paper pages should have. I am puzzled why the search would filter results based on a sorting option at all; that doesn't make any sense to me. I can have a look at whether the naming and content of the field are in line with what GCS expects.
That said, I do have a strong interest in replacing the Google search as well as time to do it (as evidenced by e.g. https://github.com/acl-org/acl-anthology/issues/165#issuecomment-1475323177), but unfortunately this isn't as simple as other changes to the site. At the moment, Matt and I are actively investigating whether it's an option to use Semantic Scholar instead.
Replying to @akoehn (sorry, messages crossed):
Right, I know that Google is providing the service.
If this feature can't be made to work, then maybe it should be removed.
But I assume it worked when the feature was added. And Google still seems to support an API for it.
They say that the Anthology could provide them with the metadata for this in any of several ways, for example:
A meta tag of the form <meta name="pubdate" content="20100101"> can be used with a search operator of the form: &sort=metatags-pubdate
Hmm, I see now that the search results for @jeisner's query are almost exclusively PDF files, probably because that query hardly appears in titles/abstracts, but only in the fulltext. So maybe the results disappear when sorting by year because PDF files, by virtue of not being HTML, don't have <meta> tags?
Sounds right. The one that appears does have "Snowdon" in the abstract.
My attempt to sort by @mbollmann's field, https://aclanthology.org/search/?q=snowdon+nun&sort=metatags-citation_publication_date , still displays all of the results. But they are still sorted by relevance, I think. So maybe my &sort attempt is overridden in the back end by something that imposes the Sort by Relevance that is advertised on the page.
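For anyone who wants to try other fields, here is a minimal sketch of how a URL like the one above can be assembled. The helper name `build_search_url` is made up for illustration, and whether the Anthology search page actually honors a `sort` parameter passed this way is exactly what is unclear here.

```python
from urllib.parse import urlencode


def build_search_url(query, sort_field=None,
                     base="https://aclanthology.org/search/"):
    """Build a search URL, optionally adding a metatags-based sort operator.

    `build_search_url` is a hypothetical helper; it only reproduces the URL
    shape used in this thread and does not guarantee the page applies it.
    """
    params = {"q": query}
    if sort_field is not None:
        # Sort operators on meta tags take the form "metatags-<name>".
        params["sort"] = f"metatags-{sort_field}"
    return f"{base}?{urlencode(params)}"


print(build_search_url("snowdon nun", "citation_publication_date"))
```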
Fortunately, Google provides other means for supplying metadata info about the indexed files. This thread gives advice about how to do it for PDFs.
You might also be able to modify the PDF files themselves to add a custom metadata field, or just use the existing Created: and Modified: fields, but it seems safer for various reasons to supply the metadata from outside.
In particular, you can specify PageMap data in the Sitemap.
Thanks @jeisner!
So from that document, what we could try is adding PDF files to the sitemap with meta information like this:
<url>
  <loc>https://aclanthology.org/2022.acl-long.1.pdf</loc>
  <PageMap>
    <DataObject type="metatags">
      <Attribute name="citation_publication_date" value="2022/5"/>
    </DataObject>
  </PageMap>
</url>
This should hopefully add metadata in a way that Google sees as equivalent to the <meta> tag on the landing page.
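A rough sketch of how such entries could be generated programmatically, using only the Python standard library. This is not the code from the commit mentioned below; the function name `pdf_url_entry` is hypothetical, and the PageMap namespace URI is an assumption that should be checked against Google's PageMap documentation.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: namespace URI for PageMap elements inside a sitemap;
# verify against Google's documentation before relying on it.
PAGEMAP_NS = "http://www.google.com/schemas/sitemap-pagemap/1.0"

# Serialize sitemap elements without a prefix, PageMap ones with "pagemap:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("pagemap", PAGEMAP_NS)


def pdf_url_entry(pdf_url, year, month=None):
    """Build one <url> element carrying the publication date as PageMap data."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = pdf_url
    pagemap = ET.SubElement(url, f"{{{PAGEMAP_NS}}}PageMap")
    data = ET.SubElement(pagemap, f"{{{PAGEMAP_NS}}}DataObject", type="metatags")
    # Same "YYYY/M" value format as in the example above.
    date = f"{year}/{month}" if month else str(year)
    ET.SubElement(
        data,
        f"{{{PAGEMAP_NS}}}Attribute",
        name="citation_publication_date",
        value=date,
    )
    return url


entry = pdf_url_entry("https://aclanthology.org/2022.acl-long.1.pdf", 2022, 5)
print(ET.tostring(entry, encoding="unicode"))
```

Declaring the namespaces explicitly is a deliberate choice here: an element that is not part of the plain sitemap schema probably needs its own namespace for the file to validate.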
@mjpost I've tried adding this in f696824e034f1c6e0dddaf3e317345c62ab13d83; before I make a PR, I'd suggest I build the site locally with that sitemap and try submitting the sitemap manually to Google Search to see if it works. I'm thinking I should also wait until after the ACL ingestion is complete to try this.
Ah, my bad. I can't submit a sitemap XML file to the Google Search console, only a URL to a sitemap file. So I guess we'd have to merge the PR first and then see if it worked...
In any case I merged in the new ACL ingestion and checked that the sitemap generates as intended, at least.
Re-opening this until Google has processed the sitemap and we can check results.
Did you manually resubmit it in the Google search console?
Yes, I manually resubmitted the index file, and that caused at least some parts of the sitemap files to be re-read immediately, revealing a namespace error message (see #2615).
One aspect I still dislike about this approach is that the search leads directly to the PDF and not to the canonical page, but that seems to be something we need to accept as long as we do not post-process the results or switch to a different provider. Or can the landing page be set as the canonical URL for the PDF in the sitemap?
That being said, thanks for the pointer, @jeisner! Seems like I was just too used to the search not working as intended.
Sorting by "Year of Publication" now makes PDFs show up for me!
That it doesn't show all of them might be because not all parts of the sitemap have been re-read by Google Search so far; let's wait a bit and see.
Awesome, that's progress! Thanks @mbollmann!
In addition to only 5 of the "about 123" results showing up so far, I notice that their years are 2017, 2022, 2018, 2018, 2017, in that order, so it's not quite reverse chronological -- the first one is out of order. (Maybe there is a bug in the new sitemap data?)
I wonder if the value of the field, which is of the format "YYYY/MM", isn't interpreted as intended by the sorting algorithm. I'd have to play around with a few different queries and check if I can spot any pattern...
EDIT: But I think it's best to give it a day or two to make sure that it's not just Google's database not being fully caught up with the new sitemap yet. I don't know if "sitemap was read" means that changes are reflected instantly on the search as well.
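To illustrate the suspicion about the "YYYY/MM" format, here is a small sketch. It assumes, without any confirmation, that the values might be compared as plain strings, and it shows how they could be normalized to the "YYYYMMDD" shape used in the pubdate example earlier in the thread. The function `normalize_pubdate` and the sample dates are made up for illustration; they are not the actual dates of the five results.

```python
import re


def normalize_pubdate(value):
    """Convert 'YYYY', 'YYYY/M' or 'YYYY/MM' to 'YYYYMMDD'.

    Missing month and day default to 01. Hypothetical helper, not part of
    the Anthology code base.
    """
    match = re.fullmatch(r"(\d{4})(?:/(\d{1,2}))?", value.strip())
    if match is None:
        raise ValueError(f"unexpected date format: {value!r}")
    year = match.group(1)
    month = int(match.group(2) or 1)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {value!r}")
    return f"{year}{month:02d}01"


# Made-up sample values in the sitemap's current format.
dates = ["2017/9", "2022/5", "2018/11", "2018/6", "2017/4"]

# Plain string comparison puts "2018/6" ahead of "2018/11" in descending order.
print(sorted(dates, reverse=True))

# After normalization, string order and chronological order agree.
print(sorted((normalize_pubdate(d) for d in dates), reverse=True))
```

If the sort really is string-based, unpadded months would misorder results within a year, though that alone would not explain a 2017 paper appearing before a 2022 one.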
Confirm that this is a bug report
Problem Description
The following query gets "about 28" hits, including many paper PDFs. But when I select "Sort by Year of Publication," only 1 hit remains. Maybe this means that publication year is missing for many papers? (Should they be shown anyway, perhaps at the end of the listing?)
https://aclanthology.org/search/?q=snowdon+nun
Ok, it does concern those things, but not for a specific paper ...
I'm using Google Chrome 114.0.5735.198 on Linux Mint.