acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Search box #165

Open mjpost opened 5 years ago

mjpost commented 5 years ago

We haven't discussed what to do with the search box when the static site goes live. Currently it's disabled. I suggest that we either

  1. Schlep it off to third-party search per #50, or
  2. Keep the SOLR database at aclanthology.info running, and link to there

I do like our custom search, and we get some more control over the results. So I suppose (2) is semi-dependent on finding someone to maintain it. Thoughts?

CC: @mbollmann @villalbamartin

mbollmann commented 5 years ago

So I don't know anything about Apache Solr, but the way we currently use it seems to be tightly integrated with the Rails application (the instructions use rake commands for Solr indexing). I feel it'd be a huge hassle to keep it in sync with the static website and integrate it in a sensible manner.

I've also never used Google Custom Search before, but from what I gather from the documentation, it seems to be pretty powerful in its ability to customize how the search works exactly, and it should certainly be much easier to integrate into the static website. If we count as a non-profit organization (here's where I reveal my total ignorance of the legal status of the ACL—do we?), we can even get rid of advertisements in the search.

So my suggestion would be to go for the Google solution now, then see if we are satisfied with it or want something tailor-made again in the future, which someone would probably have to re-build and maintain (it should at least be decoupled from the whole Rails thing, and if we kept Solr this would also be a good opportunity to upgrade from Solr 3.5 to Solr 7).

If we do this, @mjpost would have to decide what Google account should be the owner of the Custom Search and create it; anyone can be made admin afterwards, and I can take care of setting it up correctly and integrating it.

mjpost commented 5 years ago

I started this process and added you as an admin. I'm looking into the non-profit stuff, if you want to take care of the site side of things.

villalbamartin commented 5 years ago

I can shed some light on this.

Solr is so tightly integrated with the Rails application because the application was built on top of Blacklight, a template for developing Rails apps with integrated search. That said, we don't need to go through Blacklight if we want custom search - we could go directly through Solr, or even through a different search engine.

I think using Google Custom Search would work as an agile solution, but I've written before about why I think an in-house search would work better in the long term. Searching for papers by author and year, for instance, is something that I find useful and would be gone if we used Google.

I have toyed with the idea of using an XML database directly, since it would allow us to search without having to keep a parallel database - we would just feed it the canonical XMLs and the schema, and the DB would work its magic. But I realize that this would require development time, and that's something we are a bit short on.

At the moment I'm quite busy, as evidenced by my significantly lower participation in issues, but I could hopefully invest some time in this in two or three weeks if there is interest. In the meantime, getting Google up and running seems straightforward and would be an okay solution.

mjpost commented 5 years ago

Thanks, Martín. This sounds good to me: we roll out with a custom Google search, and also build our own solution as time permits. We can then compare them.

Note that a third option is bibsearch, a tool that @davvil and I wrote that could be adapted here as a CGI app. It's quite fast and also based on a custom database. David has indicated interest in continuing to maintain that, and this could remove some redundancy. I also plan to write an Alfred plugin that allows quick search from the OS X desktop.

mbollmann commented 5 years ago

Update: It seems that the customizations I was hoping to do with Google Custom Search are not possible after all.

The documentation for Google CSE has a nice section on rich result snippets that demonstrates customization of search results based on structured data. However, the links explaining how to do this are either 404 or link to pages from CSE v1 which is no longer supported. All further info I could find online also refers to this old v1 search element. In conclusion, it appears that this used to be a thing but is no longer possible with the current CSE v2 (at least not with the free version), even though the docs suggest otherwise.

This makes a custom-built solution more appealing again in the long term.

jeisner commented 5 years ago

I asked about this on another thread:

Ideally it would be possible to submit a query like author:"matt post" beam search (and/or the user-friendly equivalent that uses a drop-down form to specify metadata field restrictions like author:). But that's a bit tricky: if Google's indexing can't be changed, then I guess the implementation would be to submit the modified query "matt post" beam search to Google and then filter the results using the author metadata.

@mbollmann replied there:

I think that's only possible if we use Google's JSON API, so we can query and filter search results via our own custom JavaScript. Unfortunately, that costs $5/1000 queries.

I see. But couldn't we just parse the HTML results?

(I suppose that requires maintenance if the HTML format changes, but maybe that doesn't happen too often. The other issue would be having to retrieve multiple pages of results to get the first page of filtered results.)
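
As a rough illustration of the query rewriting described above, the sketch below splits a raw query into a free-text part (to forward to the search engine) and a set of field restrictions (to apply to the results afterwards). The operator names `author` and `year`, the function names, and the record layout are assumptions made for this example; none of this is an existing Anthology or Google API.

```python
import re

# Hypothetical operator names; the thread only mentions author: and year:.
FIELDS = ("author", "year")

# Matches field:"quoted value" or field:bareword.
_FIELD_RE = re.compile(r'\b(%s):(?:"([^"]*)"|(\S+))' % "|".join(FIELDS))


def split_query(query):
    """Split a raw query into (free_text, field_filters)."""
    filters = {}

    def _collect(match):
        field = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        filters.setdefault(field, []).append(value)
        # Keep the value in the free-text query (quoted if it contains a
        # space) so the underlying engine still searches for it.
        return '"%s"' % value if " " in value else value

    free_text = " ".join(_FIELD_RE.sub(_collect, query).split())
    return free_text, filters


def matches(record, filters):
    """Return True if a metadata record satisfies every field restriction."""
    for field, values in filters.items():
        haystack = str(record.get(field, "")).lower()
        if not all(value.lower() in haystack for value in values):
            return False
    return True


free_text, filters = split_query('author:"matt post" beam search')
# free_text == '"matt post" beam search'
# filters == {'author': ['matt post']}
```

The filtering step uses a plain case-insensitive substring test, which is deliberately simple; a real implementation would need to decide how to handle name variants and diacritics.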

mbollmann commented 5 years ago

I see. But couldn't we just parse the HTML results?

Apart from the question of whether that's against the TOS of Google's Custom Search (which I'm not sure of right now), I don't see how. In contrast to the results from the JSON API, there is no metadata (such as author information) in the HTML results AFAICT.

jeisner commented 5 years ago

Presumably the page title is also the paper title, and so the start of the page title (shown in blue for each search hit) can be used to index into our database to retrieve the other metadata.

On the rare occasion that the hit is consistent with multiple papers, be generous and keep it if any of those papers match the search criteria. If the hit is not consistent with any papers for some reason, be generous and keep it, or else fall back to some kind of fuzzy search (like agrep).

mbollmann commented 5 years ago

There is no database to query anymore though, since the whole site is statically generated now.

jeisner commented 5 years ago

I see. But there is a static bibtex database on the site. Search results are necessarily dynamic: the improved search box would thus have to talk to a process that serves up results by querying Google and filtering those results. When that process started up, it could read the bib files and construct a simple in-memory index (e.g., a hash on the start of the canonicalized title).
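
A minimal sketch of that in-memory index, combined with the "be generous" rule from earlier in the thread, might look as follows. It assumes the bib files have already been parsed into a list of dicts with a `title` key; the 30-character prefix length and all function names are arbitrary choices for illustration.

```python
import re
from collections import defaultdict

PREFIX_LEN = 30  # assumed key length; the thread does not specify one


def canonicalize(title):
    """Lowercase the title and reduce it to letters and digits separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def build_index(papers):
    """Map the first PREFIX_LEN characters of each canonical title to its papers."""
    index = defaultdict(list)
    for paper in papers:
        index[canonicalize(paper["title"])[:PREFIX_LEN]].append(paper)
    return index


def keep_hit(hit_title, index, predicate):
    """Keep a search hit if any candidate paper satisfies the predicate.

    Hits that cannot be mapped to any paper are kept as well, following
    the 'generous' policy suggested in the thread.
    """
    candidates = index.get(canonicalize(hit_title)[:PREFIX_LEN], [])
    if not candidates:
        return True
    return any(predicate(paper) for paper in candidates)


# Hypothetical record, not a real Anthology entry.
papers = [{"title": "Example Paper on Beam Search Decoding for Translation", "year": 2019}]
index = build_index(papers)
print(keep_hit("Example Paper on Beam Search Decoding ...", index, lambda p: p["year"] == 2019))
```

Titles shorter than the prefix length will not match once the search engine appends a site suffix to the page title; under this sketch such hits are simply kept, and a fuzzy fallback would be needed to filter them properly.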

mbollmann commented 5 years ago

True, but do you see any advantages over a fully server-side search solution anymore then? @mjpost suggested a CGI app based on bibsearch above, for example. Once we introduce a server-side component, we might as well go all the way, no?

jeisner commented 5 years ago

I was thinking of two advantages.
(1) Speed and server load. I bet < 5% of the queries will require special handling -- special search operators like author:XXX and year:YYY are only included by power users and are often unnecessary even for them (since just XXX and YYY will work pretty well).
(2) Better free-text search. Google has invested a great deal of work in tuning their relevance ranking and their data structures. The current search box therefore efficiently handles synonymy, morphology, phrases, term prominence, term proximity, and PageRank of the paper (as determined via citations from the whole web and not only from within the Anthology). I wouldn't want to give that up just to add metadata queries. The solution that I was suggesting would call Google on all searches and thus get these benefits on all searches.

jeisner commented 5 years ago

Hmm -- in the upper-right corner of the Google search results, there is a "Sort by:" dropdown that lets you choose either "Relevance" or "Year of Publication". Where did that come from?

There are also tabs to search "Authors," "Events," and "Paper Metadata." I believe this means to restrict the search only to certain kinds of pages on the site. However, it took me a few minutes to come up with that theory. I fear that these tabs might be misinterpreted as saying "please list the authors / events of all the papers you just found." That would be useful but is not what the tabs currently do. For example, a full-text search on "puns" might find that the most relevant papers are by Jo Bloggs, but she won't be on the author tab unless her author page contains the word "puns" (e.g., in a paper title).

mbollmann commented 5 years ago

It's the customizations that Google Custom Search lets you do (and that I added). "Year of Publication" sorts by <meta name="citation_publication_date"> on the paper pages, which I thought might be a useful option. The tabs indeed filter by certain parts of the site. I would have liked to customize it even more (as I elaborated on both here and in the general feedback thread), but haven't found a way to do so. Suggestions on how to improve this within the capabilities of GCS are certainly welcome!
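
For reference, this is roughly how the same `<meta>` value could be read outside of GCS, for example by a custom search backend. It is only a sketch using Python's standard `html.parser`; the sample HTML and the date format in it are made up for the example, and the helper names are hypothetical.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collects the content of <meta name="citation_publication_date">."""

    def __init__(self):
        super().__init__()
        self.date = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "citation_publication_date":
            self.date = attributes.get("content")


def publication_date(html):
    """Return the publication date string from a page, or None if absent."""
    parser = MetaDateParser()
    parser.feed(html)
    return parser.date


# Made-up page snippet; the date format is an assumption.
page = '<html><head><meta name="citation_publication_date" content="2019/07"></head></html>'
print(publication_date(page))
```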

About your previous suggestion, I'll have to think about it a bit more. I can see the advantages of piggybacking on Google, but I'm still not sure it's not too hacky to be maintainable in the long run and/or against Google's TOS. That said, maybe just having a server-side & customized option (e.g. based on bibsearch) alongside the Google one might already give users more options? As I said, I'll have to think about it more.

mbollmann commented 4 years ago

The "Paper Metadata" tab is now broken after the change to a flat directory hierarchy in #513, and I don't see how it can be restored.

In Google Custom Search, we can assign labels to URLs or some very simple URL patterns, which was used to label all URLs beginning with https://aclweb.org/anthology/papers/ as "paper-metadata". However, now that these pages are directly under anthology/, I don't see a way to single out these pages using the functionality provided by GCS. They only support simple wildcards of the type "this string has to appear somewhere in the URL", but nothing sophisticated enough to single out our paper URLs AFAICS.
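
To make the limitation concrete, here is the kind of pattern that would be needed, written as a Python regular expression. This is purely illustrative: it assumes paper pages are named by an ID shaped like `P19-1001` directly under anthology/, and that other pages live under subdirectories such as `people/` or `volumes/`. Both the ID shape and the example URLs are assumptions for this sketch, and GCS offers nothing equivalent.

```python
import re

# ASSUMPTION: a paper ID is one uppercase letter, two digits, a dash, and
# four digits, and the paper page sits directly under anthology/.
PAPER_URL = re.compile(r"^https://aclweb\.org/anthology/[A-Z]\d{2}-\d{4}/?$")


def is_paper_url(url):
    """Return True if the URL looks like a paper page under the assumed scheme."""
    return PAPER_URL.match(url) is not None


# Hypothetical URLs for illustration only.
for url in (
    "https://aclweb.org/anthology/P19-1001/",
    "https://aclweb.org/anthology/people/j/jo-bloggs/",
    "https://aclweb.org/anthology/volumes/P19-1/",
):
    print(url, is_paper_url(url))
```

A substring wildcard cannot express "one path segment with this shape and nothing after it", which is why the label cannot be restored with URL patterns alone.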

I have discussed options to switch to a different search engine with @mjpost, but still have to prepare an overview and a suggestion. Once I do, I will open a separate issue to discuss this, but in the meantime I wanted to note this problem with GCS here.

mbollmann commented 2 years ago

Just cross-referencing various search-related issues to have them in one place.

mbollmann commented 1 year ago

Leaving some notes here regarding options for building our own custom search engine, since I've been pondering search functionality for quite a while now.

Server-side search

In 2020, I built a search engine prototype that used server-side search via Meilisearch. It has been offline and unmaintained for a while now since there was no clear path towards integrating it into the Anthology. However, I think there are some good arguments in favor of picking this idea up again:

In other words, there are now open-source, commercial-grade solutions for both backend and frontend via Meilisearch, which could make this solution a bit of a safer bet now regarding long-term stability and maintainability.

Client-side search

I've also wondered about the feasibility of purely client-side search. There are tons of libraries for that, such as Fuse.js or Lunr.js, and if you try an interactive demo/comparison and generate, say, 100000 titles to search in, it still performs blazingly fast (a few milliseconds per search on my machine).

The biggest bottleneck for this kind of solution, IMO, is getting the data to the user.

Let's take a paper index as an example that contains only Anthology ID, paper title, and author last names. A pre-built search index for Lunr.js comes out at around 32 MB, or 8.7 MB after gzip compression. The unindexed JSON data (which would need to be indexed client-side every time) is 14 MB, or 4 MB compressed.

I'm not sure what's acceptable in terms of data volume for websites to transfer these days, but considering that this will only grow, and doesn't even include abstracts or other metadata yet, I'm a bit skeptical that this is a good way to go.
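
The numbers quoted above work out to a compressed size of roughly 27% (8.7 of 32 MB) and 29% (4 of 14 MB) of the original. For anyone who wants to redo this kind of measurement on a different field selection, a small sketch follows. It uses synthetic placeholder records, so the ratio it prints says nothing about the real Anthology data; the record layout and function name are assumptions.

```python
import gzip
import json


def payload_sizes(records):
    """Return (raw_bytes, gzipped_bytes) for the records serialized as compact JSON."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Synthetic stand-in records; real titles and names would compress differently.
records = [
    {"id": "X00-%04d" % i, "title": "Placeholder title number %d" % i, "authors": ["Bloggs"]}
    for i in range(1000)
]

raw_size, gz_size = payload_sizes(records)
print("raw: %d bytes, gzip: %d bytes (%.0f%% of raw)"
      % (raw_size, gz_size, 100 * gz_size / raw_size))
```

Swapping in the actual exported JSON for `records` would show how much adding abstracts or other metadata costs in transfer size.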

On the plus side, maintaining a purely client-side solution would most definitely be easier.


Anyway, if anyone's still interested in making this happen eventually, I'm happy to hear other people's thoughts as well. @mjpost @akoehn

jeisner commented 1 year ago

Do any of these solutions support dense retrieval? That is, embed queries and passages into a vector space, so that exact word match isn't required. I'm asking because I assume that Google Custom Search must be evolving in this direction.

mbollmann commented 1 year ago

I feel a quick, site-internal metadata search is somewhat complementary to a dense retrieval system like you're describing. I'm really mainly talking about the former here, as I think the current Google Custom Search doesn't fulfill this role very well, and we should probably not build and support our own homegrown solution to the latter.

Note that exact word match isn't required for most of these search solutions I'm talking about either. They typically employ some form of fast, fuzzy matching. (Although of course that doesn't handle synonymy etc., if that's what you were thinking.)
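
To show what "fuzzy" means here in the simplest possible terms, the sketch below uses `difflib` from the Python standard library to tolerate typos in a title query. This is not the algorithm used by Meilisearch, Fuse.js, or Lunr.js; it is a stand-in to illustrate the behaviour, and the titles and function name are made up.

```python
import difflib


def fuzzy_search(query, titles, limit=5, cutoff=0.6):
    """Return up to `limit` titles whose similarity to the query is at least `cutoff`."""
    by_lowercase = {title.lower(): title for title in titles}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lowercase), n=limit, cutoff=cutoff
    )
    return [by_lowercase[hit] for hit in hits]


# Made-up titles for illustration.
titles = ["Neural Machine Translation", "Dependency Parsing", "Beam Search Decoding"]
print(fuzzy_search("neural machne translaton", titles))
```

Because `difflib` compares whole strings, a short query against a long title scores poorly; dedicated search libraries match per term and handle prefixes, which is one reason they are a better fit than this sketch.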

I have wondered if it's an option to collaborate with Semantic Scholar somehow. Their API already exposes metadata about whether a paper belongs to the ACL Anthology, but last I checked they didn't support search queries that filtered based on this.