Seeking a good search engine for PhysioNet

The current PhysioNet search function is not great (previous issues: #349, #1971). We would like to replace it with something based on a "real" information-retrieval engine, while also allowing more powerful and user-friendly queries.

There are a few options, and in this issue I'll try to list the advantages and disadvantages of each.

Requirements:

Free and open-source software
Reasonable security support

Good to have:

Django integration - Haystack (https://haystacksearch.org/), for example, makes it easy to index and search objects in the Django ORM
Language support - PhysioNet only publishes projects written in English, but we would like the platform to be international
Exact word searching - ability to search for a term without stemming or synonyms (often written +foo or "foo")
Phrase searching - ability to search for an exact phrase ("foo bar")
Range queries - e.g. "projects published between 2021-06-01 and 2021-09-01"
Faceting - e.g. "list the distinct authors of matching projects and the number of matching projects for each author"
Collapsing - e.g. "search for published projects matching the query, then list distinct core projects ordered by relevance"
Synonyms - e.g. treating ecg and electrocardiogram as equivalent
User-friendly query parser - if the query parser supports complex syntax, providing diagnostics so you can understand why your query isn't working
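
To make the intended semantics of range queries, faceting, and collapsing concrete, here is a minimal in-memory sketch. It is not tied to any of the engines below; the `Hit` record, its field names, and the sample data are all hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    """One matching published version of a project (hypothetical shape)."""
    core_project: str   # identifier shared by all versions of one project
    version: str
    authors: tuple
    published: date
    score: float        # relevance score, as a search engine might assign it


def in_range(hits, start, end):
    """Range query: keep hits published between start and end, inclusive."""
    return [h for h in hits if start <= h.published <= end]


def facet_by_author(hits):
    """Faceting: count the matching hits for each distinct author."""
    return Counter(a for h in hits for a in h.authors)


def collapse_by_core(hits):
    """Collapsing: keep the best-scoring hit per core project, best first."""
    best = {}
    for h in hits:
        if h.core_project not in best or h.score > best[h.core_project].score:
            best[h.core_project] = h
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up sample data, not real PhysioNet projects.
HITS = [
    Hit("proj-a", "1.0", ("alice", "bob"), date(2021, 5, 1), 0.7),
    Hit("proj-a", "1.1", ("alice", "bob"), date(2021, 7, 1), 0.9),
    Hit("proj-b", "1.0", ("alice",), date(2021, 8, 15), 0.8),
    Hit("proj-c", "2.0", ("carol",), date(2022, 1, 10), 0.6),
]
```

A real engine would do all of this inside the index rather than over a Python list; the sketch only pins down what we would expect each feature to return.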

Some options we might consider:

Xapian
Whoosh
Solr
OpenSearch
Manticore
PostgreSQL

Xapian (https://xapian.org/)

Implementation language: C++
Latest release: 2023-11-06

Free and open-source software: Yes
Reasonable security support: probably
Django integration: Yes (xapian-haystack)
Language support: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
Exact word searching: Yes
Phrase searching: Yes
Range queries: Yes
Faceting: Yes
Collapsing: Yes
Synonyms: Yes
User-friendly query parser: No

Xapian is implemented in C++, but it's also a well-established package with security support in Debian. It has a Python wrapper which is maintained by the Xapian developers, but is not on PyPI (https://trac.xapian.org/ticket/807). I think the most reasonable option would be to create the virtual environment with --system-site-packages or something equivalent.
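
As a rough sketch of the --system-site-packages approach, the standard-library `venv` module can create such an environment programmatically. This assumes the Xapian bindings have already been installed system-wide (for example from a distribution package); the `create_env` helper name is made up for this example.

```python
import venv
from pathlib import Path


def create_env(path):
    """Create a virtual environment that can also import system-wide packages.

    with_pip=False keeps this sketch fast and free of the ensurepip
    dependency; a real deployment would probably want pip in the environment.
    """
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(path)
    # pyvenv.cfg records whether system site-packages are visible.
    return Path(path) / "pyvenv.cfg"
```

Packages installed into the environment still take precedence over the system ones, so only Xapian (and anything else missing from the environment) would come from the system.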

The query parser supports prefixes for field searches, but if you type a prefix it doesn't understand, it seems to be silently ignored. It's possible to dump the AST but this is not super-friendly.
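
One possible workaround, independent of Xapian itself, would be to check the query for unrecognized prefixes before handing it to the query parser, so we can show the user a warning. The sketch below is an assumption about how that might look: the prefix names in `KNOWN_PREFIXES` are hypothetical, and the regular expression is deliberately simple (for instance, it would flag `http` in a pasted URL).

```python
import re

# Hypothetical set of field prefixes that the index understands.
KNOWN_PREFIXES = {"title", "author", "topic"}

# Matches "word:" at the start of a whitespace-separated token,
# optionally preceded by + or -, and followed by a non-space character.
PREFIX_RE = re.compile(r"(?:^|\s)[+-]?([A-Za-z_]+):(?=\S)")


def unknown_prefixes(query):
    """Return the field prefixes used in the query that are not known."""
    found = PREFIX_RE.findall(query)
    return sorted({p for p in found if p.lower() not in KNOWN_PREFIXES})
```

For example, `unknown_prefixes("autor:moody title:ecg")` returns `["autor"]`, which could be turned into a "did you mean author:?" style message.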

Searching for dates and ranges is possible, but difficult to do correctly.

Whoosh (https://pypi.org/project/Whoosh/)

Implementation language: Python
Latest release: 2016-04-04

Free and open-source software: Yes
Reasonable security support: doubtful
Django integration: Yes (django-haystack)
Language support: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish
Exact word searching: No
Phrase searching: Yes (stemmed only)
Range queries: Yes
Faceting: Yes
Collapsing: No (but parent/child documents might be an alternative)
Synonyms: No
User-friendly query parser: ???

The Whoosh package is pure Python, which makes it slow, but also less likely to have security problems and easy to install from PyPI. However, it also appears to be unmaintained.

It appears to index only stemmed forms and to make no distinction between foo and "foo". This might be the fault of Haystack and not Whoosh itself.
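
To illustrate why an index of stemmed forms alone cannot support exact word searching, here is a toy sketch. It does not use Whoosh or Haystack; `toy_stem` is a crude suffix stripper written for this example, not a real stemming algorithm, and the documents are made up.

```python
def toy_stem(word):
    """Crude suffix stripper, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(docs, exact=False):
    """Map each indexed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            term = word if exact else toy_stem(word)
            index.setdefault(term, set()).add(doc_id)
    return index


# Made-up documents.
DOCS = {1: "sleep study", 2: "sleeping patients", 3: "patient sleeps"}

stemmed = build_index(DOCS)            # what a stemmed-only index holds
exact = build_index(DOCS, exact=True)  # what exact word searching needs
```

In the stemmed index, "sleep", "sleeping", and "sleeps" all collapse into one term, so a query for "sleep" has no way to select only document 1. Answering it requires the unstemmed terms to be indexed as well.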

Searching for dates and ranges is possible, but difficult to do correctly.

Solr (https://solr.apache.org/)

Implementation language: Java
Latest release: 2023-10-15

Free and open-source software: Yes
Reasonable security support: Yes
Django integration: Yes (django-haystack)
Language support: Arabic, Bulgarian, Catalan, CJK, Czech, Danish, German, Greek, Spanish, Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian, Italian, Japanese, Latvian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Thai, Turkish
Exact word searching: No
Phrase searching: Yes (stemmed only)
Range queries: Yes
Faceting: Yes
Collapsing: Yes
Synonyms: Yes
User-friendly query parser: somewhat

Solr is not in Debian; however, it's written in Java, so less likely to have security problems, and it works via an HTTP API so the search engine can run with minimal privileges.

The default query parser will report an error if the input has a syntax error or an unknown field prefix; the "dismax" and "edismax" parsers will not. There's also a debug option that outputs the AST as a string.

Recommendations for "how do I do exact word/phrase searching with Solr" seem to boil down to "define two fields with duplicate data". But there doesn't seem to be a friendly way to handle this with the standard query parsers, and I don't think Haystack supports this directly.
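
If we went the two-field route, one option might be to rewrite the user's query ourselves before sending it to Solr. The following is only a sketch under several assumptions: the field names `text` (stemmed) and `text_exact` (unstemmed) are hypothetical, clauses are simply joined with AND, and none of this goes through Haystack.

```python
import re

# Hypothetical field names: one analyzed with stemming, one without.
STEMMED_FIELD = "text"
EXACT_FIELD = "text_exact"

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')


def quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rewrite(query):
    """Send quoted or +prefixed terms to the exact field, the rest to the stemmed one."""
    clauses = []
    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            clauses.append(f"{EXACT_FIELD}:{quote(phrase)}")
        elif word.startswith("+") and len(word) > 1:
            clauses.append(f"{EXACT_FIELD}:{quote(word[1:])}")
        else:
            clauses.append(f"{STEMMED_FIELD}:{quote(word)}")
    return " AND ".join(clauses)
```

With this, `ecg "heart rate" +sleep` becomes `text:"ecg" AND text_exact:"heart rate" AND text_exact:"sleep"`. The downside is that we would be maintaining our own mini query language on top of Solr's, including its error reporting.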

Searching for dates and ranges is possible, but difficult to do correctly.
