MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Seeking a good search engine for PhysioNet #2180

Open bemoody opened 10 months ago

bemoody commented 10 months ago

The current PhysioNet search function is not great (previous issues: #349, #1971). We would like to replace it with something based on a "real" information-retrieval engine, while also allowing more powerful and user-friendly queries.

There are a few options, and in this issue I'll try to list the advantages and disadvantages of each.

Requirements:

Good to have:

Some options we might consider:

bemoody commented 10 months ago

Xapian (https://xapian.org/)

Implementation language: C++
Latest release: 2023-11-06

Xapian is implemented in C++, but it's also a well-established package with security support in Debian. It has a Python wrapper which is maintained by the Xapian developers, but which is not on PyPI (https://trac.xapian.org/ticket/807). The most reasonable option, I think, would be to install the wrapper system-wide and create the virtualenv with --system-site-packages or something equivalent.
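For reference, a minimal sketch of the --system-site-packages approach using the standard-library venv module. It assumes the Xapian bindings are already installed system-wide (for example from a distribution package); the function name and the target directory are hypothetical.

```python
import venv
from pathlib import Path


def create_env_with_system_packages(env_dir):
    """Create a virtual environment that can also import system-wide packages.

    Roughly equivalent to:
        python3 -m venv --system-site-packages --without-pip ENV_DIR
    """
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(env_dir)
    # pyvenv.cfg records whether system site-packages are visible.
    return Path(env_dir) / "pyvenv.cfg"
```

Packages installed inside the environment still take precedence over system ones, so only the modules that are missing from PyPI would come from the system.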

The query parser supports prefixes for field searches, but if you type a prefix it doesn't recognize, the prefix seems to be silently ignored. It's possible to dump the AST of the parsed query, but this is not super-friendly.
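One way to work around silently ignored prefixes, whatever engine is chosen, would be to check the query text before handing it to the parser. The following is only a sketch in plain Python, not part of Xapian: the set of known prefixes and the function name are hypothetical, and the pattern is deliberately simple (it would, for instance, flag the "http" in a bare URL).

```python
import re

# Hypothetical set of field prefixes the search index knows about.
KNOWN_PREFIXES = {"title", "author", "topic"}

# A prefix is a run of letters at the start of a term, followed by a colon.
# An optional leading + or - (required/excluded term) is allowed.
_PREFIX_RE = re.compile(r"(?<!\S)[+-]?([A-Za-z_]+):")
_QUOTED_RE = re.compile(r'"[^"]*"')


def unknown_prefixes(query, known=KNOWN_PREFIXES):
    """Return the field prefixes in `query` that are not in `known`, in order."""
    # Drop quoted phrases so that colons inside them are not treated as prefixes.
    unquoted = _QUOTED_RE.sub(" ", query)
    found = _PREFIX_RE.findall(unquoted)
    return [prefix for prefix in found if prefix.lower() not in known]
```

The caller could then show a message such as "unknown field: autor" instead of returning results for a query the user did not intend.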

Searching for dates and ranges is possible, but difficult to do correctly.

bemoody commented 10 months ago

Whoosh (https://pypi.org/project/Whoosh/)

Implementation language: Python
Latest release: 2016-04-04

The Whoosh package is pure Python: that makes it slow, but also less likely to have security problems, and it is available on PyPI. However, it appears to be unmaintained.

It appears to index only stemmed forms, making no distinction between foo and "foo" (a bare word versus a quoted, exact one). This might be the fault of Haystack rather than Whoosh itself.

Searching for dates and ranges is possible, but difficult to do correctly.

bemoody commented 10 months ago

Solr (https://solr.apache.org/)

Implementation language: Java
Latest release: 2023-10-15

Solr is not in Debian; however, it's written in Java, so it is less likely to have security problems, and it works via an HTTP API, so the search engine can run with minimal privileges.

The default query parser will report an error if the input has a syntax error or an unknown field prefix; the "dismax" and "edismax" parsers will not. There's also a debug option that outputs the AST as a string.
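To make the parser choice and the debug option concrete, here is a small sketch that only builds the request URL and sends nothing. The q, defType and debugQuery parameters are standard Solr request parameters; the host, the core name "physionet" and the function name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query, parser="lucene", debug=False):
    """Build a Solr /select URL for `query` using the given query parser.

    `parser` is passed as defType: "lucene" (the default parser),
    "dismax" or "edismax".
    """
    params = {"q": query, "defType": parser}
    if debug:
        # Ask Solr to include debugging information about the parsed query.
        params["debugQuery"] = "true"
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"
```

For example, solr_select_url("http://localhost:8983", "physionet", "title:ecg", debug=True) gives a URL that could be fetched with any HTTP client to see how the query was interpreted.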

Recommendations for "how do I do exact word/phrase searching with Solr" seem to boil down to "define two fields with duplicate data". But there doesn't seem to be a friendly way to handle this with the standard query parsers, and I don't think Haystack supports this directly.
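The "two fields with duplicate data" idea can be illustrated with a toy in-memory index, independent of Solr or Haystack. This is only a sketch: the class, its method names and the deliberately crude plural-stripping "stemmer" are all hypothetical stand-ins for what a real engine would provide.

```python
import re
from collections import defaultdict


def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a plural "s"."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


class TwoFieldIndex:
    """Index every document twice: once by exact token, once by stem."""

    def __init__(self):
        self.exact = defaultdict(set)
        self.stemmed = defaultdict(set)

    def add(self, doc_id, text):
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.exact[token].add(doc_id)
            self.stemmed[crude_stem(token)].add(doc_id)

    def search(self, term):
        # A quoted term is looked up in the exact field,
        # a bare term in the stemmed field.
        if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
            return set(self.exact.get(term[1:-1].lower(), set()))
        return set(self.stemmed.get(crude_stem(term.lower()), set()))
```

With documents "ECG recordings" (id 1) and "A single recording" (id 2), the bare query recordings matches both, while "recordings" in quotes matches only the first. The friction described above is that the query parser, not the application, would need to route quoted terms to the exact field.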

Searching for dates and ranges is possible, but difficult to do correctly.