alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

FEATURE: Allow/improve partial search #3747

Closed (andresmrm closed 4 days ago)

andresmrm commented 2 months ago

Is your feature request related to a problem? Please describe.
Sometimes the search term appears without a space separating it from an adjacent word (for example nº123 instead of nº 123), so a query for 123 finds nothing; I have to search for nº123.

Describe the solution you'd like
I would like to search for 123 and find nº123.

Describe alternatives you've considered
Sometimes using ??123 can help, but not if the number of characters varies.

As discussed in Slack, I managed to query Elasticsearch directly with regex queries, but they were too slow (~3 s each) and I needed to query a huge list of terms. So I ended up running regular queries for the most common patterns (~30 ms each). For example, in my case the terms generally appear as 0123456789 or 012.345.678-9, so I queried both versions of each term (2 × 30 ms = 60 ms ≪ 3 s). But I gave up on less common cases, like nº123.
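The two-variant workaround described above could be sketched roughly as follows. This is only an illustration: the function name `format_variants` is hypothetical, and the 3-3-3-1 grouping is inferred from the single example 012.345.678-9, so it may not cover every real identifier.

```python
def format_variants(digits: str) -> list[str]:
    """Return the plain and the punctuated spelling of a 10-digit term.

    Assumption: the punctuated form groups digits as 3.3.3-1, as in the
    example 012.345.678-9 from the comment above.
    """
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 10 ASCII digits")
    punctuated = f"{digits[0:3]}.{digits[3:6]}.{digits[6:9]}-{digits[9]}"
    return [digits, punctuated]


if __name__ == "__main__":
    # Each variant would be sent as its own regular (non-regex) query.
    for variant in format_variants("0123456789"):
        print(variant)
```

Each returned string would then be sent as an ordinary exact-term query, which is where the 2 × 30 ms estimate comes from.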

It may be good to allow regex queries, even if slow, for when you only need to search for a few terms. And, if possible, make regex faster or offer another type of partial match.

tillprochaska commented 2 months ago

Just for context, you can use wildcard and regex queries in Aleph using the Elasticsearch query string syntax.

As you already noticed, both wildcard and regex queries are computationally expensive at search time, which makes them slow. While there are options to speed up such queries, they require indexing content differently (e.g. using ngrams), which usually comes at a significantly higher cost for ingesting and storing the data. This makes it a difficult trade-off.
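To make the trade-off concrete, here is a minimal sketch of character n-gram generation in plain Python. It is not how Aleph or Elasticsearch implement n-gram indexing; the function `char_ngrams` and its parameters are hypothetical and only illustrate why a partial term becomes directly findable while the number of indexed entries per token grows.

```python
def char_ngrams(token: str, min_n: int = 3, max_n: int = 3) -> set[str]:
    """Return every distinct substring of `token` with length in [min_n, max_n]."""
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


if __name__ == "__main__":
    trigrams = char_ngrams("nº123")
    # "123" is now its own index entry, so an exact lookup can find it.
    print("123" in trigrams)
    # Widening the range multiplies the entries stored for a single token.
    print(len(trigrams), len(char_ngrams("nº123", 1, 5)))
```

With n-grams indexed, a search for 123 becomes a cheap exact lookup, but every token contributes several index entries instead of one, which is the ingest and storage cost mentioned above.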

andresmrm commented 2 months ago

Yes, I understand it's hard to make it faster... =/

I knew about the "abc?" query, but not the "abc*". Maybe it should be added to the docs? https://docs.aleph.occrp.org/users/search/advanced/

Regex search "abc.*" doesn't seem to work for me from the Aleph search page, only when accessing ES directly.

Edit: Oops, I see why now. It should be "/abc.*/". Sorry for the confusion.
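The difference between the three query forms mentioned in this thread ("??123", "*123", and a regex wrapped in slashes) can be approximated locally. The sketch below is only an emulation: it uses Python's `fnmatch` and `re` rather than Elasticsearch, the sample tokens and helper names are made up, and Lucene's regex dialect differs from Python's in details. It does reflect that the regex has to be wrapped in `/.../` and is matched against the whole term.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical indexed tokens, chosen for illustration only.
TOKENS = ["nº123", "ref123", "123"]


def wildcard_hits(pattern: str, tokens: list[str]) -> list[str]:
    """Emulate a wildcard term: '?' is one character, '*' is any run."""
    return [t for t in tokens if fnmatchcase(t, pattern)]


def regex_hits(query: str, tokens: list[str]) -> list[str]:
    """Emulate a regex term written as /pattern/, matched against the whole token."""
    if len(query) < 2 or not (query.startswith("/") and query.endswith("/")):
        raise ValueError("regex queries must be wrapped in forward slashes")
    pattern = re.compile(query[1:-1])
    return [t for t in tokens if pattern.fullmatch(t)]


if __name__ == "__main__":
    print(wildcard_hits("??123", TOKENS))  # fixed-length prefix only
    print(wildcard_hits("*123", TOKENS))   # prefix of any length
    print(regex_hits("/.*123/", TOKENS))   # same idea, regex form
```

In this emulation "??123" only matches tokens with exactly two leading characters, which is the limitation raised in the original request, while "*123" and "/.*123/" match a prefix of any length.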

tillprochaska commented 4 days ago

Hi @andresmrm, sorry for the late reply. Thanks for your suggestion; I have added a section to the docs that links to the full ES query syntax reference.