CentreForDigitalHumanities / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl
MIT License

Search for search terms no further than *n* words apart #1648

Open BeritJanssen opened 2 months ago

BeritJanssen commented 2 months ago

Is your feature request related to a problem? Please describe. This came up as a potential future request from the People & Parliament team: they would like to search for two terms and make sure that they occur in close proximity to each other, not just anywhere in a document.

Describe the solution you'd like It seems the Elasticsearch intervals query might fit the bill. So I think the technical implementation should not be too much of an issue, but how to reflect this different type of query in the UI requires some consideration.
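For illustration, a minimal sketch of what such an intervals query body could look like, built as a Python dict. The helper name `intervals_proximity_query` and the field name `content` are assumptions made for this example, not part of I-analyzer; the `all_of`, `ordered`, `max_gaps` and `match` keys follow the Elasticsearch intervals query documentation.

```python
import json


def intervals_proximity_query(field, term_a, term_b, max_gaps):
    """Build a request body that matches documents in which both terms
    occur in `field`, in either order, with at most `max_gaps`
    positions between them."""
    return {
        "query": {
            "intervals": {
                field: {
                    "all_of": {
                        "ordered": False,      # accept either term order
                        "max_gaps": max_gaps,  # positions allowed in between
                        "intervals": [
                            {"match": {"query": term_a}},
                            {"match": {"query": term_b}},
                        ],
                    }
                }
            }
        }
    }


# "content" is a hypothetical field name
body = intervals_proximity_query("content", "firstterm", "secondterm", 5)
print(json.dumps(body, indent=2))
```

Because `ordered` is false, a single rule covers both term orders, so no second branch is needed.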

Describe alternatives you've considered We might also post-process documents that match the simple query string query, but I'm not sure that would be a better solution, as the UI question remains.
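As a rough sketch of what such post-processing could look like: the hypothetical helper below checks whether two terms occur no further than *n* token positions apart in a document's text. It assumes naive lowercase word tokenization, which will not match the analyzers Elasticsearch applies at index time, and it measures distance as the difference in token positions.

```python
import re


def within_distance(text, term_a, term_b, max_distance):
    """Return True if term_a and term_b occur at most `max_distance`
    token positions apart in `text`, in either order."""
    # Naive tokenization: lowercase, split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a.lower()]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b.lower()]
    # Compare every occurrence of one term with every occurrence of the other.
    return any(
        abs(a - b) <= max_distance
        for a in positions_a
        for b in positions_b
    )


sentence = "The tax reform and the new budget"
print(within_distance(sentence, "tax", "budget", 5))  # positions 1 and 6
print(within_distance(sentence, "tax", "budget", 4))
```

A filter like this would run on the hits returned by the simple query string query, so it adds a second pass over every matching document.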


jgonggrijp commented 2 months ago

Does something like "firstterm secondterm"~5 OR "secondterm firstterm"~5 not already do what is asked?

lukavdplas commented 2 months ago

OR is redundant (and would be | in simple query string syntax), but yes, that would work :+1:

I was somewhat surprised because the query documentation on I-analyzer suggests that ~ for phrases has rather different semantics. It turns out that this contradicts the Elasticsearch manual.

That said, the query you formulate here isn't something I would expect a non-programmer to come up with. I would support making a more beginner-friendly option for this as part of #1436

jgonggrijp commented 2 months ago

I think the query I suggested will match secondterm apple banana cherry date elderberry firstterm, while a simplified version without the OR and the second branch would not. Other than that, I agree those two queries would be equivalent.

lukavdplas commented 2 months ago

Ah, to clarify: I meant you could leave out the disjunction operator, so "firstterm secondterm"~5 "secondterm firstterm"~5.
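A beginner-friendly option could generate this query string from two terms and a distance. The sketch below is one possible shape for that; the helper names and the `content` field are hypothetical, while `simple_query_string`, `fields` and `default_operator` are existing Elasticsearch parameters. Setting `default_operator` to `or` explicitly is what allows the `|` between the two phrases to be left out.

```python
def proximity_query_string(term_a, term_b, slop):
    """Build a simple query string that looks for both terms as a
    sloppy phrase, once in each order."""
    forward = f'"{term_a} {term_b}"~{slop}'
    backward = f'"{term_b} {term_a}"~{slop}'
    # Without an explicit operator the two phrases are combined
    # using the query's default operator.
    return f"{forward} {backward}"


def simple_query_string_body(query, field):
    """Wrap a query string in a simple_query_string request body."""
    return {
        "query": {
            "simple_query_string": {
                "query": query,
                "fields": [field],         # hypothetical field name
                "default_operator": "or",  # either phrase may match
            }
        }
    }


query = proximity_query_string("firstterm", "secondterm", 5)
print(query)
print(simple_query_string_body(query, "content"))
```

The first `print` shows `"firstterm secondterm"~5 "secondterm firstterm"~5`, the same string as in the comment above.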