Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Can't search for sentences excluding word(s) #3053

Closed tahmid02016 closed 1 year ago

tahmid02016 commented 1 year ago

To Reproduce

Steps to reproduce the behavior:

  1. Go to Tatoeba (https://tatoeba.org/en).
  2. Search for a word with a - prefix to get sentences excluding that word.
  3. Get the error message: "An error occurred while performing the search."

Example:

  1. Go to https://tatoeba.org/en
  2. Search -Tom.
  3. Get error message.

Expected behavior

When a word is searched with a - prefix, all sentences that do not contain that word should be returned as search results. Example: when a user searches for -Tom, all sentences that do not contain the word Tom should be returned.

vinkaks commented 1 year ago

The underlying error is a Sphinx error:

query error: query is non-computable (single NOT operator)

The error is due to how Sphinx search works: Sphinx needs a list of results from which it can then remove the items matching the NOT operator.
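
To illustrate the point, here is a minimal sketch using a toy inverted index. This is not Sphinx's actual implementation; the sample sentences and the function names (build_index, search) are hypothetical. An index only records where a term does occur, so an exclusion needs a positive result set to subtract from.

```python
# Toy inverted index: term -> set of sentence ids (hypothetical sample data).
sentences = {
    1: "Tom likes apples",
    2: "Mary likes pears",
    3: "Tom and Mary met",
}


def build_index(docs):
    """Map each lowercased term to the ids of the sentences containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, include, exclude):
    """Intersect the postings of `include`, then subtract those of `exclude`."""
    if not include:
        # Nothing to subtract from: the index cannot enumerate
        # "every sentence" without scanning the whole corpus.
        raise ValueError("query is non-computable (single NOT operator)")
    result = set.intersection(*(index.get(t, set()) for t in include))
    for term in exclude:
        result -= index.get(term, set())
    return result


index = build_index(sentences)
likes_without_tom = search(index, include=["likes"], exclude=["tom"])
```

Here a query such as "likes -Tom" works because "likes" supplies the starting set, while a bare "-Tom" has nothing to start from.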

ckjpn commented 1 year ago

@tahmid02016

See the Wiki.

How to find English sentences without "the", "a" or "an" https://en.wiki.tatoeba.org/articles/show/text-search#how-to-find-english-sentences-without-%22the%22,-%22a%22-o

This explains how you can do what you want to do.

tahmid02016 commented 1 year ago

@ckjpn, I know the wiki page states a hack to bypass the error.

If you are determined to get as many results as possible, you can search for words that start with any letter of the alphabet, after putting a minus before each word that you do not want (though this query will take a long time): -the -a -an a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z

However, this search query takes a long time to load. Besides that, it is not a great solution for other languages that have a large number of letters in their alphabets. For example, Bangla has 50 letters, Thai has 72 letters, and Khmer has 74 letters.
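
For reference, the wiki workaround can be generated for any alphabet with a few lines of Python. This is only a sketch: build_exclusion_query is a hypothetical helper, and the caller has to supply the letters for scripts other than the basic Latin alphabet. The number of OR terms grows with the size of the alphabet, which is the scaling problem described above.

```python
import string


def build_exclusion_query(excluded_words, alphabet):
    """Build a query like '-the -a -an a|b|c|...' from the wiki workaround."""
    negatives = " ".join(f"-{word}" for word in excluded_words)
    positives = "|".join(alphabet)  # one OR term per letter
    return f"{negatives} {positives}"


english_query = build_exclusion_query(["the", "a", "an"], string.ascii_lowercase)
```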

jiru commented 1 year ago

This is a limitation of the search engine, Manticore. The rationale for disabling such queries is that they are very resource-intensive. In newer versions of Manticore, there is an option to enable them nonetheless, but may I ask if it is worth it? Searching for -Tom is certainly going to return an enormous number of sentences, far more than the 1000 maximum browsable. Do you really need that many? What are you trying to achieve?

ckjpn commented 1 year ago

Note that Google Search doesn't allow the following either.

-Tom https://www.google.com/search?q=-Tom&

tahmid02016 commented 1 year ago

Closing this issue as it will probably not be solved.

jiru commented 1 year ago

@tahmid02016 Sorry if my answer was a bit abrupt. I was only asking why because I would like to help you solve this issue, but I cannot do so unless I know your intention, your original problem. Simply enabling "-Tom" searches is not much of an option, because such a query would take even more time than the mentioned hack and would overload the server. But there may be other ways to solve the original problem that prompted you to open this issue. For example, if you are a developer wanting to work on a subset of the corpus that excludes all the Tom sentences, I can suggest another solution: download the corpus as a CSV file and filter it.
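
A minimal sketch of that filtering step, assuming the export is a tab-separated file with one sentence per line in the order id, language code, text (the column layout and the filter_sentences helper are assumptions; adjust them to the file you actually download):

```python
import csv
import io
import re


def filter_sentences(lines, excluded_word):
    """Yield (id, lang, text) rows whose text does not contain excluded_word."""
    # Whole-word, case-insensitive match, so "Tomorrow" is not treated as "Tom".
    pattern = re.compile(rf"\b{re.escape(excluded_word)}\b", re.IGNORECASE)
    # QUOTE_NONE: quotation marks inside sentences are ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, lang, text = row[0], row[1], row[2]
        if not pattern.search(text):
            yield sentence_id, lang, text


# Small in-memory sample standing in for the downloaded file.
sample = io.StringIO(
    "1\teng\tTom is here.\n"
    "2\teng\tMary is here.\n"
    "3\teng\tTomorrow is Monday.\n"
)
kept = list(filter_sentences(sample, "Tom"))
```

With a real download, the sample would be replaced by a file opened with open(path, encoding="utf-8", newline=""), where path points to the downloaded CSV. Because the function is a generator, the corpus is streamed row by row rather than loaded into memory at once.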