Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Some sentences do not appear in the search results #1952

Open Yorwba opened 5 years ago

Yorwba commented 5 years ago

Wall thread: https://tatoeba.org/eng/wall/show_message/32577#message_32577

There are 1679 Mandarin sentences with audio, but a search for them only returns a small percentage, though the number appears to be growing slowly.

Removing the filtering for sentences with audio as well as restrictions on orphaned and unapproved sentences yields 450 results, despite there being 60910 Mandarin sentences in total.

This also applies to Abkhaz, with the search only finding 25 sentences although there should be 27.

The lists are short enough that I could compare them and find a sentence that should appear but doesn't: #3892187 (Иҭабуп!). Explicitly searching for that word doesn't find anything. However, the same search works on dev.
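
The comparison described above amounts to a set difference between the IDs that should be searchable and the IDs the search actually returns. A minimal Python sketch, assuming both lists have already been exported as plain integer IDs (the helper names and all sample IDs except 3892187 are hypothetical):

```python
def missing_from_search(expected_ids, found_ids):
    """Return the IDs that should be searchable but were not returned, sorted."""
    return sorted(set(expected_ids) - set(found_ids))


def coverage(expected_ids, found_ids):
    """Fraction of expected IDs that the search actually returned."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(found_ids)) / len(expected)


# Hypothetical sample data; only 3892187 is taken from this thread.
expected = [101, 102, 3892187]
found = [101, 102]

print(missing_from_search(expected, found))  # prints [3892187]
```

The same `coverage` helper can be applied to the counts above to see how small a share of the sentences the search currently returns.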

That leads me to the question of how the Manticore configuration differs between dev and prod, if at all.

Maybe this is related to the index corruption in #1944, but the symptoms look very different.

trang commented 5 years ago

The configurations are the same on dev and prod. But there was a change in the config during the upgrade to Manticore 3: https://github.com/Tatoeba/tatoeba2/commit/c78d8afc8dedba842472991d54a4a11b5900b7ef

It feels like we have something wrong in the config related to these kill-lists, but it isn't noticeable on dev because we don't have much activity there.
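
To illustrate how a kill-list problem could produce these symptoms, here is a toy Python model of a main index plus a delta index. It is not Manticore's actual implementation, and the function name and sample data are made up for illustration. It only assumes the general idea that IDs on the delta's kill-list are suppressed from the main index's results.

```python
def search(main_index, delta_index, kill_list, word):
    """Toy main+delta search: IDs on the kill-list are hidden from main-index hits."""
    hits = {
        sid for sid, text in main_index.items()
        if word in text and sid not in kill_list
    }
    hits |= {sid for sid, text in delta_index.items() if word in text}
    return sorted(hits)


main_index = {1: "hello world", 2: "hello there"}  # hypothetical sentences

# Healthy case: sentence 2 was edited, so it is killed in main and re-added in delta.
print(search(main_index, {2: "hello again"}, {2}, "hello"))  # prints [1, 2]

# Faulty case: sentence 2 is on the kill-list but missing from the delta,
# so it disappears from the results.
print(search(main_index, {}, {2}, "hello"))  # prints [1]
```

In this model, a server with little activity has an almost empty kill-list, so a misconfiguration of this kind would rarely hide anything there.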

~There are now only 23 sentences found in the search where you found 25 sentences by the way.~ Edit: the search returns 27 results now, as I have rebuilt the indexes.

AndiPersti commented 5 years ago

There is still a problem with the search.

#5978209 isn't found on prod, and it isn't found on dev either because for some reason it doesn't exist there.

#3230000 isn't found on prod but is found on dev. The same for #1980618, #950300, #3239167, #380264, and #1892592.

jiru commented 4 years ago

> #5978209 isn't found on prod

This one is an orphan, which is why it doesn’t show up by default. You need to enable orphan sentences in the search criteria.

> and it isn't found on dev either because for some reason it doesn't exist there.

This is the expected behaviour. Dev is always outdated because we don’t update it unless we have something to test there.

> #3230000 isn't found on prod but is found on dev.

I confirm. It appears the index is okay and the sentence was just not included. Reindexing ORV solved the problem.

> The same for #1980618, #950300, #3239167, #380264, and #1892592.

Yeah, except for #1980618 and #380264. They are correctly indexed now.

So reindexing seems to solve the problem. Maybe this bug is just a side effect of #1944.

AndiPersti commented 4 years ago

> This is the expected behaviour. Dev is always outdated because we don’t update it unless we have something to test there.

Yes, I understand that. But I just found it strange that it doesn't appear on Dev although there were logs for it and it's from 2017 (AFAIK the database on Dev is from May 2019).

> So reindexing seems to solve the problem. Maybe this bug is just a side effect of #1944.

Yes, it looks like reindexing the database is necessary. I've just retested Abkhaz, and although Trang said that all 27 sentences were found after rebuilding the index in September, the search has again "lost" one sentence: #4256204. Searching for these words currently doesn't find anything.