Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Add word count filter criterion to search #3032

Closed jiru closed 11 months ago

jiru commented 1 year ago

Closes #1954. No reindexing is required for this to work, since the word count is already indexed by Manticore. Some of the large changes in tests/ are just refactoring.

jiru commented 1 year ago

@alanfgh or anyone else: feedback welcome on wording:

[two screenshots]

alanfgh commented 1 year ago

The wording looks fine to me. But we could just as well start from 1 rather than 0, since we don't have any empty sentences (or shouldn't, anyway). I think that would raise fewer questions.

Thanks for asking, @jiru.

LBeaudoux commented 1 year ago

@jiru thanks for adding this feature.

Why not use a range slider? This would be very helpful for those who prefer medium-length sentences.

It would also be nice if you could set a default sentence length range on your settings page, so you don't have to set it every time you search.

jiru commented 1 year ago

@alanfgh Thanks for checking the wording. In theory, there are no zero-word sentences, but they do sometimes appear because of a bug that still needs to be fixed, so searching for zero-word sentences actually helps track down such bugs. I agree it would be less confusing to default to 1 instead of 0.

@LBeaudoux About allowing both upper and lower limit, yes it would be a good improvement. About using the range slider, I am not sure it would improve usability, because the min/max numbers are not very visible. Besides, how would we decide on the upper limit of the slider? By the way, Tatoeba uses AngularJS Material, not Angular Material, so we cannot use what you linked.

About adding a new user setting: if we add a new setting for every little thing like that, it will become very complicated for both developers and users. I would rather try a different approach, such as improving the ranking algorithm (or adding a new one) so that it favors sentences that are neither too long nor too short.
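To make the ranking idea concrete, here is a minimal sketch of a length-aware score that peaks at a target word count and decays on both sides. It is not Tatoeba's ranking code: the function name `length_score`, the target of 8 words, the spread of 0.6, and whitespace-based word counting are all hypothetical choices made for illustration.

```python
import math

def length_score(word_count: int, target: int = 8, spread: float = 0.6) -> float:
    """Return a score in [0, 1] that peaks when word_count == target.

    Hypothetical: target and spread are illustrative, not Tatoeba values.
    The score is a Gaussian over log(word_count), so half the target and
    double the target are penalized equally.
    """
    if word_count < 1:
        return 0.0  # zero-word sentences get the lowest score
    distance = math.log(word_count) - math.log(target)
    return math.exp(-(distance ** 2) / (2 * spread ** 2))

# Sort candidate sentences so that medium-length ones come first.
# Splitting on whitespace is a simplification that only suits some languages.
sentences = ["Hi.", "What time does the next train to Paris leave?", "Yes, I do."]
ranked = sorted(sentences, key=lambda s: length_score(len(s.split())), reverse=True)
```

Working on a log scale means the penalty depends on the ratio to the target rather than the absolute difference, so one extra word matters more in a short sentence than in a long one.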

LBeaudoux commented 1 year ago

> About allowing both upper and lower limit, yes it would be a good improvement.

What do you think of 2 custom inputs taking optional values and displaying "min." and "max." as placeholders?

Length: [min.] to [max.] word(s)
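A minimal sketch of how two optional inputs like these could be turned into a range, assuming an empty field means "no limit on that side". The function name `parse_word_count_range` and the error messages are hypothetical and not taken from the pull request.

```python
from typing import Optional, Tuple

def parse_word_count_range(min_raw: str, max_raw: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse the two optional form fields into (lower, upper).

    Hypothetical helper: None on either side means that bound is not applied.
    """
    def parse(raw: str) -> Optional[int]:
        raw = raw.strip()
        if not raw:
            return None  # field left empty, so no bound on this side
        value = int(raw)  # raises ValueError on non-numeric input
        if value < 0:
            raise ValueError("word count cannot be negative")
        return value

    lower, upper = parse(min_raw), parse(max_raw)
    if lower is not None and upper is not None and lower > upper:
        raise ValueError("min. must not exceed max.")
    return lower, upper
```

For example, `parse_word_count_range("3", "")` yields a lower bound of 3 with no upper bound, and entering the same number in both fields selects an exact length.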

jiru commented 1 year ago

@LBeaudoux Thanks for the suggestion, it makes total sense.

I’ve updated this pull request. The form now looks like this: [screenshot]

ckjpn commented 11 months ago

Note that the word "Length:" doesn't fit.

[Screenshot: Screen Shot 2023-07-30 at 7 21 28]

One way to solve this would be to change "Length" to "Length in words:" and drop the "word(s)" part. This would give the same information while word-wrapping properly in the space provided.

jiru commented 11 months ago

@ckjpn I updated the design on https://dev.tatoeba.org/, can you check if it still word-wraps the label on your end?

ckjpn commented 11 months ago

It looks good on the several Macintosh browsers I tested it on.

ckjpn commented 11 months ago

This gets unexpected results.

Length = 8, search Japanese

="何と読む"

https://dev.tatoeba.org/en/sentences/search?from=jpn&has_audio=&native=&orphans=no&query=%3D%22%E4%BD%95%E3%81%A8%E8%AA%AD%E3%82%80%22&sort=relevance&sort_reverse=&tags=&to=&trans_filter=limit&trans_has_audio=&trans_link=&trans_orphan=&trans_to=&trans_unapproved=&trans_user=&unapproved=no&user=&word_count_max=8&word_count_min=8
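For readers who want to see what the link encodes, the sketch below decodes it with Python's standard library. The URL is shortened to the parameters relevant here (`from`, `query`, `word_count_min`, `word_count_max`), all of which appear in the link above; the variable names are only for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the search link, keeping only the relevant parameters.
url = (
    "https://dev.tatoeba.org/en/sentences/search"
    "?from=jpn"
    "&query=%3D%22%E4%BD%95%E3%81%A8%E8%AA%AD%E3%82%80%22"
    "&word_count_max=8&word_count_min=8"
)

# parse_qs percent-decodes values and returns a list per parameter.
params = parse_qs(urlsplit(url).query)

language = params["from"][0]
query = params["query"][0]
word_count_min = int(params["word_count_min"][0])
word_count_max = int(params["word_count_max"][0])
```

Decoded, the query is the exact-match search for 何と読む, with both word count bounds set to 8.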

jiru commented 11 months ago

This happens because and are treated as words. @ckjpn Can you open a new issue please?