commons-app / apps-android-commons

The Wikimedia Commons Android app allows users to upload pictures from their Android phone/tablet to Wikimedia Commons
https://commons-app.github.io/
Apache License 2.0

Evaluate the usages of the `search` API for category search in the app #3641

Open sivaraam opened 4 years ago

sivaraam commented 4 years ago

We seem to be using the search API for category search in a couple of places in the app. As mentioned in the comments in #3179 [ref 1] [ref 2], it has some problems and is not the appropriate API for category search. So, it's best to evaluate the usage of that API in the app and see if we can find proper replacements.

One such usage seems to be for suggesting categories based on the titles of the images. As far as I can think, that's hardly useful. I think it's best if we don't do such a category search based on image title at all.

I'm not sure about the other usage, though.

maskaravivek commented 4 years ago

> One such usage seems to be for suggesting categories based on the titles of the images.

Yes, currently this is one of the ways of suggesting categories, but there are a few other types of suggestions (i.e. based on the location of the image, previously used categories, etc.) which are combined with this result. Personally, I find the suggestions useful, as otherwise I wouldn't know what to search for. E.g. if I am uploading an image titled "Madurai temple", it suggests multiple categories that match this name.

If the issue is that it doesn't give the best results when the user doesn't type in a case-sensitive manner, then IMO removing the option altogether isn't a good solution. I assume more experienced users are well versed in how Commons search works and would be used to searching accordingly.

Our goal should be to simply make our search results consistent with web search results.

sivaraam commented 4 years ago

> One such usage seems to be for suggesting categories based on the titles of the images.

> Yes, currently this is one of the ways of suggesting categories, but there are a few other types of suggestions (i.e. based on the location of the image, previously used categories, etc.) which are combined with this result.

I realise that and, to be clear, I'm not against the other suggestions, just the suggestions via title.

> Personally, I find the suggestions useful, as otherwise I wouldn't know what to search for. E.g. if I am uploading an image titled "Madurai temple", it suggests multiple categories that match this name.

I understand the convenience here. But this comes with a caveat which I describe below.

> If the issue is that it doesn't give the best results when the user doesn't type in a case-sensitive manner, then IMO removing the option altogether isn't a good solution. I assume more experienced users are well versed in how Commons search works and would be used to searching accordingly.

The issue is not that the search API behaves case-sensitively. In fact, the search API works case-insensitively. Here's an example API query to show that:

https://commons.m.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=intitle:covid&gsrlimit=25&gsroffset=0

Note that the query is gsrsearch=intitle:covid (covid in lower case). Observe that the results include pages that have COVID (in all caps) in their titles. So, the search is case-insensitive.
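The query above can be rebuilt programmatically to try other search terms. The following is a minimal Python sketch, not the app's code: the helper name `build_category_search_url` is made up for illustration, and only the endpoint and parameters already present in the URL above are used.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in this comment.
API_ENDPOINT = "https://commons.m.wikimedia.org/w/api.php"


def build_category_search_url(term, limit=25, offset=0):
    """Build a search-API query restricted to the Category: namespace (14).

    Hypothetical helper for illustration only.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrnamespace": 14,
        "gsrsearch": f"intitle:{term}",
        "gsrlimit": limit,
        "gsroffset": offset,
    }
    # safe=":" keeps "intitle:covid" readable instead of "intitle%3Acovid".
    return f"{API_ENDPOINT}?{urlencode(params, safe=':')}"


url = build_category_search_url("covid")
print(url)
```

Calling it with `"covid"` and with `"COVID"` and comparing the returned titles is a quick way to check the case-insensitivity claim by hand.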

The actual problem is how the search API works. It just searches the wiki pages that exist in the Category: namespace (gsrnamespace=14). But the fact is that the actual set of categories isn't the set of wiki pages that exist in the Category: namespace.

> Is it not possible to use the article search engine with (invisible) category: prefix instead? Wouldn't that search for pages in the category namespace, rather than actual categories? Some categories don't have associated pages, and you can create pages in the category namespace for non-existent categories.

[Source]

See also: https://github.com/commons-app/apps-android-commons/issues/3179#issuecomment-612052320

sivaraam commented 4 years ago

> Our goal should be to simply make our search results consistent with web search results.

Well, if we're being serious about category addition, we shouldn't be using the search API for it, for the reason I describe in my previous comment. The allcategories API seems to be the only one that does the job properly. Do enlighten me if I'm ignorant of some other magical API which is a lot better than allcategories for category search.

Also, if you give the category addition interface of the Visual editor a shot, you'll realise that it seems to be using the allcategories API too. The phab ticket T59302 is all about showing case-insensitive category suggestions in the Visual editor and, guess what, it's still open.
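For comparison, a prefix query against the allcategories list module might be built as below. This is a sketch under assumptions: `build_allcategories_url` is a hypothetical helper, the endpoint host is assumed to be the same one used in the search example earlier in the thread, and `list=allcategories`, `acprefix` and `aclimit` are the MediaWiki Action API parameters for that module. The app's actual request code may differ.

```python
from urllib.parse import urlencode

# Assumed endpoint: same host as the search-API example earlier in the thread.
API_ENDPOINT = "https://commons.m.wikimedia.org/w/api.php"


def build_allcategories_url(prefix, limit=25):
    """Build an allcategories query returning categories that start with `prefix`.

    Hypothetical helper for illustration only.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "list": "allcategories",
        "acprefix": prefix,
        "aclimit": limit,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_allcategories_url("Madurai temple"))
```

Unlike the search generator, this module lists categories themselves rather than pages in the Category: namespace, which is the distinction the thread is about.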

image

maskaravivek commented 4 years ago

@sivaraam I didn't realize that this issue was about using some other API for title category suggestion.

> As far as I can think, that's hardly useful. I think it's best if we don't do such a category search based on image title at all.

This comment of yours confused me.

Am all in for using some other API if it gives better results.

sivaraam commented 4 years ago

> @sivaraam I didn't realize that this issue was about using some other API for title category suggestion.

Apologies for the lack of clarity. I really thought I clarified that in the first paragraph.

> Am all in for using some other API if it gives better results.

The problem is: there isn't! The search API gives nice results but it isn't suited to category search. The allcategories API only supports a case-sensitive prefix search, AFAIK. So, sending the title to it is not a great idea, as we won't get better results, if we get any results at all. That's the reason I suggested removing the category suggestions using the title altogether.

I also had a look at the other usage of the search API for category search. IIUC, it comes into the picture in the "Explore" screen when searching for categories in the "Categories" tab. I wonder what we could do about this. The search API has the limitation I describe in my previous comment. If we instead use the allcategories API for the category search, it means we would be doing a case-sensitive prefix search, which would not give great results. But that's our only option, AFAIK.

Please share thoughts on these.

macgills commented 4 years ago

It has been hard to follow all the discussion on category search

> The allcategories API only supports a case-sensitive prefix search

I may be partly remembering, but does this mean only the first letter is case-sensitive in the search? Is there a solution where we send multiple requests, then combine the results and filter out the distinct categories?

sivaraam commented 4 years ago

> The allcategories API only supports a case-sensitive prefix search

> I may be partly remembering, but does this mean only the first letter is case-sensitive in the search?

Nope. There are a couple of things:

  1. Generally, MediaWiki treats the first letter of a page title case-insensitively. [ref]. Keep that in mind, always. This might help you sometime in the future, like it helped me in figuring out why the test case added in PR #3326 passed :)
  2. The allcategories API does a case-sensitive prefix search. By this I mean that a search term sent to that API is matched as a case-sensitive prefix against the category titles. For example, if I send the search term foo to the API:
    1. "prefix search" means it would only return categories that begin with "foo"
    2. "case-sensitive" means the case should match too. So, I would get the following results:
      * Foo bar
      * Foo club Factory
      * Foo BEACH

      ... but I would not get the following results:

      * FOO bar
      * Bar foo
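The two rules above can be modelled in a few lines of Python. This is only a toy model of the behaviour described in this comment, not the MediaWiki implementation; `normalize_first_letter` and `prefix_matches` are names invented for the sketch.

```python
def normalize_first_letter(text):
    """Upper-case only the first character (rule 1: first letter is case-insensitive)."""
    return text[:1].upper() + text[1:]


def prefix_matches(term, title):
    """Rule 2: case-sensitive prefix match, applied after first-letter normalisation."""
    return normalize_first_letter(title).startswith(normalize_first_letter(term))


titles = ["Foo bar", "Foo club Factory", "Foo BEACH", "FOO bar", "Bar foo"]
matches = [t for t in titles if prefix_matches("foo", t)]
print(matches)  # ['Foo bar', 'Foo club Factory', 'Foo BEACH']
```

"FOO bar" is rejected because its second and third letters differ in case from the term, and "Bar foo" is rejected because the term is not a prefix of it.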

Hope that clarifies your doubt.

> Is there a solution where we send multiple requests, then combine the results and filter out the distinct categories?

That's not a good idea even in theory. To properly simulate a case-insensitive search via multiple queries, we would have to form all combinations of cases of the characters in the search term. So, the number of queries would grow exponentially w.r.t. the number of characters in the search term. For example, consider that the search term is foo and we want to simulate a case-insensitive search using an API that only supports a case-sensitive search. Then we would have to send a query to the API for each of the following words as the search term, and then combine the results and de-duplicate them.

* foo
* foO
* fOo
* fOO
* Foo
* FoO
* FOo
* FOO

I think that gives you an idea about why this is not possible.
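To put a number on the growth, here is a small Python sketch that enumerates the case variants of a term. It assumes plain ASCII letters, and `case_variants` is a name made up for the illustration.

```python
from itertools import product


def case_variants(term):
    """Return every lower/upper-case combination of the letters in `term`.

    Non-letters (spaces, digits) are kept as they are.
    Assumes simple ASCII letters, where lower() and upper() differ.
    """
    choices = [(ch.lower(), ch.upper()) if ch.isalpha() else (ch,) for ch in term]
    return ["".join(combo) for combo in product(*choices)]


variants = case_variants("foo")
print(variants)
# ['foo', 'foO', 'fOo', 'fOO', 'Foo', 'FoO', 'FOo', 'FOO']

# One query per variant: 2 ** (number of letters).
print(len(case_variants("madurai temple")))  # 8192
```

Since MediaWiki already ignores the case of the first letter, half of these variants are redundant, but the count still doubles with every additional letter.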

macgills commented 4 years ago

Yeah but how many categories HaVe A cASe LiKe saRCaStic SPongeBOB?

I bet all lower case, originally Typed case, ALL CAPS, Capital Case would get us 99% of results, enough to trick users anyhow.

This is for sure a bandaid but do we have a better solution? Or do we just do nothing and close this ticket and wait for an api that supports this?

sivaraam commented 4 years ago

> Yeah but how many categories HaVe A cASe LiKe saRCaStic SPongeBOB?

> I bet all lower case, originally Typed case, ALL CAPS, Capital Case would get us 99% of results, enough to trick users anyhow.

Oh, I wouldn't bet on that. Particularly given all the interesting category titles that you could find in Special:Categories.

Also, I feel that this trick would give users a false sense of case-insensitivity, making them wonder why the search seems to behave case-insensitively in some cases and case-sensitively in others. To give a real-world example, consider the following:

image

(image courtesy: https://github.com/commons-app/apps-android-commons/issues/3582#issuecomment-603744071)

Think about what would happen when the search term is "flowers in a". I don't think we can properly manipulate that search term in a way that makes our case-sensitive API return the categories "Flowers in Ain", "Flowers in Angus", etc. This is just an example. A lot of such cases would occur if we don't do a proper simulation of the case-insensitive search using the case-sensitive API. These cases make the user wonder whether the search is really case-insensitive or not. A proper simulation would be costly, though. So, it's best if we spare them the confusion and just say that the category search is limited to a case-sensitive one until we find a proper solution for this.
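The "flowers in a" case can be checked with a short Python sketch. It models the four-variant trick proposed earlier (lower case, as typed, ALL CAPS, Capital Case) against a toy case-sensitive prefix matcher. All function names are invented for the illustration, and the matcher only approximates the API behaviour described in this thread.

```python
def heuristic_variants(term):
    """Lower case, as typed, ALL CAPS and Capital Case, without duplicates."""
    candidates = [term.lower(), term, term.upper(), term.title()]
    return list(dict.fromkeys(candidates))  # order-preserving de-duplication


def normalize_first_letter(text):
    """Upper-case only the first character, as MediaWiki does for titles."""
    return text[:1].upper() + text[1:]


def prefix_matches(term, title):
    """Toy model of a case-sensitive prefix search."""
    return normalize_first_letter(title).startswith(normalize_first_letter(term))


categories = ["Flowers in Ain", "Flowers in Angus"]
variants = heuristic_variants("flowers in a")
found = [c for c in categories if any(prefix_matches(v, c) for v in variants)]

print(variants)  # ['flowers in a', 'FLOWERS IN A', 'Flowers In A']
print(found)     # []
```

Neither category is found, because the only prefix that would match is "Flowers in A", which is none of the four variants.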

> This is for sure a bandaid but do we have a better solution? Or do we just do nothing and close this ticket and wait for an api that supports this?

For the reasons I mentioned above and others I mention in a comment in #3179, I think this is our best way forward. If others have better ideas, please share them.

maskaravivek commented 4 years ago

@sivaraam I am not able to follow your final suggestion.

From what I understand, the title-based category suggestions are not providing the best results. If this might confuse the users, we can add an (i) button explaining how the suggestions are fetched. Apart from that, I don't think we can do much.

sivaraam commented 4 years ago

> @sivaraam I am not able to follow your final suggestion.

Apologies for not being clear. My last couple of comments apply mostly to #3179.

> From what I understand, the title-based category suggestions are not providing the best results. If this might confuse the users, we can add an (i) button explaining how the suggestions are fetched. Apart from that, I don't think we can do much.

I'm not sure about the lack of clarity w.r.t. how we suggest categories. The concerns I have with suggesting categories based on the title are:

  1. We now use the search API for showing the suggestions. We should actually be using the allcategories API for the reasons I mention above and in #3179.
  2. The allcategories API does a case-sensitive prefix search. So, using it to show category suggestions based on the file title wouldn't be a great idea as we would hardly get any results.

So, I suggest that we remove the title-based category suggestions altogether. Hope that clears your confusion.

maskaravivek commented 4 years ago

As already discussed above, it is better to show some title-based category suggestions rather than removing them altogether.

Tagging @misaochan for her opinions.

misaochan commented 4 years ago

If we conclude that the allcategories API provides better results, I am OK with switching to that API. However, I don't see why that would require removing title-based categories. Can we not just query allcategories for our other suggestions, query search for title-based categories, and concat the results?

If this is not doable, I feel that even providing a few title-based suggestions (even if they are case-sensitive and therefore not numerous) is better than not displaying any at all.
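The "query both and concat" idea could look roughly like the Python sketch below. It is not the app's code: `merge_suggestions` is a hypothetical helper, and the category names are made-up sample data standing in for the two API responses.

```python
def merge_suggestions(*sources):
    """Concatenate suggestion lists, keeping the first occurrence of each name."""
    seen = set()
    merged = []
    for source in sources:
        for name in source:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


# Made-up sample results, for illustration only.
allcategories_results = ["Madurai", "Madurai temples"]
title_search_results = ["Madurai temples", "Temples in Madurai"]

print(merge_suggestions(allcategories_results, title_search_results))
# ['Madurai', 'Madurai temples', 'Temples in Madurai']
```

Listing the allcategories results first keeps them at the top, so the title-based matches only add to the end of the list.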