commons-app / apps-android-commons

The Wikimedia Commons Android app allows users to upload pictures from their Android phone/tablet to Wikimedia Commons
https://commons-app.github.io/
Apache License 2.0

Make category search non case-sensitive and more user friendly #3179

Open misaochan opened 4 years ago

misaochan commented 4 years ago

Received a report from a 2.11 user on our FB page that category search is case-sensitive for her, which means that sometimes she'll type the right category in the search field but nothing will show up.

AFAIK the MW API that we use is inherently case-sensitive, but the upload wizard seems to be able to find a way around that and produces the same category suggestions regardless of case.


Edit: Apart from the case sensitivity, the allcategories API also has a problem of doing a prefix match. This does not give a great UX. We should explore ways to fix this too.
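For reference, a prefix lookup against the allcategories API might be built like the sketch below. The `acprefix` and `aclimit` parameters belong to the MediaWiki `list=allcategories` module; the function name, the limit of 25, and the use of `list=` rather than the generator form are assumptions made for illustration, not necessarily what the app does.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def allcategories_url(prefix: str, limit: int = 25) -> str:
    """Build a list=allcategories query (hypothetical helper).

    acprefix is matched by the server as a case-sensitive prefix,
    so "COVID" and "covid" are two different queries.
    """
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "list": "allcategories",
        "acprefix": prefix,
        "aclimit": limit,
    }
    return f"{API}?{urlencode(params)}"


print(allcategories_url("COVID"))
print(allcategories_url("covid"))
```

The two printed URLs differ only in the case of `acprefix`, which is exactly the difference the reporter ran into.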

nicolas-raoul commented 4 years ago

Maybe if we convert everything to lowercase then the server performs a non-case-sensitive search? That's just a hypothesis; I have not tried it.

misaochan commented 4 years ago

@nicolas-raoul Possible! We'll try it out with a direct query first.

ankit-kumar-dwivedi commented 4 years ago

Can I take this issue?

misaochan commented 4 years ago

@ankit-kumar-dwivedi please feel free!

kbhardwaj123 commented 4 years ago

@misaochan Is this issue free to be worked on? If so, can I take it?

ankit-kumar-dwivedi commented 4 years ago

Hey! Yes sure you should start working on it as I'm not working on it right now.


kbhardwaj123 commented 4 years ago

Thank you:)

sivaraam commented 4 years ago

I'm re-opening this as I believe there's a problem with how this issue was fixed.

@kbhardwaj123 Can you clarify a doubt that I have regarding your PR #3326? In the description you say:

Tested from the MW api fuzzy search url that the category suggestions would deliver the desired results no matter what case you sent in the api call so to fix the issue api call has been converted to lower case.

Are you sure the API really doesn't care about the case of the category name given to it? I'm doubtful about that for several reasons. Here are a couple:

  1. The logical one: If the API really doesn't care about the case of the search text sent to it, this issue shouldn't exist to begin with. Right? IOW, if the API is returning us all categories that match a search text despite the case in which we send the query, then there's no point in just lower-casing the search text we send to the API. Got my point? But the mere existence of this issue indicates otherwise. Correct me if I'm missing something.
  2. The practical one: I just checked with a couple of API calls and I get different results based on the case of the search text I send to the query. Here are a couple of queries which return different results despite only the case of the search text differing:

In case you're wondering why the test case didn't fail. Here's the catch:

The page title is case-sensitive except the first character.

From Manual:Page title - MediaWiki
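The rule in that quote can be illustrated with a small local sketch. It assumes the default MediaWiki behaviour of capitalising only the first character and ignores other normalisation (underscores, whitespace); both helper names are hypothetical.

```python
def normalize_title(title: str) -> str:
    """Uppercase only the first character; the rest stays case-sensitive."""
    return title[:1].upper() + title[1:]


def same_page(a: str, b: str) -> bool:
    """Two titles name the same page only if they match after normalisation."""
    return normalize_title(a) == normalize_title(b)


print(same_page("testcat", "Testcat"))  # differs only in the first character
print(same_page("testcat", "TestCat"))  # differs in a later character
```

This is why a test that only varies the first letter passes even though the lookup is still case-sensitive everywhere else.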

I think the quote speaks for itself. I'll share the actual problem w.r.t. the app in the next comment.

kbhardwaj123 commented 4 years ago

@sivaraam While I was working on this I went with what @nicolas-raoul suggested, so I ensured that all the category strings being passed to the OkHttpClient are converted to lower case. I wrote new tests for that and they worked fine, but I guess I must have missed something. I will take a look at it again.

sivaraam commented 4 years ago

Ok. Here's the issue with respect to the app: category search doesn't return any categories with a prefix that has an upper case character in it (other than the first one, of course). See #3582 for proof.

In case the issue is not clear to you from #3582, here's another example.

Here's what I get when I search for categories with "COVID" (mind the case) in the app (version: 2.12.3.629~a63a358): [screenshot]

Now, consider the linked example query which returns 25 categories which have "COVID" as their prefix. Here are the categories that the query returns:

Category:COVID-19 guidelines in Brazil
Category:COVID-19 guidelines in Argentina
Category:COVID-19 guidelines in Albania
Category:COVID-19
Category:COVID-19 guidelines by country
Category:COVID-19 guidelines in Czechia
Category:COVID-19 guidelines
Category:COVID-19 guidelines in Denmark
Category:COVID-19 guidelines in Esperanto
Category:COVID-19 Clinical Cohort Research Conference, March 18, 2019, National Medical Center, Republic of Korea
Category:COVID-19 coronavirus
Category:COVID-19 guidelines by language
Category:COVID-19 guidelines in Arabic
Category:COVID-19 guidelines in English
Category:COVID-19 guidelines in Basque
Category:COVID-19 guidelines in China
Category:COVID-19 guidelines in Estonian
Category:COVID-19 guidelines in Bengali
Category:COVID-19 guidelines in Bangladesh
Category:COVID-19 guideline cartoons by Anika Nawar Eeha and Abdullah Al Mamun in Bengali
Category:COVID-19 guidelines in Bengali by Anika Nawar Eeha and Abdullah Al Mamun
Category:COVID-19 guidelines in Catalan
Category:COVID-19 guidelines in East Timor
Category:COVID-19 DIY
Category:COVID-19 guidelines in Breton

As you can see, none of the above categories are shown in the category suggestions.
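The effect can be reproduced locally with a rough simulation. This is only a sketch under the assumption that the server capitalises the first character of the prefix and then does a case-sensitive `startswith`; the list is a small sample of the categories above, and `prefix_match` is a hypothetical stand-in for the API.

```python
CATEGORIES = [
    "COVID-19",
    "COVID-19 guidelines",
    "COVID-19 DIY",
    "COVID-19 coronavirus",
]


def prefix_match(prefix: str, titles: list) -> list:
    """Simulate a case-sensitive prefix lookup with first-letter capitalisation."""
    wanted = prefix[:1].upper() + prefix[1:]
    return [t for t in titles if t.startswith(wanted)]


print(prefix_match("COVID", CATEGORIES))          # all four sample categories
print(prefix_match("COVID".lower(), CATEGORIES))  # [] because "Covid" matches nothing
```

Lower-casing the query before sending it turns "COVID" into "covid", which the server treats as "Covid", so none of the upper-case categories can match.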

sivaraam commented 4 years ago

@misaochan Given that we've now accidentally reduced the category search space rather than increasing it, you might want to ensure we fix this before releasing the next version.

misaochan commented 4 years ago

Added to the release list, thanks for the heads up!

misaochan commented 4 years ago

Hi @kbhardwaj123 , are you currently still working on this? Please do keep us updated, thanks!

kbhardwaj123 commented 4 years ago

@misaochan sure I'm on it, will update ASAP

misaochan commented 4 years ago

Thanks @kbhardwaj123 ! As we are planning to include this in v2.13, when you submit your PR could you please rebase and submit it on the 2.13-release branch?

kbhardwaj123 commented 4 years ago

I investigated the problem, and here are my findings.

Suppose I want to find the category Temple of Ishtar at Mari by entering temple of ishtar. These are the results using

On reading the logs, I realized that the method searchAll() in CategoriesModel was only calling the prefix search, and that is where the problem lies. When I fix that by calling both the prefix and search APIs and combining the results, we finally get a case-insensitive search.

But there's a catch: We are using the beta flavor of the APIs which give the following results

Possible solution: AFAIK there are two ways

kbhardwaj123 commented 4 years ago

@misaochan @sivaraam @nicolas-raoul @maskaravivek I need your opinions on my investigation so that this can be fixed for v2.13. I mean, are we going to use the production flavor of the APIs in v2.13?

sivaraam commented 4 years ago

@kbhardwaj123 Thanks for the analysis. I'll look into it and share my comments soon. I have a quick doubt about one particular thing:

We are using the beta flavor of the APIs which give the following results

What do you mean by beta flavor of API? Do you mean the API hosted in the beta server (https://commons.wikimedia.beta.wmflabs.org/w/api.php) as opposed to the production server (https://commons.wikimedia.org/w/api.php)?

nicolas-raoul commented 4 years ago

For category search (and really any testing that does not involve actually uploading), please use the prodDebug flavor of the app. The beta server is unusable for most testing.

kbhardwaj123 commented 4 years ago

@sivaraam yes that's exactly what i meant. The API hosted on beta server https://commons.wikimedia.beta.wmflabs.org/w/api.php has server API but it is unable to give the required result, whereas the production API https://commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment. @nicolas-raoul does the prodDebug flavor use the https://commons.wikimedia.org/w/api.php APIs?

nicolas-raoul commented 4 years ago

@kbhardwaj123 Yes, prod* flavors use the production APIs, for instance https://commons.wikimedia.org/w/api.php . Sorry that our beta servers are not representative of production :'-(

kbhardwaj123 commented 4 years ago

@nicolas-raoul sure then the problem is solved already, I will create the pull request

sivaraam commented 4 years ago

@sivaraam yes that's exactly what i meant

Thanks for the clarification.

... the API hosted on beta server commons.wikimedia.beta.wmflabs.org/w/api.php has server API but it is unable to give the required result, whereas the production API commons.wikimedia.org/w/api.php gives the expected result, as shown by the links in my previous comment.

Let me now clarify something. The API hosted in the beta servers and the production servers would not differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki

sure then the problem is solved already, I will create the pull request

Can you explain how you're going to fix this? I'm asking this to ensure everyone's on the same page. Also, I would suggest that you not rush this. I say this because making the category search case insensitive seems to be a lot more complicated than it looks. It's better to know our options and choose the most appropriate one. If we have the release coming up soon, we can always just revert the changes done in PR #3326 (which we would have to do anyway) and move on with the release. We can then make the change after that. @misaochan can comment better about the deadline.

sivaraam commented 4 years ago

You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way.

Ok. Here's proof that the beta server behaves just the same way as the production server.

https://commons.wikimedia.beta.wmflabs.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=testcat&gsrlimit=25&gsroffset=0

This returns the Category:TestCat despite the search term being testcat. So, the beta server's generator=search is case-insensitive too.
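A query like the one above can be assembled programmatically. The parameters are copied from that URL; the helper name and its defaults are assumptions for illustration.

```python
from urllib.parse import urlencode

BETA_API = "https://commons.wikimedia.beta.wmflabs.org/w/api.php"


def category_search_url(api: str, text: str, limit: int = 25, offset: int = 0) -> str:
    """Build a generator=search query restricted to the Category namespace (14)."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrnamespace": 14,
        "gsrsearch": text,
        "gsrlimit": limit,
        "gsroffset": offset,
    }
    return f"{api}?{urlencode(params)}"


print(category_search_url(BETA_API, "testcat"))
```

Swapping `BETA_API` for the production endpoint is the only change needed to compare the two servers with the same search text.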

kbhardwaj123 commented 4 years ago

Let me now clarify something. The API hosted in the beta servers and the production servers would not differ in a functional manner. I'm glossing over a little but I believe it's fine for the case in question. You see different results when you use the beta server only because not all the categories that are present in the prod server are present in the beta server. So, forget the idea that the APIs in the production servers are case-sensitive and the APIs in the beta servers are not, because both of them behave the same way. You can read more about the beta cluster in the following wiki page: Beta Cluster - MediaWiki

@sivaraam Initially, what I meant was that the beta servers don't have all the categories (that both APIs work the same way was clear from their documentation). This is what I wanted to show: suppose I want the category Temple of Ishtar at Mari by typing only temple of ishtar. If the prodDebug APIs are used, they give what one expects:

But the beta servers have a problem: they contain the category Temple of Ishtar at Mari, as generator=allcategories shows (see here), but generator=search is incapable of returning the category when provided with temple of ishtar (see here).

In short, the beta server's case-sensitive API (generator=allcategories) delivers a category which the case-insensitive API is not able to return, and no such problem exists with the prodDebug APIs.

kbhardwaj123 commented 4 years ago

How I intend to solve this: the searchAll method is the one at fault here, as it only calls the prefixSearch API for searching categories. If we also make a call using generator=search and combine both the prefix and normal search results, our problem would be solved. And yes, we need to use the prodDebug APIs because of the point I just mentioned above.
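The combining step described here could look roughly like the following. It is a sketch of the idea only, not the code from the app; `merge_suggestions` is a hypothetical name and the inputs are assumed to be plain lists of category titles already fetched from the two APIs.

```python
def merge_suggestions(prefix_results: list, search_results: list) -> list:
    """Concatenate both result lists, dropping duplicates and keeping order.

    Prefix matches come first because they match what the user typed exactly.
    """
    seen = set()
    merged = []
    for title in [*prefix_results, *search_results]:
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged


print(merge_suggestions(
    ["Temple of Ishtar at Mari"],
    ["Temple of Ishtar at Mari", "Temples of Ishtar"],
))
```

Keeping the prefix results ahead of the search results is a design choice made for this sketch; the thread does not say how the two lists should be ordered.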

kbhardwaj123 commented 4 years ago

@sivaraam Yes, I agree that the beta ones are case insensitive, but they don't seem to return Category:Temple of Ishtar at Mari (using generator=search), while the case-sensitive beta API (generator=allcategories) returns the category (see result).

kbhardwaj123 commented 4 years ago

So I implemented the solution with the beta server's APIs, and this is how it looks, with screenshots.

Using the category suggested by @sivaraam, Category:TestCat: [screenshot]

Now with Category:Temple of Ishtar at Mari (here I am showing that it exists on the beta server): [screenshot]

But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive: [screenshot]

And as soon as I change the flavor of the APIs to prodDebug, all these problems disappear.

sivaraam commented 4 years ago

@kbhardwaj123 Thanks for your explanations. I see your problem now.

But from the following screenshot it is visible that generator=search doesn't return that which leaves this category as case sensitive

It's prudent to explore more before coming to conclusions. AFAIK, you can't just make some categories case sensitive and others case insensitive. It doesn't even make any sense, does it? Anyways, I'll try to clarify what's going on here. Here's the description of the search API from API:Search - MediaWiki [emphasis mine]:

GET request to search for a title or text in a wiki.

Just assume for now that the search API does not search the titles; I'll come back to why I make such an assumption later. Note that the search API looks for the text in the wiki pages. So, any query you send to generator=search looks for the search text in the contents of the wiki page (the category pages are the wiki pages, in our case). So, the results you get in the beta and production servers depend not just on the presence of the categories but also on what content is present in the category pages. Let's take your case of the "Category:Temple of Ishtar at Mari".

I'm not very sure about how/why a page is included in the result, as the algorithm seems to be more involved. Relevant quote from the "Additional notes" section in the API:Search page:

Depending on which search backend is in use, how srsearch is interpreted may vary. On Wikimedia wikis which use CirrusSearch, see Help:CirrusSearch for information about the search syntax.

Coming to why I assume the search API doesn't search the title. Try the following query:

https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=temple%20of%20ishtar&gsrlimit=25&gsroffset=0&gsrwhat=title

I've just added gsrwhat=title to the query, which tells it to search just the title. As you can see, it clearly says: "title" search is disabled. Thus my assumption. See also: https://stackoverflow.com/q/14337219/5614968

I hope I've clarified your confusion about beta server not returning the results you expect, now. Let me know if I have not.

To conclude, the search API does more than what's needed (a category title search) and particularly doesn't seem to be searching the title at all. I don't think that would be a good choice. So, as I mentioned earlier we'll have to explore the proper way to achieve a case insensitive search. Here are a couple of related API pages:

Also, I believe we could ask the wikitech-l mailing list about this.

kbhardwaj123 commented 4 years ago

@sivaraam Thanks for such a comprehensive explanation :). I agree with you that the search generator could be overkill; as you pointed out, searching temple of ishtar returns some completely unrelated categories because they contain that term in their wikitext body. In the Stack Overflow question you mentioned, one person gave a workaround using intitle, as in srsearch=intitle:temple%20of%20ishtar, which could solve our issue and return only those categories with the required search term in the title. Kindly give your opinions on this.

kbhardwaj123 commented 4 years ago

@sivaraam I tried it and it returns exactly what we want. Check out the following link: https://commons.wikimedia.org/w/api.php?action=query&format=json&formatversion=2&generator=search&gsrnamespace=14&gsrsearch=intitle:temple%20of%20ishtar&gsrlimit=25&gsroffset=0
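The link above can be rebuilt from its parts as in the sketch below. The `intitle:` keyword is part of the CirrusSearch query syntax; the helper name is hypothetical, and `quote_via=quote` with `safe=":"` is used only so that the output matches the URL as written in this comment.

```python
from urllib.parse import quote, urlencode

API = "https://commons.wikimedia.org/w/api.php"


def intitle_search_url(text: str, limit: int = 25, offset: int = 0) -> str:
    """Build a generator=search query that only matches category titles."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": 2,
        "generator": "search",
        "gsrnamespace": 14,
        "gsrsearch": f"intitle:{text}",
        "gsrlimit": limit,
        "gsroffset": offset,
    }
    # quote() encodes spaces as %20; ":" is kept literal to mirror the link above.
    return f"{API}?{urlencode(params, quote_via=quote, safe=':')}"


print(intitle_search_url("temple of ishtar"))
```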

I feel that there's a tradeoff. On one hand, if we search only the title of the category, it gives quite relevant results but might eliminate some (though less relevant) possibly more suited categories. On the other hand, searching the content may also suggest some completely irrelevant results, as you pointed out:

Also, the fact that the search API searches the content is very clear from the results of the above query which include categories such as Category:Astarte (goddess), Category:Passing lion Babylon (Louvre, AO21118)

I need opinions on: If the category search should be restricted to title (of its wiki) only

nicolas-raoul commented 4 years ago

The search URL above looks better than what we currently have indeed, but still not perfect, I think Commons has a better one for us.

Users will type and should see results appear as they type. For instance, let's say I take a picture of a supermarket in Tokyo. I start typing "supermarkets in to"

How about using the API that sits behind that website search box? Is there any reason why it is not good enough?

sivaraam commented 4 years ago

I need opinions on: If the category search should be restricted to title (of its wiki) only

In a word: yes. It's best to keep the search title only to ensure that the results are predictable and straightforward. Also, searching more than just the title is out of scope for this issue which is about making category search case insensitive. We can discuss enhancing the category search separately and focus on just making the category title search case insensitive for now.

sivaraam commented 4 years ago

How about using the API that sits behind that website search box?

Good idea. We would have to find how it works.

Is there any reason why it is not good enough?

I think we can answer this only after knowing how that works :)

kbhardwaj123 commented 4 years ago

@sivaraam @nicolas-raoul so i will make it title only search and create a separate issue for improving our category search functionality in favour of something similar to that of website search

sivaraam commented 4 years ago

so i will make it title only search ...

If you are thinking of using the intitle: part of the search API, here's a problem I noticed with it. It only seems to return categories for which a category page exists (just as expected). Examples:

This might not be a problem if we use the results of the generator=search API as a supplement to the results from the allcategories API (which does not have such a problem). But I just wonder if there is a better way to properly achieve this case insensitive category title search. That's why I was suggesting that we ask the wikitech-l mailing list. We could get a reliable answer of how to go about doing this.

kbhardwaj123 commented 4 years ago

@sivaraam sure in that case I agree that we should ask on the mail list, I will hold my PR on this issue

misaochan commented 4 years ago

Any luck with the mailing list? We are holding 2.13 for this at the moment. :)

sivaraam commented 4 years ago

Any luck with the mailing list?

Apologies. I never got around to sending the e-mail to the mailing list. Got hung up with other things. I'll try to send it by tomorrow if no one else beats me to it :)

We are holding 2.13 for this at the moment. :)

You don't have to hold it anymore ;). I've created #3636 that reverts the changes done in the PR #3326. You can merge that and move on with the release. We can handle the case insensitivity in the next release :)

kbhardwaj123 commented 4 years ago

@sivaraam i agree this would be our best option for now and i am really keen to see what mailing list would suggest regarding this issue :)

sivaraam commented 4 years ago

I did some searching on Phabricator and came to know that case insensitive category title search is a long standing feature request that is yet to be addressed [ref 1] [ref 2]. The linked comment is a nice TL;DR of the status quo.

It seems we really can't use search API for the reason outlined in the comment I referred to previously and another comment in the same ticket which I'm quoting here:

Is it not possible to use the article search engine with (invisible) category: prefix instead ?

Wouldn't that search for pages in the category namespace, rather than actual categories? Some categories don't have associated pages, and you can create pages in the category namespace for non-existent categories.

That's right. To add to that, here's another reason why the search API is not a proper fit. There's something called hidden categories [ref 1] [ref 2] in MediaWiki (the wiki engine behind Commons). My understanding of them is that these hidden categories aren't meant to be added by users directly. An example of such a hidden category in Commons is Category:Uses of Wikidata Infobox - Wikimedia Commons. There's a way to identify such hidden categories using the allcategories API while the search API doesn't have such an option. [side note: we should think about filtering away hidden categories before showing category suggestions. That's for another issue though :)]
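As a rough illustration of the side note, hidden categories could be dropped from an allcategories response like this. The sketch assumes the query was made with `list=allcategories`, `acprop=hidden` and `formatversion=2`, and that each entry then carries a `category` name and a boolean `hidden` flag; the sample response is invented for the example, and the second category name in it is made up.

```python
def visible_categories(response: dict) -> list:
    """Return category names from an allcategories response, skipping hidden ones."""
    entries = response.get("query", {}).get("allcategories", [])
    return [e["category"] for e in entries if not e.get("hidden", False)]


# Hypothetical response, shaped as assumed above.
sample = {
    "query": {
        "allcategories": [
            {"category": "Uses of Wikidata Infobox", "hidden": True},
            {"category": "Uses of water", "hidden": False},
        ]
    }
}

print(visible_categories(sample))
```

Results coming from the search API carry no such flag, which is the gap described above.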

Despite all this, I sent an e-mail to the mailing list just to confirm if my understanding is correct.

To conclude, it seems we really can't provide a case-insensitive category title search for now :(

What we could do is mention our category title search use case in the following Phabricator ticket, to clarify that category search is not as "niche" a feature as they think it is. https://phabricator.wikimedia.org/T187342 That might help us get an API that we can use soon.

kbhardwaj123 commented 4 years ago

@sivaraam thanks for such elaborate insights :). I have a doubt regarding the linked comment which you mentioned, it says:

Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases

Is the "prefix index" being referred to here the allcategories API's prefix search? If so, how is it case-insensitive in most cases? I am a little confused by this.

misaochan commented 4 years ago

You don't have to hold it anymore ;). I've created #3636 that reverts the changes done in the PR #3326.

Awesome, thank you!

sivaraam commented 4 years ago

@sivaraam thanks for such elaborate insights :). I have a doubt regarding the linked comment which you mentioned, it says:

Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases

I would quote that fully, as it makes sense only when it is complete. Here it is for the sake of discussion:

Our regular search feature (aka "prefix index"), used for the main search field and used for the input field when creating an article link, is case-insensitive in most cases. On Wikimedia wikis this comes from CirrusSearch. On other wikis (and on WMF until recently) this was provided by the TitleKey extension. The search feature has a namespace filter as well. Which would allow us to do case-insensitive search of page titles in the Category namespace.

Read that fully before reading further.

Is the "prefix index" being referred to here the allcategories API's prefix search? If so, how is it case-insensitive in most cases? I am a little confused by this.

I'm reasonably confident that the comment refers either to API:Prefixsearch or to some other API. It definitely is not referring to the allcategories API, as it mentions a namespace filter which allcategories doesn't have (and doesn't have any need for).

Hope that clarifies your doubt.

sivaraam commented 4 years ago

Despite all this, I sent an e-mail to the mailing list just to confirm if my understanding is correct.

And here's our confirmation of my observation:

https://lists.wikimedia.org/pipermail/wikitech-l/2020-April/093295.html

sivaraam commented 3 years ago

Re-opening as this issue seems to have been closed by mistake.

ashishkumar468 commented 3 years ago

@sivaraam This has been fixed via #3913. Does that not fix the issue for you?

sivaraam commented 3 years ago

#3913 brings back the old case-sensitive behaviour. This issue is about making category search case-insensitive which is still an open question, to my understanding.

sivaraam commented 4 months ago

To continue the discussion from #5712, the only idea I could think of to improve our category search such that it behaves in a case-insensitive and fuzzy way is to possibly consider augmenting the results of the allcategories API with that of the API that the Special:UploadWizard uses (I suppose it is API:Opensearch as per @mnalis's finding).

This has the caveat that we would be starting to get hidden categories in our result again as API:opensearch does not know about hidden categories. Is that a fine trade off?
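To make the idea and its caveat concrete, here is a minimal sketch. It assumes `action=opensearch` with `namespace=14` returns the usual four-element array whose second element holds full titles including the "Category:" prefix; the helper names and the sample payload are hypothetical.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"
PREFIX = "Category:"


def opensearch_url(text: str, limit: int = 10) -> str:
    """Build an action=opensearch query limited to the Category namespace."""
    params = {
        "action": "opensearch",
        "format": "json",
        "namespace": 14,
        "search": text,
        "limit": limit,
    }
    return f"{API}?{urlencode(params)}"


def titles_from_opensearch(payload: list) -> list:
    """Extract titles from [query, titles, descriptions, urls], dropping the prefix."""
    return [t[len(PREFIX):] if t.startswith(PREFIX) else t for t in payload[1]]


def augment(allcategories_titles: list, opensearch_titles: list) -> list:
    """Append opensearch titles that allcategories did not already return."""
    extra = [t for t in opensearch_titles if t not in allcategories_titles]
    return list(allcategories_titles) + extra


# Hypothetical opensearch payload for the search text "covid".
payload = ["covid", ["Category:COVID-19", "Category:COVID-19 DIY"], [], []]
print(augment(["COVID-19"], titles_from_opensearch(payload)))
```

Any title that arrives only through the opensearch half comes without hidden-category information, so a filter on the allcategories side cannot be applied to it.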

nicolas-raoul commented 4 months ago

Sounds like a negative tradeoff to me. 🤔