commons-app / apps-android-commons

The Wikimedia Commons Android app allows users to upload pictures from their Android phone/tablet to Wikimedia Commons
https://commons-app.github.io/
Apache License 2.0
1.01k stars 1.2k forks source link

[Bug]: categories sometimes missing in search (while visible in manual API calls!) #5749

Open mnalis opened 4 months ago

mnalis commented 4 months ago

Summary

While doing the research for https://github.com/commons-app/apps-android-commons/issues/3179#issuecomment-2144974646 I've noticed that manually run API calls do not always return same as I see in the commons app.

According to @sivaraam for searching this API should be used in such case:

https://github.com/commons-app/apps-android-commons/blob/06e25a074c5d0a3c23565edc4dc5781977bd97d8/app/src/main/java/fr/free/nrw/commons/category/CategoryInterface.kt#L34-L39

but when I call that API manually (see "Expected behaviour" section), I get some elements missing (i.e. 2 results instead of 3). Could someone help debug this?


Probably unrelated, but API also returns warning "Unrecognized parameter: gacoffset." so it seems like we shouldn't be using that anymore; but does not seem like it should be the cause of the issue.

Steps to reproduce

  1. start uploading image and proceed to "Step 3" (category selection)
  2. type Olea (s
  3. wait for category results to display

Expected behaviour

3 categories displayed:

as that is what API call seems to return when run manually: https://commons.wikimedia.org/w/api.php?format=json&action=query&formatversion=2&generator=allcategories&prop=categoryinfo|description|pageimages&piprop=thumbnail&pithumbsize=70&gaclimit=25&gacoffset=0&gacprefix=Olea+(s

Actual behaviour

Only 2 categories displayed (See screenshot):

Device name

Samsung Galaxy S23+

Android version

Android 14 (OneUI 6.1)

Commons app version

5.0.1~af028cbdd (latest F-droid)

Device logs

nothing worthwhile (i.e. nothing mentioning the search or API calls or "Olea"...)

Screen-shots

small_Screenshot_20240603_205048_Commons

Would you like to work on the issue?

None

sivaraam commented 4 months ago

Nice observation. Thanks for opening this issue, @mnalis !

As you note, the warning about gacoffset is not a problem. The problem is this code-snippet in our codebase where we intentionally filter out categories that have a year in them. I don't think we aimed to filter out valid categories like this one. The intention was only to filter out the invalid ones. Looks like this one got filtered out as a side effect.

We should explore tweaking the logic to avoid this.

mnalis commented 4 months ago

That API call seems to return boolean field named hidden, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet has hidden: true. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.


However https://github.com/commons-app/apps-android-commons/issues/1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as .*(19|20)\\d{2}.* (which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.

I do not find in that ticket the reasoning to exclude everything resembling a year (closest is to filter out things "in the 1960s", but that has separate regex .*0s.*, and always has s after 0 at the end to indicate decade).

I would suggest just removing regex .*(19|20)\\d{2}.* unless someone (@nicolas-raoul, you opened that issue initially, do you remember any such cases?) can provide examples when category search is producing irrelevant categories containing years for the users, in which case I'd write separate (much more specific / with less false positives) regex.

While there I'd also update .*(200|201)0s.* regex to cover 2020s too (i.e. .*20[0-2]0s.*), as that has become a thing lately.

(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)

nicolas-raoul commented 3 months ago

Categories with years were really polluting category suggestions before that regexp. Whatever you typed, you would get 20 times the same result with different year, burying all but one useful result.

The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.

App-side filtering could be a solution.

HrithikPadavala commented 3 months ago

@mnalis could you please assign me for this task

nicolas-raoul commented 3 months ago

@HrithikPadavala We have not yet reached a conclusion about what strategy to use. Would you mind trying another bug? Thanks! :-)

HrithikPadavala commented 3 months ago

@nicolas-raoul sure, thanks

sivaraam commented 3 months ago

@HrithikPadavala There are some issues which we explicitly need for the upcoming 5.0.2 release. You could check them here: https://github.com/commons-app/apps-android-commons/milestone/11

Feel free to work on something that you may find interesting there 🙂

sivaraam commented 3 months ago

That API call seems to return boolean field named hidden, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet has hidden: true. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.

Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

However #1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as .*(19|20)\\d{2}.* (which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.

I could understand this.

I do not find in that ticket the reasoning to exclude everything resembling a year

I think that's because the year filtering has been in place even before that ticket. The relevant discussion could be seen in #47.

(closest is to filter out things "in the 1960s", but that has separate regex .0s., and always has s after 0 at the end to indicate decade).

It seems to me like #47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.

While there I'd also update .*(200|201)0s.* regex to cover 2020s too (i.e. .*20[0-2]0s.*), as that has become a thing lately.

That's a good suggestion.

(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)

I think for now you could feel free to raise a separate PR for updating the .*(200|201)0s.* regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂

mnalis commented 3 months ago

Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).

The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.

I would disagree that is rare. For example it includes at least following categories:

and probably many more (i.e. any human-built thing that was built to survive more than few years).

It seems to me like https://github.com/commons-app/apps-android-commons/issues/47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.

Ah, thanks for the pointer, that reference helps a lot! While I can see the reasoning, the issue is that such filter then leaves out way too much.

I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to https://github.com/commons-app/apps-android-commons/issues/47 for the regex that filters out all years 🙂

Done that part in https://github.com/commons-app/apps-android-commons/pull/5761

sivaraam commented 3 months ago

Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?

Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).

It's not a blocker certainly. I was mostly suggesting that as a way to start discussions on this so that we could at some future point of time reduce the amount of regex we need to filter out categories. It does take time to arrive at a consensus with the community 🙂

I would disagree that is rare. For example it includes at least following categories:

[ ... snip ... ]

and probably many more (i.e. any human-built thing that was built to survive more than few years).

Ah. That's a big list. Thanks for taking the time to mention the potential categories that we're filtering out accidentally as a consequence of that regex. It does seem like we should do something to improve the situation.

  • Perhaps the solution might be not to filter out everything that contain a year in a category name, but to instead sort it down to the end of the list? That way, people who are finding their result earlier in the list won't be bothered by them, and those who are desperately trying to find a category they are looking for will have to scroll to the bottom of the list, but would at least find their match there.

That's a good idea. We could certainly try this!

  • Alternative idea (building on idea in Make category search non case-sensitive and more user friendly #3179 (comment)) is to filter out categories with year initially, but it user does not find results and presses manual Load more... button, than we not only do extra API call, but also un-filter results we initially filtered (as it is obvious user is looking for any match to their search at that point).

This is also a fine idea but I believe the sorting down approach would be a bit more tangible as a quick remedy to the situation. If anyone is willing to achieve it this way, feel free to chime in.

I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂

Done that part in #5761

Great. Thanks for that! I've left a comment there. Kindly check the same. Once that's resolved, it should be good to go.

sivaraam commented 2 months ago

@mnalis I just discovered that there's this long standing PR of mine #4902 which actually proposes to stop filtering out all categories that have an year in it. There's already some related discussion in issue #4901 too. Specifically, there's some consensus on what we should and should not filter out [ref].

So, I'm starting to do you think if we could just take forward a combination of your changes (#5761) and the one proposed in #4902 in the next release and see how it goes 🤔

mnalis commented 2 months ago

@sivaraam sounds great to me; I'm all for it. :rocket:

If we find out that few more unneeded categories start to pop out, we can tune the regexes to filter out only those problematic/spammy categories, instead of everything containing a year.