Open mnalis opened 4 months ago
Nice observation. Thanks for opening this issue, @mnalis !
As you note, the warning about gacoffset is not a problem. The problem is this code-snippet in our codebase where we intentionally filter out categories that have a year in them. I don't think we aimed to filter out valid categories like this one. The intention was only to filter out the invalid ones. Looks like this one got filtered out as a side effect.
We should explore tweaking the logic to avoid this.
That API call seems to return boolean field named hidden
, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet has hidden: true
. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.
However https://github.com/commons-app/apps-android-commons/issues/1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as .*(19|20)\\d{2}.*
(which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.
I do not find in that ticket the reasoning to exclude everything resembling a year (closest is to filter out things "in the 1960s", but that has separate regex .*0s.*
, and always has s
after 0
at the end to indicate decade).
I would suggest just removing regex .*(19|20)\\d{2}.*
unless someone (@nicolas-raoul, you opened that issue initially, do you remember any such cases?) can provide examples when category search is producing irrelevant categories containing years for the users, in which case I'd write separate (much more specific / with less false positives) regex.
While there I'd also update .*(200|201)0s.*
regex to cover 2020s too (i.e. .*20[0-2]0s.*
), as that has become a thing lately.
(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)
Categories with years were really polluting category suggestions before that regexp. Whatever you typed, you would get 20 times the same result with different year, burying all but one useful result.
The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.
App-side filtering could be a solution.
@mnalis could you please assign me for this task
@HrithikPadavala We have not yet reached a conclusion about what strategy to use. Would you mind trying another bug? Thanks! :-)
@nicolas-raoul sure, thanks
@HrithikPadavala There are some issues which we explicitly need for the upcoming 5.0.2 release. You could check them here: https://github.com/commons-app/apps-android-commons/milestone/11
Feel free to work on something that you may find interesting there 🙂
That API call seems to return boolean field named
hidden
, so I first wondered wouldn't just skipping categories marked as hidden instead be enough? Quick check seems to indicate that Media needing categories mentioned in code snippet hashidden: true
. Although Photographs taken on does not (for whatever reason, it definitely looks like it should be hidden category)... So that is dead end.
Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?
However #1029 is another problem; those categories should not be hidden, but they might not be useful for upload of new pictures. However, I find such overreaching regex as
.*(19|20)\\d{2}.*
(which filters out anything containing 4digit year in this and previous century) is waaay too aggressive.
I could understand this.
I do not find in that ticket the reasoning to exclude everything resembling a year
I think that's because the year filtering has been in place even before that ticket. The relevant discussion could be seen in #47.
(closest is to filter out things "in the 1960s", but that has separate regex .0s., and always has s after 0 at the end to indicate decade).
It seems to me like #47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.
While there I'd also update
.*(200|201)0s.*
regex to cover 2020s too (i.e..*20[0-2]0s.*
), as that has become a thing lately.
That's a good suggestion.
(unless someone else would like to do their first code contribution, I could also make a PR for that -- if people agree with the reasoning above)
I think for now you could feel free to raise a separate PR for updating the .*(200|201)0s.*
regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂
Yeah. That is unfortunate. Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?
Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).
The ship being unduly filtered is a sad side-effect, but a quite rare occurrence.
I would disagree that is rare. For example it includes at least following categories:
Olea (ship, 1981)
above)1954 Maserati
* seriesOkehampton railway station (1997 to 2019)
North Avenue Bridge (Chicago, 1907 bascule)
Deutsche Bibliothek (1959 Frankfurt building)
102 35 gun 1914 model
1980 Chicago White Sox season
1981 European Council in Maastricht
101 Dalmatians (1996 film)
"We" (1927 book)
Sacred Heart of Jesus (1960 sculpture)
1981 stock certificates from Germany
1981 tram track reconstructions in Prague
Reichsmark banknotes in 1948 of the Soviet occupation zone
Russian Revolution of 1905 on stamps
and probably many more (i.e. any human-built thing that was built to survive more than few years).
It seems to me like https://github.com/commons-app/apps-android-commons/issues/47 only focused on categories other than "in the 1990s" etc. So, its tough to conclude if removing the year filter altogether would be an ideal solution. Let's discuss to see if we could come up with a better solution to this problem.
Ah, thanks for the pointer, that reference helps a lot! While I can see the reasoning, the issue is that such filter then leaves out way too much.
Perhaps the solution might be not to filter out everything that contain a year in a category name, but to instead sort it down to the end of the list?
That way, people who are finding their result earlier in the list won't be bothered by them, and those who are desperately trying to find a category they are looking for will have to scroll to the bottom of the list, but would at least find their match there.
Alternative idea (building on idea in https://github.com/commons-app/apps-android-commons/issues/3179#issuecomment-2145839062) is to filter out categories with year initially, but it user does not find results and presses manual Load more...
button, than we not only do extra API call, but also un-filter results we initially filtered (as it is obvious user is looking for any match to their search at that point).
I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to https://github.com/commons-app/apps-android-commons/issues/47 for the regex that filters out all years 🙂
Done that part in https://github.com/commons-app/apps-android-commons/pull/5761
Could we possible discuss with the Commons community to explore the feasibility of turning the categories into hidden ones? That would help us weed out the filtering we do at our end?
Perhaps. I'm not sure where to request that though, and it would not really remove the need to filter on our side (as I noted above not all problematic categories should be hidden) or simplify code - it would just reduce a number of regexes slightly. Given that regexes are something I can write well at least, it's probably not a priority (unless I'm missing something).
It's not a blocker certainly. I was mostly suggesting that as a way to start discussions on this so that we could at some future point of time reduce the amount of regex we need to filter out categories. It does take time to arrive at a consensus with the community 🙂
I would disagree that is rare. For example it includes at least following categories:
[ ... snip ... ]
and probably many more (i.e. any human-built thing that was built to survive more than few years).
Ah. That's a big list. Thanks for taking the time to mention the potential categories that we're filtering out accidentally as a consequence of that regex. It does seem like we should do something to improve the situation.
- Perhaps the solution might be not to filter out everything that contain a year in a category name, but to instead sort it down to the end of the list? That way, people who are finding their result earlier in the list won't be bothered by them, and those who are desperately trying to find a category they are looking for will have to scroll to the bottom of the list, but would at least find their match there.
That's a good idea. We could certainly try this!
- Alternative idea (building on idea in Make category search non case-sensitive and more user friendly #3179 (comment)) is to filter out categories with year initially, but it user does not find results and presses manual
Load more...
button, than we not only do extra API call, but also un-filter results we initially filtered (as it is obvious user is looking for any match to their search at that point).
This is also a fine idea but I believe the sorting down approach would be a bit more tangible as a quick remedy to the situation. If anyone is willing to achieve it this way, feel free to chime in.
I think for now you could feel free to raise a separate PR for updating the .(200|201)0s. regex. And possibly also include the comment reference to #47 for the regex that filters out all years 🙂
Done that part in #5761
Great. Thanks for that! I've left a comment there. Kindly check the same. Once that's resolved, it should be good to go.
@mnalis I just discovered that there's this long standing PR of mine #4902 which actually proposes to stop filtering out all categories that have an year in it. There's already some related discussion in issue #4901 too. Specifically, there's some consensus on what we should and should not filter out [ref].
So, I'm starting to do you think if we could just take forward a combination of your changes (#5761) and the one proposed in #4902 in the next release and see how it goes 🤔
@sivaraam sounds great to me; I'm all for it. :rocket:
If we find out that few more unneeded categories start to pop out, we can tune the regexes to filter out only those problematic/spammy categories, instead of everything containing a year.
Summary
While doing the research for https://github.com/commons-app/apps-android-commons/issues/3179#issuecomment-2144974646 I've noticed that manually run API calls do not always return same as I see in the commons app.
According to @sivaraam for searching this API should be used in such case:
https://github.com/commons-app/apps-android-commons/blob/06e25a074c5d0a3c23565edc4dc5781977bd97d8/app/src/main/java/fr/free/nrw/commons/category/CategoryInterface.kt#L34-L39
but when I call that API manually (see "Expected behaviour" section), I get some elements missing (i.e. 2 results instead of 3). Could someone help debug this?
Probably unrelated, but API also returns warning "Unrecognized parameter: gacoffset." so it seems like we shouldn't be using that anymore; but does not seem like it should be the cause of the issue.
Steps to reproduce
Olea (s
Expected behaviour
3 categories displayed:
as that is what API call seems to return when run manually: https://commons.wikimedia.org/w/api.php?format=json&action=query&formatversion=2&generator=allcategories&prop=categoryinfo|description|pageimages&piprop=thumbnail&pithumbsize=70&gaclimit=25&gacoffset=0&gacprefix=Olea+(s
Actual behaviour
Only 2 categories displayed (See screenshot):
Device name
Samsung Galaxy S23+
Android version
Android 14 (OneUI 6.1)
Commons app version
5.0.1~af028cbdd (latest F-droid)
Device logs
nothing worthwhile (i.e. nothing mentioning the search or API calls or "Olea"...)
Screen-shots
Would you like to work on the issue?
None