commons-app / apps-android-commons

The Wikimedia Commons Android app allows users to upload pictures from their Android phone/tablet to Wikimedia Commons
https://commons-app.github.io/
Apache License 2.0

Upload Wizard cannot find 和泉 (杉並区) #5794

Open nicolas-raoul opened 2 months ago

nicolas-raoul commented 2 months ago

I took a picture in the 和泉 neighborhood: https://ja.wikipedia.org/wiki/%E5%92%8C%E6%B3%89_%28%E6%9D%89%E4%B8%A6%E5%8C%BA%29

Because there are many more famous towns and people with the same name, it does not appear in the suggestions:

Screenshot_20240828-105202.png

No surprise so far.

But when I type the exact full Wikipedia article title 和泉 (杉並区), I get nothing:

Screenshot_20240828-110311.png

To select the correct depiction, the user has to navigate to the Wikidata item (https://m.wikidata.org/wiki/Q13495859), which is a pain to do on mobile, copy its QID, and paste it into our app's depiction search box:

Screenshot_20240828-144639~2.png

This is not a Japanese-specific issue; it can happen in any language.

Maybe the Wikidata search API has an option to also match on article titles?

If not, this will be a difficult issue to implement; we might have to call an additional API to get potential Wikidata items via Wikipedia article titles. Or we could batch-add article titles as aliases, if that is OK from an editorial point of view.
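If we went the extra-API route, one option (an assumption, not something the app does today) would be the MediaWiki `pageprops` query, which resolves a Wikipedia article title to its linked Wikidata item. A minimal sketch of building such a request:

```python
from urllib.parse import urlencode

def title_lookup_url(title: str, lang: str = "ja") -> str:
    """Build a MediaWiki API URL that resolves a Wikipedia article
    title to its linked Wikidata item (the wikibase_item pageprop)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "redirects": 1,   # follow redirects so alternate titles still resolve
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)
```

The response's `pageprops.wikibase_item` field would give the QID (Q13495859 for this article), which could then be fetched and appended to the depiction suggestions.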

mnalis commented 2 months ago

But when I type the exact full Wikipedia article title 和泉 (杉並区), I get nothing:

As far as I understand, that is expected, as the Commons app does not search Wikipedia for the Depicts field, only Wikidata:

https://github.com/commons-app/apps-android-commons/blob/190135d36cd4d86906d38e1bf8c06a0613f81c04/app/src/main/java/fr/free/nrw/commons/upload/depicts/DepictsInterface.kt#L21-L28

and it only fetches 25 elements: https://github.com/commons-app/apps-android-commons/blob/190135d36cd4d86906d38e1bf8c06a0613f81c04/app/src/main/java/fr/free/nrw/commons/upload/structure/depictions/DepictModel.kt#L24

In your case, it searches for this, which does not include Q13495859. Even if we increased the limit to the server maximum of 50, it still would not be found, because it is somewhere between the 50th and 100th match, i.e. here.

Or we could batch-add article titles as aliases, if that is OK from an editorial point of view.

Unfortunately, I cannot read Japanese script, so I cannot tell whether this specific case would be OK; but if the alias describes the town by an alternative name, it should be fine. More guidance can be found at: https://www.wikidata.org/wiki/Help:Aliases

However, a mass import of data not verified by a human being is unlikely to be OK (and should definitely be discussed with Wikidata admins first, even if it sounds like a good idea). See https://www.wikidata.org/wiki/Wikidata:Data_Import_Guide for general considerations. Specifically, for importing from Wikipedia I'd foresee licensing issues (Wikidata is CC0, while Wikipedia is mostly CC BY-SA 4.0, which cannot be imported into CC0).

This is not a Japanese-specific issue; it can happen in any language.

That is correct. For any popular name with more than 25 matches, if the specific item you search for does not occur in the top 25, you won't find it :cry:

mnalis commented 2 months ago

However, as that API supports paginated search, we could handle this similarly to the idea proposed for category search in the second bullet point of https://github.com/commons-app/apps-android-commons/issues/3179#issuecomment-2145839062, i.e. add a Load more button at the bottom of the results, so:

(etc.; you get the idea, but your match from this specific issue would already be found)

That way, you'd be able to find your popular search term in all cases.


Alternatives to Load more...:

nicolas-raoul commented 2 months ago

@mnalis Thanks for the link https://www.wikidata.org/wiki/Help:Aliases ! This use case is not described, but not outright banned either... would you mind asking on the talk page? If implementing this via an additional API call, the number of results will be small (most likely 0 or 1) so just appending it to the existing results is fine even without paging.

An English equivalent could be Paris, Texas: https://www.wikidata.org/wiki/Q830149. Interestingly, this one has "Paris, Texas" as an alias, presumably because people sometimes actually say "Paris, Texas" in normal conversation. That is not true for many other concepts, such as "Spring (hydrology)". Also, due to the opposite grammatical order, nobody would say or write 和泉(杉並区); they would use 杉並区和泉.

mnalis commented 2 months ago

This use case is not described, but not outright banned either...

If you are talking about "we could batch-add article titles as aliases" as the idea here, it looks like that is prohibited by step 1 of the import guide I linked to.

If you are however talking about fixing this one specific example only, it would be best if you asked about it (I can't even read the script, much less translate it or weigh its nuances).

If implementing this via an additional API call, the number of results will be small (most likely 0 or 1) so just appending it to the existing results is fine even without paging.

Perhaps, if we use some third API to search Wikipedia articles by exact title. But note that it would rarely help, as Wikipedia titles are finicky, and IIRC the user would somehow have to specify in advance which Wikipedia language to search.

E.g. I don't think searching for "Thành phố Hồ Chí Minh" in English Wikipedia titles would work, and searching for "Ho Chi Minh" on English Wikipedia won't work either if you only match exact titles, as the article is named "Ho Chi Minh City". And if you accept partial title matches, there will be far more than 0 or 1 results (there are likely many articles starting with "Ho"), so you'd still need paging (more complex when you have to page two different APIs at the same time!).

Given that just paging your original query would have solved this issue, I think that should be the first step anyway (you'd likely have to implement it for the more complex solutions too).

whym commented 2 months ago

In the Wikidata website's search results, the town is in 3rd place for 杉並 和泉 (Suginami[space]Izumi). So just adding the right disambiguating terms seems to help, and that is what I would do manually if I didn't find it with just 和泉 (Izumi).

More broadly, perhaps we could filter and rerank the raw search results to prominently show items that are more likely to be the target of a depiction. In this case, the same written form 和泉 can also refer to family names, but a name is not likely to be what a photo depicts: https://www.wikidata.org/wiki/Q26216237 https://www.wikidata.org/wiki/Q26216228

Izumi (Q13495859) has a location, so we could theoretically check how close it is to the user's location and use that to boost it when reranking.
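A hedged sketch of that reranking idea, assuming each result optionally carries coordinates (the field names, the sample coordinates, and the simple nearest-first ordering are all illustrative assumptions, not the app's behavior):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def rerank_by_proximity(results, user_lat, user_lon):
    """Stable sort: items with coordinates near the user move up;
    items without coordinates keep their original relative order."""
    def key(item):
        if "lat" in item and "lon" in item:
            return haversine_km(user_lat, user_lon, item["lat"], item["lon"])
        return float("inf")  # no coordinates: sort after located items
    return sorted(results, key=key)
```

A real implementation would probably blend distance with the API's own relevance score rather than sorting by distance alone, so that well-matching but coordinate-less items (people, concepts) are not unconditionally demoted.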

nicolas-raoul commented 2 months ago

Great finding!

We should use the same search API URL as the desktop website; it returns results where we get nothing.

This sounds much easier to implement than the solutions we had considered above.

The proximity idea is great for a subsequent phase.

Search results from the desktop website:

Screenshot_20240829-221016.png

Screenshot_20240829-220701.png

Screenshot_20240829-220622.png

Surprisingly, the mobile website's search is not as good:

Screenshot_20240829-220409~2.png

Screenshot_20240829-220430~2.png
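For reference, if the desktop result page is backed by full-text (CirrusSearch) search rather than the prefix-matching `wbsearchentities` the app currently uses, the equivalent API call would look roughly like this (an assumption to verify against the site, not a confirmed finding):

```python
from urllib.parse import urlencode

def fulltext_search_url(query: str, limit: int = 25) -> str:
    """Full-text search via action=query&list=search, which matches
    labels, aliases and descriptions more loosely than prefix search."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)
```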