MER-C / wiki-java

A MediaWiki bot framework in Java
GNU Affero General Public License v3.0

Unaccessible gender-aware namespace aliases #181

Closed PeterBowman closed 3 years ago

PeterBowman commented 3 years ago

On some non-English language projects, a dedicated user namespace prefix alias is assigned to users who select the female gender in their preferences. For instance, on plwiki, users with male or unspecified gender get the default Wikipedysta prefix, whereas female users are identified with Wikipedystka (cf. Benutzer/Benutzerin on German projects, Usuario/Usuaria on Spanish wikis, and so on).

Wiki.java automatically falls back to the default/male language-specific prefix upon normalization. This is no different from other normalization cases, e.g. (for plwiki) User->Wikipedysta, wikipedysta->Wikipedysta, Wikipedystka->Wikipedysta. However, MediaWiki honors the gender setting when a user page is queried.

Let's query w:pl:User:Cancre (on-wiki displayed as Wikipedystka:Cancre, female prefix alias) and also w:pl:User:Przykuta (Wikipedysta:Przykuta, male/default prefix) just for comparison (api.php):

<?xml version="1.0"?>
<api batchcomplete="">
  <query>
    <normalized>
      <n from="User:Cancre" to="Wikipedystka:Cancre" />
      <n from="User:Przykuta" to="Wikipedysta:Przykuta" />
    </normalized>
    <pages>
      <page _idx="320152" pageid="320152" ns="2" title="Wikipedystka:Cancre" />
      <page _idx="93794" pageid="93794" ns="2" title="Wikipedysta:Przykuta" />
    </pages>
  </query>
</api>

Wiki.java expects the normalized page name to also fall back to the male/default prefix (Wikipedysta:Cancre). It can't find it in the pages array, though, because of the special treatment of gender aliases in this specific namespace. Example:

var wiki = Wiki.newSession("pl.wikipedia.org");
wiki.getPageInfo(List.of("User:Cancre", "User:Przykuta")).forEach(System.out::println);

Result (first line refers to User:Cancre):

null
{redirect=false, size=550, lastpurged=2018-09-06T04:27:13Z, exists=true, watchers=159, protection={editexpiry=null, move=autoconfirmed, edit=autoconfirmed, cascade=false, moveexpiry=null}, pageid=93794, displaytitle=Wikipedysta:Przykuta, lastrevid=44294744, inputpagename=User:Przykuta, pagename=Wikipedysta:Przykuta, timestamp=2021-04-02T19:03:26.546280+02:00}

Reason: Wiki.java calls normalize() internally and reorders the query results according to the input titles. This normalize() method does not take into account the gender of the user that a user page belongs to. The following scheme can be found in several places, e.g. getPageInfo():

https://github.com/MER-C/wiki-java/blob/c8cc5a1d24911a189773aadcfb14b4a58edb4e23/src/org/wikipedia/Wiki.java#L1754-L1763
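To make the mismatch concrete, here is a minimal, self-contained sketch of the lookup pattern described above. It is not the code from Wiki.java: the class name, normalizeOffline() and reorder() are hypothetical stand-ins, and the normalization is reduced to the plwiki user namespace aliases only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class GenderAliasLookupDemo {
    // Hypothetical, simplified stand-in for offline normalization:
    // every user namespace alias collapses to the default (male) plwiki prefix.
    static String normalizeOffline(String title) {
        int colon = title.indexOf(':');
        if (colon < 0) {
            return title;
        }
        String prefix = title.substring(0, colon);
        if (prefix.equalsIgnoreCase("User")
                || prefix.equalsIgnoreCase("Wikipedysta")
                || prefix.equalsIgnoreCase("Wikipedystka")) {
            return "Wikipedysta" + title.substring(colon);
        }
        return title;
    }

    // Reorders API results to match the input order by looking each input up
    // under its locally normalized title. A result that the API returned under
    // a gender-specific title is not found, so its slot stays null.
    static <T> List<T> reorder(List<String> inputs, Map<String, T> resultsByApiTitle) {
        List<T> ordered = new ArrayList<>();
        for (String input : inputs) {
            ordered.add(resultsByApiTitle.get(normalizeOffline(input)));
        }
        return ordered;
    }
}
```

With results keyed by the titles from the API response above (Wikipedystka:Cancre and Wikipedysta:Przykuta), calling reorder() with User:Cancre and User:Przykuta leaves the first slot null, which matches the output shown earlier.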

Since getPageInfo() is always called internally by edit(), this bug makes it impossible to edit user pages prefixed with female aliases on gender-aware language wikis.

PeterBowman commented 3 years ago

Possible solution: parse the <normalized> element if present and use that information instead of normalize() to link query results with input titles. I'd implement some sort of resolveNormalizedParser() helper method (analogous to resolveRedirectParser()) for that purpose. The existing normalize() method would be explicitly documented as serving only limited, offline title normalization, noting that it cannot be fully aware of certain quirks (such as gender aliasing) because it has no access to server-side data.
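A rough sketch of what such a helper could look like, assuming the XML response format shown above. The class and method names are hypothetical, and the JDK DOM parser is used here only for brevity; it is not necessarily how Wiki.java parses API responses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NormalizedTitles {
    // Collects the from -> to pairs of every <n> child of <normalized>.
    // Returns an empty map when the response has no <normalized> element.
    static Map<String, String> parseNormalized(String xml) {
        Map<String, String> mapping = new HashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList blocks = doc.getElementsByTagName("normalized");
            for (int i = 0; i < blocks.getLength(); i++) {
                NodeList entries = ((Element) blocks.item(i)).getElementsByTagName("n");
                for (int j = 0; j < entries.getLength(); j++) {
                    Element n = (Element) entries.item(j);
                    mapping.put(n.getAttribute("from"), n.getAttribute("to"));
                }
            }
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse API response", ex);
        }
        return mapping;
    }

    // Resolves an input title to the title the server reported for it.
    // Titles the server did not normalize are returned unchanged.
    static String resolve(String inputTitle, Map<String, String> mapping) {
        return mapping.getOrDefault(inputTitle, inputTitle);
    }
}
```

Query results would then be looked up by resolve(inputTitle, mapping) rather than by the locally normalized title, so User:Cancre would be matched against Wikipedystka:Cancre as reported by the server.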

Bonus: solving this would also solve https://github.com/MER-C/wiki-java/issues/162.

@MER-C are you OK with this proposal? I'd be happy to work on a patch if so.

MER-C commented 3 years ago

Sounds good.