JohnSmithDev / ISFDB-Tools

Tools to query a local copy of the ISFDB database

Apparent regression: Paul Witcover being gender categorized using "Judith Lessing" alias #15

Closed by JohnSmithDev 5 years ago

JohnSmithDev commented 5 years ago

Noticed by chance when comparing an older chart for the Tiptree award with the output of my new code.

In 1997 Paul Witcover was listed as a Tiptree finalist/nominee:

http://www.isfdb.org/cgi-bin/ay.cgi?43+1997

His personal page implies that Paul Witcover is his real name, and that Judith Lessing is a pseudonym he has used.

However, it seems the latest code is picking up the pseudonym:

(book_scraping) isfdb_tools $ ./award_gender_report.py -W "James Tiptree, Jr. Award" -C "Gender-bending SF" -y 1997
...
WARNING:root:No Twitter link(s) for Paul Witcover
1997 : F : Paul Witcover : human-names:Judith Lessing

Curiously, the author_gender.py script gets the right answer:

(book_scraping) isfdb_tools $ ./author_gender.py -A "Paul Witcover"
WARNING:root:No Twitter link(s) for ['Paul Witcover']
WARNING:root:Not able to get gender using author_ids [3161, 108589] (ref=['Paul Witcover']) - will try to get gender from name instead
M (source: human-names)

The awarded work seems to have only ever been credited to Paul Witcover:

http://www.isfdb.org/cgi-bin/title.cgi?8616

JohnSmithDev commented 5 years ago

The right author ID and name come back from get_definitive_authors(), so the problem is somewhere in analyse_authors_by_gender().

JohnSmithDev commented 5 years ago

Getting closer: author_gender.get_author_gender_from_ids_and_then_name_cached(). Have now created a test case that fails.

This calls get_author_aliases(), which returns the 2 names, but with Judith Lessing first. get_author_aliases() does have logic to return the names in resemblance/relevance order, but that only applies if a textual name is provided, and here we are passing in a numeric author_id.
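To illustrate why the order differs, here is a minimal sketch of that kind of resemblance ordering. It is not the actual get_author_aliases() code: order_aliases() and its parameters are hypothetical names, and difflib is only assumed as the similarity measure.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def order_aliases(aliases: List[str],
                  reference_name: Optional[str] = None) -> List[str]:
    """Hypothetical stand-in for the ordering step of an alias lookup."""
    if reference_name is None:
        # Only a numeric ID was supplied, so there is nothing to compare
        # against: the names come back in whatever order they were stored.
        return list(aliases)

    def similarity(name: str) -> float:
        # Case-insensitive similarity ratio between 0.0 and 1.0.
        return SequenceMatcher(None, name.lower(),
                               reference_name.lower()).ratio()

    # Most similar name first.
    return sorted(aliases, key=similarity, reverse=True)


stored = ["Judith Lessing", "Paul Witcover"]  # order assumed for illustration
print(order_aliases(stored))                   # no name given: stored order
print(order_aliases(stored, "Paul Witcover"))  # name given: best match first
```

With no reference name, the pseudonym stays at the front of the list, which matches the behaviour seen in the failing test case.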

Perhaps get_author_gender_from_id_and_then_name should use the passed name first, and only if that fails, use the aliases? That function used to support multiple IDs being passed (which was something I probably never did in practice), but now it only accepts a single ID, and we can assume that the ID being passed in is probably the best one?
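A rough sketch of that idea, under stated assumptions: gender_from_name_then_aliases() and the lookup callable are hypothetical and stand in for whatever name-based detection the real code uses. The passed name is tried first, and the aliases are only consulted if it yields nothing.

```python
from typing import Callable, Iterable, Optional, Tuple


def gender_from_name_then_aliases(
        name: str,
        aliases: Iterable[str],
        lookup: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (gender, name_used), or None if no candidate name resolves.

    `lookup` is a hypothetical name-to-gender function that returns None
    when it cannot decide.
    """
    # The passed-in name always goes first; aliases follow in their given
    # order, skipping any duplicate of the passed-in name.
    candidates = [name] + [alias for alias in aliases if alias != name]
    for candidate in candidates:
        gender = lookup(candidate)
        if gender is not None:
            return gender, candidate
    return None


# Toy lookup table standing in for a real name-based detector.
toy_genders = {"Paul Witcover": "M", "Judith Lessing": "F"}
result = gender_from_name_then_aliases(
    "Paul Witcover", ["Judith Lessing", "Paul Witcover"], toy_genders.get)
print(result)
```

Returning the name that was actually used keeps the report's source column honest about which name produced the answer.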

JohnSmithDev commented 5 years ago

Amazing - now that I provide the correct issue number in the commit message, I seem to have used a format that hasn't been picked up.

One minor niggle - I'm curious why this was doing the right thing a week ago for the initial launch of the gender project, and what exactly changed in the meantime to cause this regression.