dalejn / cleanBib

Probabilistically assign gender and race proportions of first/last authors pairs in bibliography entries
MIT License

Manual correction of gender predictions #54

Closed hpay closed 1 month ago

hpay commented 8 months ago

Is there a way to manually correct predictions? For example, according to the predictions.csv file, it thinks a particular name is male, whereas I happen to know that this author is female.

dalejn commented 8 months ago

If you saved predictions.csv, you can change the cell manually and re-run only the plot_histograms function, linked here: https://github.com/dalejn/cleanBib/blob/ce9e01812bfff56b1f12fa8decd8cb0f0236d423/utils/queries.py#L341

The recalculated proportions you want are in the dat_for_baserate_plot variable.
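
If you would rather make the correction in code than in a spreadsheet, here is a minimal sketch of the idea. It runs on a toy table instead of the real file. Only the CitationKey and GendCat column names come from this thread; the citation keys, the raw category strings such as 'malemale', and the helper correct_gend_cat are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for predictions.csv. The keys and category strings are made up;
# check your own file for the exact values before editing it.
predictions = pd.DataFrame({
    "CitationKey": ["smith2020", "lee2019", "garcia2021"],
    "GendCat": ["malemale", "malefemale", "femalemale"],
})


def correct_gend_cat(df, citation_key, new_category):
    """Hypothetical helper: return a copy of df with GendCat overwritten for one citation."""
    corrected = df.copy()
    mask = corrected["CitationKey"] == citation_key
    if not mask.any():
        raise KeyError(f"No row with CitationKey {citation_key!r}")
    corrected.loc[mask, "GendCat"] = new_category
    return corrected


# Suppose the first author of smith2020 is known to be a woman.
corrected = correct_gend_cat(predictions, "smith2020", "femalemale")
print(corrected)
```

With the real file, you would load it with pd.read_csv('predictions.csv') first and save the corrected table with to_csv('predictions.csv', index=False).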

hpay commented 8 months ago

Got it, thanks!

hpay commented 7 months ago

Sorry to bother you again, but how do I actually get the values out of dat_for_baserate_plot? I edited plot_histograms to return dat_for_baserate_plot but can't print out the result. Python beginner here. I tried:

dat_for_baserate_plot = plot_histograms()
print(dat_for_baserate_plot)

"None" is displayed

dalejn commented 7 months ago

Happy to help! Sorry for the late reply and for being unclear: the code within the plot_histograms() function is what you need.

  1. Upload your modified predictions.csv file to the Binder environment (like you did for the .bib file)
  2. Run just the 1st block of code to import the packages needed to run the relevant piece of code
  3. Start a new code block by clicking the + on the top-left and paste the code below into that empty code block, then run it:
    
    names = pd.read_csv('/home/jovyan/predictions.csv')
    total_citations = names.CitationKey.nunique()
    names.GendCat = names.GendCat.str.replace('female', 'W', regex=False)
    names.GendCat = names.GendCat.str.replace('male', 'M', regex=False)
    names.GendCat = names.GendCat.str.replace('unknown', 'U', regex=False)
    gend_cats = names['GendCat'].dropna().unique()  # get a vector of all the gender categories in your paper

    dat_for_plot = names.groupby('GendCat').size().reset_index()
    all_cats = ['MU', 'WW', 'UM', 'MW', 'WM', 'UW', 'MM']
    empty_dat_for_plot = pd.DataFrame(0, index=np.arange(7), columns=['GendCat', 0])
    empty_dat_for_plot['GendCat'] = all_cats
    # copy the observed counts into the full table; categories that never occur stay at 0
    for i in set(dat_for_plot['GendCat']).intersection(empty_dat_for_plot['GendCat']):
        empty_dat_for_plot.loc[empty_dat_for_plot['GendCat'] == i, 0] = dat_for_plot.loc[dat_for_plot['GendCat'] == i, 0].values
    dat_for_plot = empty_dat_for_plot
    dat_for_plot.rename(columns={0: 'count'}, inplace=True)
    dat_for_plot = dat_for_plot.assign(percentage=dat_for_plot['count'] / total_citations * 100)

    # keep only the four categories that have a base rate; .copy() avoids a SettingWithCopyWarning below
    dat_for_baserate_plot = dat_for_plot.loc[(dat_for_plot.GendCat == 'WW') |
                                             (dat_for_plot.GendCat == 'MW') |
                                             (dat_for_plot.GendCat == 'WM') |
                                             (dat_for_plot.GendCat == 'MM'), :].copy()
    baserate = [6.7, 9.4, 25.5, 58.4]  # in row order: WW, MW, WM, MM
    dat_for_baserate_plot['baserate'] = baserate
    dat_for_baserate_plot = dat_for_baserate_plot.assign(
        citation_rel_to_baserate=dat_for_baserate_plot.percentage - dat_for_baserate_plot.baserate)

    # the last expression in a notebook cell is displayed automatically
    dat_for_baserate_plot


Feel free to let me know if there are any issues!
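
On the earlier "None" output: a Python function hands back None unless it reaches a return statement, so print() shows None when the return is missing or when the notebook is still using the old, unedited version of the function. Here is a rough sketch of the same recalculation wrapped in a function that returns the table. It is a simplified variant and not the code in utils/queries.py; the name baserate_table and the toy data are assumptions, and it assumes the base rates 6.7, 9.4, 25.5, 58.4 correspond to WW, MW, WM, MM in that order.

```python
import pandas as pd


def baserate_table(names, baserate):
    """Hypothetical helper: percentage of citations per gender category vs. a base rate."""
    total_citations = names["CitationKey"].nunique()
    # Same replacement order as in the thread: 'female' before 'male'.
    cats = (names["GendCat"]
            .str.replace("female", "W", regex=False)
            .str.replace("male", "M", regex=False)
            .str.replace("unknown", "U", regex=False))
    # Count each category; categories that never occur get a count of 0.
    counts = cats.value_counts().reindex(list(baserate), fill_value=0)
    table = pd.DataFrame({"GendCat": counts.index, "count": counts.values})
    table["percentage"] = table["count"] / total_citations * 100
    table["baserate"] = list(baserate.values())
    table["citation_rel_to_baserate"] = table["percentage"] - table["baserate"]
    return table  # without this line the caller would get None


# Toy data; the keys and raw category strings are made up for illustration.
names = pd.DataFrame({
    "CitationKey": ["a", "b", "c", "d"],
    "GendCat": ["malemale", "malemale", "femalemale", "unknownmale"],
})
baserate = {"WW": 6.7, "MW": 9.4, "WM": 25.5, "MM": 58.4}

result = baserate_table(names, baserate)
print(result)
```

If you define a function like this in a notebook cell, re-run that cell after every edit so the new version is the one that gets called.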