If you saved predictions.csv, you could change the cell manually and re-run only the plot_histograms function. Linking it here: https://github.com/dalejn/cleanBib/blob/ce9e01812bfff56b1f12fa8decd8cb0f0236d423/utils/queries.py#L341
The recalculated proportions you want are in the dat_for_baserate_plot variable
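If you'd rather not edit the CSV by hand, here is a minimal pandas sketch of the same correction. The helper name correct_gend_cat and the toy rows are made up for illustration. I'm also assuming the raw GendCat values look like 'malemale' or 'femalemale', which is only inferred from the way the plotting code rewrites them to 'MM', 'WM', and so on, so check what your own file contains first.

```python
import pandas as pd


def correct_gend_cat(names, citation_key, new_gend_cat):
    """Return a copy of `names` with GendCat overwritten for one citation key.

    Hypothetical helper, not part of cleanBib.
    """
    corrected = names.copy()
    mask = corrected['CitationKey'] == citation_key
    if not mask.any():
        raise KeyError(f'no row with CitationKey {citation_key!r}')
    # Every row that shares this CitationKey gets the new category.
    corrected.loc[mask, 'GendCat'] = new_gend_cat
    return corrected


# Toy stand-in for predictions.csv; the keys and categories are invented.
names = pd.DataFrame({
    'CitationKey': ['smith2019', 'lee2020'],
    'GendCat': ['malemale', 'malefemale'],
})

fixed = correct_gend_cat(names, 'smith2019', 'femalemale')
print(fixed)
```

With the real file you would load it with pd.read_csv, apply the correction, and write it back with to_csv(..., index=False) before re-running the plotting code.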
Got it, thanks!
Sorry to bother you again, but how do I actually get the values out of dat_for_baserate_plot? I edited plot_histograms to return dat_for_baserate_plot but can't print out the result. Python beginner here. I tried:
dat_for_baserate_plot = plot_histograms()
print(dat_for_baserate_plot)
"None" is displayed
Happy to help! Sorry for the late reply and for being unclear. The code within the plot_histograms() function is what you need:
import numpy as np
import pandas as pd

names = pd.read_csv('/home/jovyan/predictions.csv')
total_citations = names.CitationKey.nunique()
names.GendCat = names.GendCat.str.replace('female', 'W', regex=False)
names.GendCat = names.GendCat.str.replace('male', 'M', regex=False)
names.GendCat = names.GendCat.str.replace('unknown', 'U', regex=False)
gend_cats = names['GendCat'].dropna().unique()  # get a vector of all the gender categories in your paper

# count citations per category, keeping a 0 for categories that do not appear
dat_for_plot = names.groupby('GendCat').size().reset_index()
all_cats = ['MU', 'WW', 'UM', 'MW', 'WM', 'UW', 'MM']
empty_dat_for_plot = pd.DataFrame(0, index=np.arange(7), columns=['GendCat', 0])
empty_dat_for_plot['GendCat'] = all_cats
for i in set(dat_for_plot['GendCat']).intersection(empty_dat_for_plot['GendCat']):
    empty_dat_for_plot.loc[empty_dat_for_plot['GendCat'] == i, 0] = dat_for_plot.loc[dat_for_plot['GendCat'] == i, 0].values
dat_for_plot = empty_dat_for_plot
dat_for_plot.rename(columns={0: 'count'}, inplace=True)
dat_for_plot = dat_for_plot.assign(percentage=dat_for_plot['count'] / total_citations * 100)

# keep the four categories that have a base rate; the rows come out in the order WW, MW, WM, MM
# .copy() avoids pandas' SettingWithCopyWarning when the baserate column is added
dat_for_baserate_plot = dat_for_plot.loc[(dat_for_plot.GendCat == 'WW') | (dat_for_plot.GendCat == 'MW') | (dat_for_plot.GendCat == 'WM') | (dat_for_plot.GendCat == 'MM'), :].copy()
baserate = [6.7, 9.4, 25.5, 58.4]
dat_for_baserate_plot['baserate'] = baserate
dat_for_baserate_plot = dat_for_baserate_plot.assign(citation_rel_to_baserate=dat_for_baserate_plot.percentage - dat_for_baserate_plot.baserate)

print(dat_for_baserate_plot)  # print() also works outside a notebook cell
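On the "None" you saw: I haven't seen your edited copy, so this is only a guess, but a Python function that finishes without reaching a return statement hands back None. The toy functions below (names and values invented, nothing to do with cleanBib) show the difference.

```python
def without_return():
    result = {'answer': 42}  # computed, but never handed back to the caller


def with_return():
    result = {'answer': 42}
    return result  # indented inside the function, so the caller receives it


print(without_return())  # None
print(with_return())     # {'answer': 42}
```

If your return line is already in place, another possibility is that the notebook is still running the version of the function it imported earlier; restarting the kernel and re-running the import cell would rule that out.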
Feel free to let me know if there are any issues!
Is there a way to manually correct predictions? For example, according to the predictions.csv file, it thinks a particular name is male, whereas I happen to know that this author is female.