dalejn / cleanBib

Probabilistically assign gender and race proportions of first/last author pairs in bibliography entries
MIT License

Tracking papers using this code and the diversity statement #11

Open dalejn opened 4 years ago

dalejn commented 4 years ago

Develop automated methods of collating papers that have used the tool and diversity statement (Python w/ CrossRef API using the Zenodo DOI https://zenodo.org/record/3672110).

We plan to include an analysis of the collection of papers that use the code/diversity statement, comparing their citation balance to that of a random selection of similar papers that do not.

koudyk commented 4 years ago

I'm interested in collaborating on this issue!

koudyk commented 4 years ago

I thought I'd give you an update on what I've done so far @jastiso and @dalejn

Sorry this is so long... too tired to make it shorter :sleeping:

exploring crossref

I was looking on the crossref website, and it says "Our public APIs include Cited-by counts but not the actual works." So I don't think Crossref will be the best tool for this task (unless you know of a feature I haven't found; let me know if so!)

exploring other sources

To get an idea of what kinds of results I should get, I searched for the paper & code DOIs on Google Scholar and using opencitations. Here are the numbers of results, each linked to the url that I used to search:

DOI     Google Scholar   opencitations.net
paper   13               4
code    4                0

manually getting the citing papers to start

Since there weren't many results, I went through them manually and copied their Diversity Statements (and other data) into this spreadsheet.

observations about the citing papers

visualizing whether papers have statements, and what they cite

I thought it might be useful to see whether articles cite the paper and/or the code, and whether they all have diversity statements, so I made this figure (code). I thought it might help you decide how to search for papers. E.g., one paper with a diversity statement cites the paper, but not the code.

visualizing the percentages listed in citing papers' diversity statements

And I figured I might as well do a visualization (code) of the percentages reported in the diversity statements, since I had manually copied and pasted all the diversity statements into a spreadsheet.

next steps

Next, I'll try finding a way to automate finding papers that cite the doi. I know this was supposed to be the goal, but I did this manual stuff to figure out what I should expect.
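A minimal sketch of what that automation could look like, assuming the OpenCitations COCI REST endpoint below is still served and that each returned record carries a `citing` field holding a single DOI. Both are assumptions to check against the current OpenCitations documentation. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the OpenCitations COCI REST API. Verify against the
# current OpenCitations documentation before relying on this URL.
COCI_CITATIONS_URL = "https://opencitations.net/index/coci/api/v1/citations/"


def citations_url(doi):
    """Build the request URL listing all citations that point at `doi`."""
    return COCI_CITATIONS_URL + urllib.parse.quote(doi, safe="/")


def citing_dois(records):
    """Return sorted, de-duplicated citing DOIs from a decoded JSON response.

    Assumes each record is a dict whose "citing" field holds one DOI;
    records without that field are skipped.
    """
    return sorted({r["citing"].strip().lower() for r in records if r.get("citing")})


def fetch_citing_dois(doi, timeout=30):
    """Query the API (needs network access) and return the citing DOIs."""
    with urllib.request.urlopen(citations_url(doi), timeout=timeout) as response:
        return citing_dois(json.load(response))
```

For the code, the call would be something like `fetch_citing_dois("10.5281/zenodo.3672110")`, assuming the Zenodo record linked above resolves to that DOI under Zenodo's usual `10.5281/zenodo.<record id>` pattern. Given the counts in the table above, this would likely miss papers that Google Scholar finds, so it is a starting point rather than a complete list.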

dalejn commented 4 years ago

Wow, thank you for working on this, @koudyk. This is awesome! That's a great point to think about in your Venn diagram. Nothing comes immediately to mind, but I wonder if there's a way to automatically search for the boilerplate text of the Diversity Statement itself. That might require a full-text search when the papers are deposited into something like PubMed. Or maybe it's possible to perform through the pre-print servers' PDFs?
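As a rough sketch of the full-text idea, assuming the paper's text has already been obtained as a plain string (from a full-text API or a PDF extraction step, neither of which is shown here): normalize whitespace and case, then look for marker phrases. The function name and the example phrase are hypothetical and not taken from the statement template.

```python
import re


def contains_statement(full_text, marker_phrases):
    """Return True if any marker phrase occurs in the text.

    Matching ignores case and collapses runs of whitespace, so phrases
    broken across lines by PDF extraction still match.
    """
    normalized = re.sub(r"\s+", " ", full_text).lower()
    phrases = (re.sub(r"\s+", " ", p).strip().lower() for p in marker_phrases)
    # Skip empty phrases, which would otherwise match every text.
    return any(p and p in normalized for p in phrases)


# Hypothetical marker phrase; replace with wording from the actual template.
EXAMPLE_PHRASES = ["citation diversity statement"]
```

Exact-phrase matching like this would miss statements that authors have reworded, so it would probably need to be combined with the citation-based search.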

For the last visualization on percentages listed in citing papers' diversity statements, could you change the graph to depict the percent over/under-citation compared to expected benchmarks? The expected benchmarks reported by Dworkin et al. are 6.7% for woman (first)/woman (last), 9.4% for man/woman, 25.5% for woman/man, and 58.4% for man/man. Also, could the predicted gender categories be labeled man/woman? We make this distinction to note that our analysis does not consider the sex implied by male/female.
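To make the request concrete, here is a small sketch of the calculation, assuming "percent over/under-citation" means the difference between observed and expected percentages divided by the expected percentage. That definition and the function name are assumptions for illustration; the benchmark values are the ones quoted above.

```python
# Expected benchmarks (percent) as quoted above from Dworkin et al.,
# keyed by (first author, last author) predicted gender.
EXPECTED = {
    ("woman", "woman"): 6.7,
    ("man", "woman"): 9.4,
    ("woman", "man"): 25.5,
    ("man", "man"): 58.4,
}


def relative_citation(observed, expected=EXPECTED):
    """Percent over-citation (+) or under-citation (-) for each category.

    Computed as 100 * (observed - expected) / expected. This is an assumed
    definition of "relative to benchmark", not one stated in this thread.
    """
    return {
        category: 100.0 * (observed[category] - benchmark) / benchmark
        for category, benchmark in expected.items()
    }
```

Under this definition, a reference list whose proportions equal the benchmarks gives 0 in every category, and a list with twice the expected share of woman/woman papers gives +100 for that category.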

Another interesting thing we could look at is how much impact using the statement and tools had on men compared to women. I wonder if we could port some of the code from cleanBib.ipynb to analyze the first/last-authors' predicted gender of the list of citing papers your code collects. Then, we could make a figure similar to Fig 3 of the Dworkin et al. paper.

koudyk commented 4 years ago

Thanks @dalejn!

"Or maybe it's possible to perform through the pre-print servers' PDFs?" I've heard that it's hard to get text from PDFs, but I've never tried it! I wonder if it might be easier to wait until more citing papers are published, so that their text is available in a more machine-readable format.

Re the visualization, here's a figure with relative percentages. Thanks for pointing out the distinction between man/woman and male/female! I changed the figure labels accordingly.

I didn't get around to making separate figures according to the predicted gender of the authors of citing papers. It would be very interesting, and I'd be down to revisit it, maybe when more citing papers have been published.