This repository contains the scripts I used for my Bachelor Thesis research on gender equality among the authors of publications in Computer Science.
Using the DBLP XML data and a NamSor API key (preferably with a company account), you can follow the entire research by executing the Jupyter Notebooks yourself. Just do the following:
If you want to save pictures of graphs, additionally
To repeat my research (and possibly verify it!), first execute the regular Notebooks for Data Gathering and Cleaning ("01_DataGatheringAndCleaning"). They are numbered, and the order matters: run them in sequence, otherwise later Notebooks might lack data. You do not necessarily need to execute the code in the folders "03_01_DataQualityExploration", "03_02_DataImprovementTests" and "03_04_ImprovedDataQualityExploration". However, "03_03_DataImprovements" and "04_01_DataImprovements" are required and must be executed.
Then you can move on to the hypothesis tests ("02_HypothesisTests"). You can follow along or create your own hypothesis tests!
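If you would rather not click through the Notebooks one by one, the ordering rule above can be sketched as a small helper. This is only an illustration, not part of the repository: the function names `natural_key`, `ordered_notebooks` and `run_all` are made up, and it assumes that each folder's Notebooks sort correctly by their numeric prefix and that `jupyter nbconvert` is installed.

```python
import re
import subprocess
from pathlib import Path


def natural_key(path):
    """Sort key that compares digit runs as numbers, so '2_x' sorts before '10_x'."""
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def ordered_notebooks(folder):
    """Return the .ipynb files directly inside `folder`, in numeric order."""
    return sorted(Path(folder).glob("*.ipynb"), key=natural_key)


def run_all(folder):
    """Execute every Notebook in order, stopping at the first failure."""
    for notebook in ordered_notebooks(folder):
        print(f"Running {notebook} ...")
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", str(notebook)],
            check=True,  # raise if a Notebook fails, so later ones do not run on missing data
        )
```

Calling `run_all("01_DataGatheringAndCleaning")` would then run that folder from top to bottom. Note that `--inplace` overwrites the Notebook files with their executed versions.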
Some steps, such as parsing the XML, are slow or do not work at all on some computers. I suspect the cause is the amount of RAM: it worked fine on my computer, but not on a smaller laptop I tried. Sorry, making this faster was not my priority. Make a pull request if you fix this :D
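If RAM really is the bottleneck, one possible direction is to stream the XML instead of loading it all at once. The following is a minimal sketch of that idea using Python's standard `xml.etree.ElementTree.iterparse`. It is not the code used in the Notebooks. The function name `iter_records`, the chosen record tags and the tiny sample document are assumptions for illustration. The real DBLP dump also relies on a DTD for character entities, which this standard-library sketch does not handle, so a DTD-aware parser would probably be needed there.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: only these record types are of interest.
RECORD_TAGS = {"article", "inproceedings"}

# Tiny made-up document in the general shape of the DBLP XML.
SAMPLE = (
    b"<dblp>"
    b"<article key='a'><author>Ada Lovelace</author>"
    b"<author>Grace Hopper</author><title>On Engines</title></article>"
    b"<www key='w'><title>Home Page</title></www>"
    b"<inproceedings key='b'><author>Alan Turing</author>"
    b"<title>Machines</title></inproceedings>"
    b"</dblp>"
)


def iter_records(source, record_tags=RECORD_TAGS):
    """Yield (tag, title, authors) per record without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag in record_tags:
            authors = [a.text for a in elem.findall("author")]
            yield elem.tag, elem.findtext("title"), authors
            root.clear()  # drop already processed records to keep memory bounded


for tag, title, authors in iter_records(io.BytesIO(SAMPLE)):
    print(tag, title, authors)
```

Because records are discarded as soon as they have been read, memory use should stay roughly constant regardless of file size, at the cost of only being able to look at one record at a time.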
Feel free to send me feedback and questions!