I have to say I'm quite pleased with how well this turned out :smile:
It pretty much works exactly as intended and is sufficiently robust.
The only improvement I can think of is actually requesting and using an API key to remove any remaining risk of running into rate limits, but I don't think this is necessary.
Still, I did fill out the form for an API key (though I have yet to receive an answer), and I've already included the functionality to use such a key in my code - the only work left would be to pass it in as an argument in the workflow (using Secrets, of course).
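For reviewers, here's a minimal sketch of how that wiring could look. The `S2_API_KEY` variable name is just an example (a Secret exposed via `env:` in the workflow would work the same way); Semantic Scholar expects the key in the `x-api-key` request header:

```python
import os

def build_headers(env_var: str = "S2_API_KEY") -> dict:
    """Return request headers, adding the API key only when one is set.

    With no key present, requests simply fall back to the
    unauthenticated (rate-limited) tier.
    """
    headers = {}
    api_key = os.environ.get(env_var)
    if api_key:
        headers["x-api-key"] = api_key
    return headers
```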
The code should be pretty readable, but here's a written explanation.
In general, how this works is that I've added a new column to the main table which contains the Semantic Scholar Academic Graph (S2AG) paperID for the paper associated with a given dataset.
These IDs are then parsed from the table and used to look up whatever information we need about these papers (here, citation information);
this was what took me the longest to figure out how to do properly.
The obtained info is then further processed by simply counting which of these citations happened in the last five years, and finally the resulting number replaces the respective paperID, which also acts as a placeholder. This swap only happens during the build process, so the IDs themselves of course aren't lost.
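A rough sketch of the counting and swapping steps described above (function names and the response handling are illustrative, not the actual code; the real scripts pull the citation years from the S2AG citations endpoint):

```python
from datetime import date

def count_recent_citations(citation_years, window=5, today=None):
    """Count citations whose year falls within the last `window` years
    (inclusive cutoff). Entries may be None when S2AG lacks a year."""
    current_year = (today or date.today()).year
    cutoff = current_year - window
    return sum(1 for y in citation_years if y is not None and y >= cutoff)

def replace_placeholder(line, paper_id, count):
    """Swap the paperID placeholder in a table row for the citation count."""
    return line.replace(paper_id, str(count))
```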
A quick summary of the changes made:
- `count_citations.py` and `citations_fetcher.py` do everything described above
- Naturally, there's now a new column holding these new values, along with a legend at the bottom describing what this is about
- I've decided to swap the "Year" and "TL;DR" columns. I don't have a precise reasoning for this, it just feels better :sweat_smile:
- I updated the CSV-generating script, as it depends on the order in which information appears in the table
- While at it, I've also made it more robust by properly parsing things instead of relying on hardcoded offsets
- Modified the main workflow file to execute the new code
- Also added a new "Experimental Workflow", so I can test new things without breaking our ability to build a stable version
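To illustrate the "properly parsing instead of hardcoded offsets" point: locating columns by header name means reordering columns (like the "Year"/"TL;DR" swap) can't silently break the CSV output. A sketch, not the actual script:

```python
def parse_markdown_table(lines):
    """Parse a pipe-delimited markdown table into a list of row dicts,
    keyed by header name rather than column position."""
    split = lambda line: [cell.strip() for cell in line.strip().strip("|").split("|")]
    header = split(lines[0])
    rows = []
    for line in lines[2:]:  # skip the |---|---| separator line
        rows.append(dict(zip(header, split(line))))
    return rows
```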
As this will result in pretty significant merge conflicts in the `all_datasets.md` file, I'd prefer merging this PR here before anything else.
Resolves #35