fkie-cad / COMIDDS

A comprehensive survey of datasets for research in host-based and/or network-based intrusion detection, with a focus on enterprise networks
MIT License

Add automated citation count #70

Closed Maspital closed 4 months ago

Maspital commented 5 months ago

I have to say I'm quite pleased with how well this turned out :smile:

It works pretty much exactly as intended and is sufficiently robust. The only improvement I can think of is requesting and using an API key to remove any remaining risk of running into rate limits, but I don't think this is necessary. Still, I did fill out the form for an API key (though I have yet to receive an answer), and my code already supports using such a key. The only work left would be to pass it to the workflow as an argument (using Secrets, of course).
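To illustrate the optional-key idea, here is a minimal sketch (not the code from this PR) of building a citations request that only sends a key when one is available. The function name `build_citations_request` and the environment variable `S2_API_KEY` are hypothetical; the `x-api-key` header and the `/graph/v1/paper/{id}/citations` endpoint are taken from the public S2AG API, and the `limit` value is an assumption worth checking against the current documentation.

```python
import os
import urllib.parse
import urllib.request

S2_BASE = "https://api.semanticscholar.org/graph/v1"


def build_citations_request(paper_id, api_key=None, offset=0, limit=1000):
    """Build (but do not send) a request for one page of citations of a paper.

    Only the publication year of each citing paper is requested, since that
    is all that is needed to count recent citations.
    """
    query = urllib.parse.urlencode(
        {"fields": "year", "offset": offset, "limit": limit}
    )
    url = f"{S2_BASE}/paper/{paper_id}/citations?{query}"
    # Send the key only if one was provided; unauthenticated requests still
    # work but share a stricter rate limit.
    headers = {"x-api-key": api_key} if api_key else {}
    return urllib.request.Request(url, headers=headers)


# Hypothetical variable name; in a workflow it would be populated from a secret.
request = build_citations_request("0" * 40, api_key=os.environ.get("S2_API_KEY"))
```

In a GitHub Actions workflow, the environment variable would be set from a repository secret, so the key never appears in the repository itself.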

The code should be pretty readable, but here's a written explanation. I've added a new row to the main table which contains the Semantic Scholar Academic Graph (S2AG) paperID for the paper associated with a given dataset. These IDs are parsed from the table and used to look up whatever information we need about these papers (here, citation information); figuring out how to do this properly was what took me the longest. The obtained info is then processed by counting how many of these citations happened in the last five years. Finally, the resulting number replaces the respective paperID, which acts as a placeholder. This replacement only happens during the build process, so the IDs in the source file are of course not lost.
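The parse, count, and replace steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the function names are hypothetical, the paperID is assumed to appear in a table cell as a bare 40-character hex string, and "last five years" is assumed to mean a citation year no older than the current year minus five.

```python
import re
from datetime import date

# Assumption: an S2AG paperID shows up in the table as a bare 40-char hex string.
PAPER_ID = re.compile(r"\b[0-9a-f]{40}\b")


def extract_paper_ids(markdown):
    """Return the unique paperIDs found in the text, in order of appearance."""
    return list(dict.fromkeys(PAPER_ID.findall(markdown)))


def count_recent(citation_years, current_year=None, window=5):
    """Count citations whose year is within the last `window` years.

    Entries without a known year (None) are skipped.
    """
    if current_year is None:
        current_year = date.today().year
    cutoff = current_year - window
    return sum(1 for y in citation_years if y is not None and y >= cutoff)


def fill_placeholders(markdown, counts):
    """Swap each paperID for its citation count; unknown IDs stay untouched."""
    return PAPER_ID.sub(
        lambda m: str(counts.get(m.group(0), m.group(0))), markdown
    )


# Example with made-up data: one table row and pre-fetched citation years.
row = "| Some dataset | " + "a" * 40 + " |"
counts = {pid: count_recent([2024, 2019, 2018, None], current_year=2024)
          for pid in extract_paper_ids(row)}
built_row = fill_placeholders(row, counts)
```

Because `fill_placeholders` returns a new string, the source file keeps its paperIDs and only the built output carries the numbers.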

A quick summary of the changes made:

As this will result in pretty significant merge conflicts in the all_datasets.md file, I'd prefer merging this PR here before anything else.

Resolves #35