best-of-lists / best-of-generator

🏆 Generates a ranked list of awesome libraries and tools.
https://best-of.org
GNU General Public License v3.0

A publication metric for scientific best-of lists and projects #75

Open Irratzo opened 7 months ago

Irratzo commented 7 months ago

Problem and motivation:

The scientific community and the research software engineering (RSE) community also make extensive use of awesome-like lists. These typically collect publications, repositories, general resources, or a mix of those. Here are four random examples that showcase some commonly used list setups: graph-based-deep-learning-literature, https://github.com/neurreps/awesome-neural-geometry, awesome-materials-informatics, and AI4Science resources.

These projects' READMEs are static rather than CI-generated and ordered by metrics, which is the distinguishing feature of best-of lists. For some, but not all, such scientific lists, the best-of approach would be beneficial. Here is one example: best-of-atomistic-machine-learning.

It has become standard in many scientific fields to make the code and data associated with a preprint or publication available as repositories. The current project quality score therefore already serves as a useful indicator in one dimension. However, another dimension, even more important to the scientific community, is missing so far: the publication metric.

If the best-of project quality score took publication metrics into account for any project with a linked publication, this would enhance software-focused scientific lists and also open up the best-of template for publication lists.

I propose to use this issue thread for discussion.

Feature description:

I can think of four main challenges to address.

I propose using this issue thread to discuss these challenges and to add others.

Is this something you're interested in working on?

Yes

Irratzo commented 7 months ago

Proposal 01 Add project property publication_doi.

Description.

Addresses challenges 04 config format.

Extend the project properties with a property that links to the project's associated preprint or publication via its DOI, for example publication_doi. From this, publication metrics can be extracted. The corresponding project entry in a best-of list could be either an entry for that paper or an entry for the associated software project.

Example.

The property publication_doi: 10.48550/arXiv.2212.07921 points to the preprint. The corresponding project entry in a scientific best-of list could be either an entry for that preprint or an entry for the associated software project ORKG, with the associated software repo gitlab_id: TIBHannover/orkg/orkg-frontend/. In the former case, the publication metric collected via the DOI would be the project quality score. In the latter case, both the preprint and software metrics would be aggregated into a single project quality score.
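The two cases from the example can be sketched as plain project entries. This is only an illustration: the dict layout, the entry names, the helper names, and the DOI pattern are assumptions, not part of the generator. The pattern checks the shape of a DOI only and does not verify that it resolves.

```python
import re

# Loose syntactic check: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Assumed pattern for illustration; it does not confirm that the DOI exists.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def is_plausible_doi(value):
    """Return True if value has the general shape of a DOI."""
    return isinstance(value, str) and DOI_PATTERN.match(value) is not None


def has_publication(project):
    """Return True if the project entry carries a plausible publication_doi."""
    return is_plausible_doi(project.get("publication_doi"))


# Case 1: the list entry is the preprint itself (hypothetical entry name).
paper_entry = {
    "name": "ORKG preprint",
    "publication_doi": "10.48550/arXiv.2212.07921",
}

# Case 2: the list entry is the software project with a linked preprint.
software_entry = {
    "name": "ORKG",
    "gitlab_id": "TIBHannover/orkg/orkg-frontend",
    "publication_doi": "10.48550/arXiv.2212.07921",
}

# An entry without a linked publication keeps working as before.
plain_entry = {"name": "some-tool", "github_id": "example-org/some-tool"}

for entry in (paper_entry, software_entry, plain_entry):
    print(entry["name"], has_publication(entry))
```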

Problems.

Proposal 01, problem 1. Some fields, e.g. computer science, publish in conference proceedings rather than journals, and such publications sometimes don't have DOIs. This requires a preprocessing step that reduces them to the "standard case" of a given DOI, e.g. by searching for the publication in publication metric aggregators. Once the publication is identified, the pipeline should be the same as for a DOI.

Irratzo commented 7 months ago

Research 01 Existing solutions for publication metrics definition and collection.

Description.

Addresses challenges 01 definition, 02 data collection, 05 implementation.

I don't know anything about this subject. In the absence of experts, I would do some research on these topics, including existing solutions on GitHub and elsewhere, and document the research here.

Irratzo commented 7 months ago

Proposal 02 Extend README project link embedding to include property publication_doi.

Description.

Addresses challenges 04 config format.

Depends on proposal 01.

Currently, the project URL embedded in the README is taken from one of the project properties homepage, docs_url, github_id, gitlab_id, in descending priority. The list should be extended to homepage, docs_url, github_id, gitlab_id, publication_doi. This is not necessary for projects with a repo, because in that case the other links usually point to the publication. But for publication lists, i.e. lists that only point to papers and not software packages, this extension is necessary.
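A minimal sketch of the extended link selection, assuming the priority order described above with the publication property (named publication_doi here, as in the proposal title) as the last fallback. The function and table names are hypothetical and this is not the generator's actual code. The URL prefixes for GitHub, GitLab, and the DOI resolver are the public ones.

```python
# Assumed priority order from the proposal, highest priority first.
LINK_PRIORITY = ["homepage", "docs_url", "github_id", "gitlab_id", "publication_doi"]

# How each property is turned into a URL; URL-valued properties pass through.
URL_BUILDERS = {
    "homepage": lambda value: value,
    "docs_url": lambda value: value,
    "github_id": lambda value: "https://github.com/" + value,
    "gitlab_id": lambda value: "https://gitlab.com/" + value,
    "publication_doi": lambda value: "https://doi.org/" + value,
}


def resolve_project_url(project):
    """Return the URL of the highest-priority property that is set, else None."""
    for key in LINK_PRIORITY:
        value = project.get(key)
        if value:
            return URL_BUILDERS[key](value)
    return None


# A publication-only entry falls through to the DOI link.
print(resolve_project_url({"publication_doi": "10.48550/arXiv.2212.07921"}))

# An entry with a repo keeps its repo link, as before the extension.
print(resolve_project_url({
    "gitlab_id": "TIBHannover/orkg/orkg-frontend",
    "publication_doi": "10.48550/arXiv.2212.07921",
}))
```

Placing the DOI last means existing software entries render exactly as before, and only entries without any other link are affected.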

Irratzo commented 7 months ago

Proposal 03 Allow only one publication per project.

Description.

Addresses challenges 04 config format.

Depends on proposal 01.

Some scientific software projects are tied to more than one publication. However, only one publication per list project should be allowed. For multi-publication projects, this could for instance be the original (first) or the latest publication. This keeps the implementation much simpler. For a software project, the other project links (homepage, docs, repo) can point to the other publications. If required, the template user can already group projects with the template's group feature.

Irratzo commented 7 months ago

Proposal 04 Combine quality scores as a simple weighted sum.

Description.

Addresses challenges 03 score combination.

Depends on proposal 01.

The simplest way to combine the 'software quality score' $s_1$ (the current best-of project quality score, based on repo and/or package manager metrics) and the publication quality score $s_2$ into a combined project quality score is the weighted sum $s = (1-\alpha) s_1 + \alpha s_2$ with ratio $\alpha \in [0,1]$ (a convex combination) and simplest default $\alpha = 0.5$. If the project has no publication_doi, the ratio is automatically set to $\alpha = 0$.
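The weighted sum can be sketched as follows. The function name and signature are hypothetical. Passing `None` as the publication score stands in for a project without publication_doi. The sketch also assumes both scores are already on comparable scales, which the proposal does not yet specify.

```python
def combined_score(software_score, publication_score=None, alpha=0.5):
    """Convex combination s = (1 - alpha) * s1 + alpha * s2.

    software_score: s1, the current best-of project quality score.
    publication_score: s2, or None if the project has no linked publication.
    alpha: weight of the publication score, must lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if publication_score is None:
        # No publication_doi: fall back to the software score alone.
        alpha = 0.0
        publication_score = 0.0
    return (1.0 - alpha) * software_score + alpha * publication_score


print(combined_score(20.0, 10.0))              # default alpha = 0.5
print(combined_score(20.0, 10.0, alpha=0.25))  # software score weighted more
print(combined_score(20.0))                    # no publication linked
```

With the default ratio, scores of 20 and 10 combine to 15. Without a publication the result is the unchanged software score of 20, so existing lists keep their current ranking.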