arXiv / zzzArchived_arxiv-external-links

Clearinghouse for relations between arXiv e-prints and external resources
MIT License

Use-case: ingest and display data about e-prints from Papers with Code #22

Closed by erickpeirson 2 years ago

erickpeirson commented 5 years ago

Papers with Code finds code repositories associated with ML papers, including e-prints on arXiv. They make their data available under CC-BY-SA. We should explore what would be involved in incorporating this dataset into arXiv external links, and displaying links to the code repositories on the arXiv abs page of ML papers.

@rstojnic what do you think?

rstojnic commented 5 years ago

Hi @erickpeirson happy to help! If you want to get in touch via email: hello@paperswithcode.com.

erickpeirson commented 5 years ago

A few notes on this, for when we get closer to implementation...

This is a good test case for integrating relations from an external platform that continuously extends and improves its data, and that provides a much richer data structure than we can reasonably accommodate (or would want to replicate). Beyond avoiding the complexity of data maintenance and consistency, there is value in connecting end users directly to the data provider, e.g. to generate awareness of the value of distributed infrastructure, and to foster the community that is generating and curating the data.

Rather than ingesting relations between e-prints and individual code resources, therefore, we should focus on adding relations between e-prints and the PwC resource for that e-print. In other words, instead of:

arXiv e-print ---> GitHub repo 1
arXiv e-print ---> GitHub repo 2
arXiv e-print ---> GitHub repo 3

we would do:

                                         |-> GitHub repo 1
arXiv e-print ---> PwC view for e-print -|-> GitHub repo 2
                                         |-> GitHub repo 3 

We will still want to either consult the PwC data dump or (if available in the future) hit their API, but with the objective of identifying which e-prints are represented in their dataset rather than pulling in each individual link.
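A minimal sketch of that reduction, to make the idea concrete. It assumes each record in the PwC dump carries an arXiv identifier and the URL of the PwC page for that paper; the field names `paper_arxiv_id`, `paper_url`, and `repo_url`, the function name, and the sample values are all assumptions for illustration and would need to be checked against the actual dump.

```python
from typing import Dict, Iterable


def pwc_relations(records: Iterable[dict]) -> Dict[str, str]:
    """Collapse per-repository records into one relation per e-print.

    Returns a mapping from arXiv ID to the PwC page for that e-print.
    Individual repository links are deliberately ignored.
    """
    relations: Dict[str, str] = {}
    for record in records:
        # Field names are assumed, not confirmed against the real dump.
        arxiv_id = record.get("paper_arxiv_id")
        pwc_url = record.get("paper_url")
        if not arxiv_id or not pwc_url:
            # Skip papers that are not on arXiv or have no PwC page.
            continue
        # Keep the first PwC page seen for each e-print.
        relations.setdefault(arxiv_id, pwc_url)
    return relations


# Made-up sample records: two repos for one e-print, one repo for another,
# and one paper without an arXiv ID.
sample = [
    {"paper_arxiv_id": "0000.00001", "paper_url": "https://pwc.example/paper/a",
     "repo_url": "https://github.example/org/repo-1"},
    {"paper_arxiv_id": "0000.00001", "paper_url": "https://pwc.example/paper/a",
     "repo_url": "https://github.example/org/repo-2"},
    {"paper_arxiv_id": "0000.00002", "paper_url": "https://pwc.example/paper/b",
     "repo_url": "https://github.example/org/repo-3"},
    {"paper_arxiv_id": None, "paper_url": "https://pwc.example/paper/c",
     "repo_url": "https://github.example/org/repo-4"},
]

relations = pwc_relations(sample)
```

With this shape, arXiv external links would store one relation per e-print regardless of how many repositories PwC lists for it, and any repositories added later on the PwC side would show up without a re-ingest.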

bdc34 commented 2 years ago

This is done in labs.