Rescue Mission for Sci-Hub and Open Science

We need to do something for Open Access.

Background

Sci-Hub is a shadow library website that provides free access to millions of research papers and books, without regard to copyright, by bypassing publishers' paywalls in various ways. Sci-Hub was founded by Alexandra Elbakyan in 2011 in Kazakhstan in response to the high cost of research papers behind paywalls.

from Wikipedia

On May 7th, Sci-Hub's Alexandra Elbakyan revealed that the FBI has been wiretapping her accounts for over 2 years. This news comes after Twitter silenced the official Sci_Hub Twitter account because Indian academics were organizing on it against Elsevier.

Sci-Hub itself is currently frozen and has not downloaded any new articles since December 2020. This rescue mission is focused on seeding the article collection in order to prepare for a potential Sci-Hub shutdown.

from reddit

Sci-Hub currently holds 85,483,812 papers, with a total size of about 77 TB. The Rescue Mission from Reddit uses BitTorrent to distribute them, split into 850 torrents of roughly 100 GB each. This is a good start, but it is not enough:

For storage providers: committing 100 GB or 1 TB is too much to ask of most volunteers, and the node has to stay online to seed.
For end users: they still depend on a centralized service to get a single paper.
For the global network: copies of the data that already exist elsewhere cannot be reused.
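
To put the figures above in perspective, here is a rough back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes) and an even split across torrents, neither of which is stated in the source.

```python
# Figures quoted above. Decimal units (1 TB = 10**12 bytes) are an assumption.
PAPERS = 85_483_812
TOTAL_BYTES = 77 * 10**12
TORRENTS = 850

# Assumes the collection is split evenly across torrents.
bytes_per_torrent = TOTAL_BYTES / TORRENTS
papers_per_torrent = PAPERS / TORRENTS
bytes_per_paper = TOTAL_BYTES / PAPERS

print(f"per torrent: {bytes_per_torrent / 10**9:.1f} GB, "
      f"about {papers_per_torrent:,.0f} papers")
print(f"per paper:   {bytes_per_paper / 10**6:.2f} MB on average")
```

Under these assumptions a single torrent carries on the order of 100,000 papers, while the average paper is under 1 MB. A user who wants one paper is therefore several orders of magnitude away from the smallest unit a seeder can offer.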

Motivation

We can store the papers (PDFs) on IPFS so that they are much harder to take down.

IPFS is a P2P hypermedia protocol:

IPFS addresses content by its hash, so a corrupted or tampered file is detected instead of being silently served.
IPFS transfers data peer to peer instead of through a centralized node.
IPFS deduplicates data, because identical content always has the same hash.
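
The content-addressing idea behind these properties can be illustrated with a toy store. This is only a sketch: `ContentStore` is a hypothetical class, and it uses a plain SHA-256 hex digest as the key, whereas real IPFS splits files into blocks and identifies them with CIDs.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: a blob's key is the hash of its bytes.

    Hypothetical illustration only; real IPFS uses chunking and CIDs.
    """

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content yields the same key, so it is stored only once.
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Re-hash on read: corruption shows up as a key mismatch.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its hash")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"%PDF-1.4 example paper")
second = store.put(b"%PDF-1.4 example paper")  # same bytes, same key
```

Adding the same bytes twice returns the same key and keeps a single copy, and any peer can check what it received by re-hashing it, with no need to trust the sender.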

So IPFS is a good fit for us.

Option: IPFS cluster

We can set up an IPFS cluster holding the whole dataset and allow users to set up their own.

This method:

Requires the user to run an IPFS cluster storing the full 77 TB of data.
Allows the user to build an API on top of the data.
Allows the user to fetch a single paper by its hash.

Option: IPFS Index

We only maintain the index of papers:

DOI -> Paper Hash
Title -> Paper Hash
... -> Paper Hash

We can then provide APIs for:

Insert new papers
Querying papers by DOI / title / ...
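
A minimal in-memory sketch of such an index might look like the following. `PaperIndex`, its method names, and the sample DOI and hash are hypothetical and only illustrate the mapping; a real service would need persistent, shared storage.

```python
from typing import Optional


class PaperIndex:
    """Hypothetical in-memory index mapping metadata to a paper's content hash."""

    def __init__(self):
        self._by_doi = {}
        self._by_title = {}

    @staticmethod
    def _normalize(text: str) -> str:
        # DOIs are case-insensitive; titles are matched loosely here by
        # lowercasing and collapsing whitespace (a simplifying assumption).
        return " ".join(text.lower().split())

    def insert(self, doi: str, title: str, paper_hash: str) -> None:
        self._by_doi[self._normalize(doi)] = paper_hash
        self._by_title[self._normalize(title)] = paper_hash

    def query_by_doi(self, doi: str) -> Optional[str]:
        return self._by_doi.get(self._normalize(doi))

    def query_by_title(self, title: str) -> Optional[str]:
        return self._by_title.get(self._normalize(title))


index = PaperIndex()
# Placeholder DOI and hash, not real identifiers.
index.insert("10.1000/Example.1", "An  Example Paper", "example-paper-hash")
found = index.query_by_doi("10.1000/example.1")
```

The index stores only small key-to-hash entries, so it stays tiny compared with the papers themselves. The content can then be fetched from any IPFS node that holds the hash.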

Unlike the IPFS cluster option, here we only maintain the index/database of papers, not the papers themselves.

Going further, we could build the index as a distributed DB on top of IPFS (maybe with OrbitDB).

Related projects

Once go-service-ipfs has been implemented, we can operate on data stored in IPFS.
Once https://github.com/beyondstorage/specs/issues/53 is implemented, we can store data in IPFS with go-storage as the backend.

beyondstorage / go-storage