access-news / _

Repo for Access News project management.

Research content storage options #4

Open toraritte opened 1 year ago

toraritte commented 1 year ago

Access News content is currently scattered all over: OneDrive, Google Drive, internal server, TR2 production server. My OneDrive account is already running out of space, and the organization of files leaves much to be desired everywhere.

Requirements for a solution to be considered:

1. reduce costs
2. vendor-agnostic
3. fault tolerance
4. optional encryption
5. crowd-sourceable
6. media access
7. handle volunteer submissions

Currently investigating IPFS and Tahoe-LAFS as they both appear to satisfy the criteria above, and they can even be used together. (A somewhat off-topic thread about working with copyrighted content.)

toraritte commented 1 year ago

IPFS notes

NOTE I have read about 80% of the documentation and related materials at this point, so I feel confident that my understanding of the basics is sound, but I will update this if needed.

IPFS stands for InterPlanetary File System, and it is a distributed peer-to-peer file sharing protocol. One of the project's goals is to provide a decentralized alternative to today's web.1

[1]: Originally, the internet was envisioned as a "patchwork of decentralized networks"; instead, the traffic and the majority of the infrastructure are now in the hands of a handful of companies.

1. How IPFS works

The IPFS website has a very good introduction, but the (super oversimplified) gist is that content (mostly files) shared on IPFS networks gets cut up into smaller chunks ("blocks" in IPFS lingo), which are then shared indiscriminately with every participating node (i.e., with everyone who has installed the IPFS desktop application or uses an IPFS implementation in some other way).
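The chunking-and-addressing idea above can be sketched in a few lines of Python. This is a toy illustration, not how any IPFS implementation actually works: real IPFS wraps a SHA-256 digest in a multihash/CID, links the blocks into a Merkle DAG, and advertises the addresses on a DHT.

```python
import hashlib

BLOCK_SIZE = 256 * 1024  # IPFS's default chunk size is 256 KiB

def chunk_and_address(data: bytes, block_size: int = BLOCK_SIZE):
    """Split content into blocks and derive a hash-based address for each.

    Content addressing means the address is computed FROM the data, so
    identical blocks always get the same address, no matter who shares them.
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [(hashlib.sha256(b).hexdigest(), b) for b in blocks]

addressed = chunk_and_address(b"x" * 600_000)
print(len(addressed))        # 3 blocks for ~600 kB of content
print(addressed[0][0][:16])  # the first block's (toy) address
```

Note how the first two blocks (both 256 KiB of identical bytes) get the same address, which is also what lets nodes deduplicate and cache blocks safely.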

2. Evaluation

The quick "score card" based on the criteria in the issue description:

| # | criterion | verdict |
|---|-----------|---------|
| 1 | reduce costs | yes, but with extra effort |
| 2 | vendor-agnostic | yes |
| 3 | fault tolerance | yes, but with extra effort |
| 4 | optional encryption | yes, but with extra effort |
| 5 | crowd-sourceable | yes |
| 6 | media access | yes (but with extra effort?) |
| 7 | handle volunteer submissions | maybe |

2.1 (no. 1 & 3) Reduce costs and fault tolerance

There is a great summary in this Hacker News post but to elaborate:

Even though the IPFS model works in a BitTorrent-like manner, caching blocks on each node that a user request passes through, the requested content still has to be stored somewhere, and that is usually the local storage of the user who originally shared it.

When the "sharing user" adds a file to the IPFS network, it gets split into blocks, and the address of each block gets advertised to the network. When this file is requested remotely, these blocks may travel on completely different paths, probably over multiple nodes (this part is still fuzzy), to the requester. Nodes cache every block that passes through them, so the next time a user needs this file, the blocks may be retrieved from nodes much closer to them.

To prevent nodes from caching indefinitely (and consuming all of a user's storage), there is a garbage collection process that periodically (every 90 minutes, by default) deletes blocks that haven't been referenced by any request for a certain amount of time. If someone would like to prevent this (e.g., because they need the files later), files can be pinned.
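The cache/pin/garbage-collect interplay described above can be modeled in a short sketch. The class and method names here are made up for illustration; a real node tracks pins per CID, and `ipfs repo gc` only deletes blocks that are neither pinned nor otherwise referenced.

```python
import time

class ToyBlockStore:
    """Toy model of an IPFS node's block cache with pinning and GC."""

    def __init__(self):
        self.blocks = {}   # address -> (data, last_requested timestamp)
        self.pinned = set()

    def cache(self, addr, data):
        self.blocks[addr] = (data, time.monotonic())

    def request(self, addr):
        data, _ = self.blocks[addr]
        self.blocks[addr] = (data, time.monotonic())  # refresh on access
        return data

    def pin(self, addr):
        self.pinned.add(addr)

    def gc(self, max_age_seconds):
        """Delete unpinned blocks not requested within max_age_seconds."""
        now = time.monotonic()
        for addr, (_, last) in list(self.blocks.items()):
            if addr not in self.pinned and now - last > max_age_seconds:
                del self.blocks[addr]

store = ToyBlockStore()
store.cache("blockA", b"...")
store.cache("blockB", b"...")
store.pin("blockB")
store.gc(max_age_seconds=-1)  # force-expire everything unpinned
print(sorted(store.blocks))   # only the pinned block survives: ['blockB']
```

Pinning is what "storage donors" would rely on: a pinned block is exempt from GC for as long as the pin is kept.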

The corollary of block caching and user pinning is that when a previously shared file is deleted from the origin node, it is not necessarily truly gone: if its blocks are still cached in the network when the file is requested, or if it is pinned on at least one node, it can still be retrieved.

2.1.1 Why the "extra effort" caveat then?

Because this requires coordination of pinning among "storage donors" (i.e., volunteers, participating audio reading services and organizations, etc.):

TODO: This coordination could be done manually (but let's dismiss that out of hand), or by building an application layer over IPFS. I have asked about this in this post on the IPFS forum.

2.2 (no. 2) Vendor-agnosticity

It doesn't matter where IPFS nodes are running (home computer, cloud, etc.) or where they store the service's audio content (local filesystem, database, etc.), as long as "storage donors" are pinning whatever they agreed to pin.

2.3 (no. 4) Optional encryption

See the last paragraph of this section in the IPFS documentation. The gist is that anything shared on the IPFS network is publicly available, but files can be encrypted before sharing.

There is a warning in the article above that encryption gets weaker over time as new attacks are invented and computer hardware improves; this is not a concern for us though, as we are not sharing sensitive data, and we are only doing this to comply with 17 U.S. Code §121 (see this and this).
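The encrypt-before-sharing flow could look something like the sketch below. The choice of the `cryptography` package's Fernet recipe is an assumption (any symmetric scheme would do); IPFS itself encrypts nothing.

```python
# Sketch: encrypt a recording before it is added to IPFS, so that the
# (publicly retrievable) blocks are only readable by key holders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # would be distributed only to eligible users
f = Fernet(key)

recording = b"audio bytes of a copyrighted recording"
ciphertext = f.encrypt(recording)  # this ciphertext is what gets shared

# Anyone can fetch the ciphertext from the network, but only key holders
# can recover the original content:
assert f.decrypt(ciphertext) == recording
```

The hard part is not this snippet but key management: generating, rotating, and handing out keys to eligible users, which is exactly the "extra effort".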

2.3.1 Why the "extra effort" caveat?

Because IPFS has no built-in encryption: files have to be encrypted before they are added to the network, and the decryption keys have to be distributed to eligible users somehow.

2.4 (no. 5) Crowd-sourceable

No issues here; most of the workings are explained above.

2.5 (no. 6) Media access

Don't know about the specifics yet, but it can be done:

TODO: There is also diffuse (or the online version, diffuse.sh) that streams ... directly from IPFS? Look into this.

2.6 (no. 7) Handle volunteer submissions

2.6.1 Volunteer submissions directly into IPFS

Nothing prevents volunteers from installing an IPFS node and sharing their recordings directly, but this should be coordinated (or even discouraged) so that they don't accidentally commit copyright infringement; I haven't heard of any such precedents on IPFS, but still, no one should get into trouble because they are trying to help us.

Therefore, there should be clear guidelines on how to encrypt copyrighted files before sharing, but this already raises a lot of questions:

2.6.2 Centralized recording submission (hiding IPFS layer)

This means sticking to the original idea of creating web, mobile, and desktop applications for volunteers to submit their recordings, and we'll take care of sharing the recordings (and the optional encryption) on the back-end (or on the client; e.g., there is JS-IPFS to be used in the browser, but also embeddable implementations for other purposes).