NOTE: I have read about 80% of the documentation and related materials at this point, so I feel confident that my understanding of the basics is sound, but I will update this if needed.
IPFS stands for InterPlanetary File System, and it is a distributed peer-to-peer file sharing protocol. One of the project's goals is to provide a decentralized alternative to today's web.1
[1]: Originally, the internet was envisioned "as a patchwork of decentralized networks"; instead, the traffic and the majority of the infrastructure are now in the hands of a handful of companies.
The IPFS website has a very good introduction, but the (super oversimplified) gist is that content (mostly files) shared on IPFS networks gets cut up into smaller chunks ("blocks" in IPFS lingo), which are then shared indiscriminately with every participating node (i.e., with every person who has installed the IPFS desktop application or uses an IPFS implementation in some other way).
The quick "score card" based on the criteria in the issue description:
| # | criterion | verdict |
|---|---|---|
| 1 | reduce costs | yes, but with extra effort |
| 2 | vendor-agnostic | yes |
| 3 | fault tolerance | yes, but with extra effort |
| 4 | optional encryption | yes, but with extra effort |
| 5 | crowd-sourceable | yes |
| 6 | media access | yes (but with extra effort?) |
| 7 | handle volunteer submissions | maybe |
There is a great summary in this Hacker News post, but to elaborate:
Even though the IPFS model works in a BitTorrent-like manner, caching blocks on each node that a user's request goes through, the requested content still has to be stored somewhere, and that is usually the local storage of the user who originally shared it.
When the "sharing user" adds a file to the IPFS network, it gets split into blocks, and the addresses of each block gets advertised to the network. When this file is reguested remotely, these blocks may travel on completely different paths, probably over multiple nodes (this part is still fuzzy), to the requester. Nodes cache every block that goes through them, so next time a user needs this file, the blocks may be retrieved from nodes much closer to them.
To prevent nodes from caching blocks indefinitely (and consuming all of a user's storage), there is a garbage-collection process that periodically (every 90 minutes, by default) deletes blocks that haven't been referenced by any request for a certain amount of time. If someone would like to prevent this (e.g., because they need the files later), files can be pinned.
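Pinning itself is a one-liner over the same HTTP API; pinned blocks are simply skipped by the garbage collector (same assumptions as the sketch above, and the CID is a placeholder):

```typescript
import { create } from "ipfs-http-client";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// Placeholder: in practice, this would be the CID printed by the previous sketch.
const cid = "QmSomePreviouslyAddedContent";

// Pinning tells the local garbage collector to never delete these blocks.
await ipfs.pin.add(cid);

// Unpinning makes them collectable again on the next GC run:
// await ipfs.pin.rm(cid);
```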
The corollary of block caching and user pinning is that when a previously shared file is deleted from the origin node, it is not necessarily truly gone: if its blocks are still cached in the network when the file is requested, or if it is pinned on at least one node, it can still be retrieved.
This requires coordinating pinning among "storage donors" (i.e., volunteers, participating audio reading services and organizations, etc.), because:
The data set is quite sizable, and we can't expect everyone to host terabytes of audio.
Not everyone (audio reading services or individuals) will be interested in all of the content. By "individuals", I mean subscribers, but also members of the public who elect to listen to content directly from IPFS, as the majority of the content is not copyrighted.
(As for listening to IPFS streams, it would probably be possible with diffuse or the online version, diffuse.sh; TODO: research.)
What happens if a "storage donor" quits?
By the nature of volunteer commitments, this is absolutely fine, but what if they were the only one pinning a particular set of audio content? And while we are here:
How to know how many "storage donors" are pinning what files?
IPFS doesn't have a native way of knowing who pins what; it is a collection of low-level protocols on top of which services can be built. I asked for clarification in this post on the IPFS forum.
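What a node can do is enumerate its own pins; a network-wide "who pins what" view would have to be built as a service on top. A minimal sketch of the local query, with the same `ipfs-http-client` assumptions as above:

```typescript
import { create } from "ipfs-http-client";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// A node only knows about its *own* pins; there is no built-in
// network-wide query. Each "storage donor" would have to report
// this list to whatever coordination layer we come up with.
for await (const pin of ipfs.pin.ls()) {
  console.log(pin.type, pin.cid.toString()); // e.g., "recursive QmSome..."
}
```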
"Storage donors" may choose to pin (a subset of) the dataset on their computers, which may not be turned on all the time
This scenario is fine from the redundancy standpoint (the data is there), but if they are the only ones pinning that particular content, then it may be unavailable for folks who want to listen to it
"uncoordinated" pins will always be encouraged but cannot be relied on
TODO: This coordination could be done manually (something we should probably dismiss out of hand) or by building an application layer over IPFS. Again, I asked about this in this post on the IPFS forum.
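One (purely hypothetical) shape such an application layer could take: the coordinating service publishes a manifest of CIDs, and each "storage donor" runs a small sync script against their own node. The manifest format and URL below are made up:

```typescript
import { create } from "ipfs-http-client";

// Hypothetical: a published list of CIDs that donors agreed to keep pinned.
const MANIFEST_URL = "https://example.org/access-news/pinset.json";

const ipfs = create({ url: "http://127.0.0.1:5001" });

const manifest: { cids: string[] } = await (await fetch(MANIFEST_URL)).json();

for (const cid of manifest.cids) {
  // Re-pinning an already-pinned CID is harmless, so this can run on a timer.
  await ipfs.pin.add(cid);
  console.log("pinned", cid);
}
```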
It doesn't matter where IPFS nodes are running (home computer, cloud, etc.) or where they store the service's audio content (local filesystem, database, etc.), as long as "storage donors" are pinning whatever they agreed to pin.
See the last paragraph in this section of the IPFS documentation. The gist is that anything one shares in the IPFS network is publicly available, but files can be encrypted before sharing.
There is a warning in the article above that encryption gets weaker over time, both through the invention of newer methods of breaking it and through improvements in computer hardware; this is not a concern for us, though, as we are not sharing sensitive data and are only doing this to comply with 17 U.S. Code §121 (see this and this).
Because:
A scheme will need to be drawn up for how to use public key encryption between participating organizations (and tech-savvy subscribers).
This is similar to public key encryption when securing email communications, but with the added twist of encrypting a "message" to be sent to multiple recipients. This is not a novel problem (see this or this thread; there is also a sketch after this list), but I only know the basics of cryptography.
There will be a technical overhead for eligible parties (again, organizations and adventurous subscribers) to decrypt the files.
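The textbook approach to the multiple-recipients twist is hybrid encryption: encrypt the file once with a random symmetric key, then encrypt that small key separately for each recipient's public key. A hedged sketch with Node's built-in `crypto` module (the recipient list and function name are made up):

```typescript
import { createCipheriv, publicEncrypt, randomBytes, KeyObject } from "node:crypto";

// Hybrid encryption: one symmetric pass over the (large) audio file, plus
// one small public-key encryption of the file key per recipient.
function encryptForRecipients(
  plaintext: Buffer,
  recipients: Map<string, KeyObject>, // name -> RSA public key
) {
  const fileKey = randomBytes(32); // AES-256 key, used only for this file
  const iv = randomBytes(12);      // GCM nonce

  const cipher = createCipheriv("aes-256-gcm", fileKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const authTag = cipher.getAuthTag();

  // Each eligible party gets the *file key* wrapped with their public key;
  // the bulky ciphertext itself only needs to be shared once (e.g., over IPFS).
  const wrappedKeys = new Map<string, Buffer>();
  for (const [name, publicKey] of recipients) {
    wrappedKeys.set(name, publicEncrypt(publicKey, fileKey));
  }

  return { ciphertext, iv, authTag, wrappedKeys };
}
```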
Some decryption methods that come to mind:
on request, right before playback, the file(s) are decrypted into temporary storage, which gets purged after the file has finished playing (either periodically or right away)
all the copyrighted files are decrypted onto private storage and played back from there
Both methods increase complexity significantly, and without ready-made tools (either by us or by someone else), most of this content-sharing idea is stillborn. Also, the second method is perhaps simpler, but it also means that twice as much storage space is needed (each encrypted file will have to be kept to continue sharing it over IPFS, and the decrypted files will have to be stored in a separate place).
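For the first method, the decryption side could mirror the encryption sketch above; a simplified sketch (the playback hook is hypothetical):

```typescript
import { createDecipheriv, privateDecrypt, KeyObject } from "node:crypto";
import { writeFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// First decryption method from the list above: decrypt into temporary
// storage right before playback, then purge the plaintext afterwards.
async function playDecrypted(
  ciphertext: Buffer,
  iv: Buffer,
  authTag: Buffer,
  wrappedKey: Buffer,    // this recipient's wrapped copy of the file key
  privateKey: KeyObject, // this recipient's RSA private key
  play: (path: string) => Promise<void>, // hypothetical playback hook
) {
  // Unwrap the file key, then decrypt and authenticate the audio file.
  const fileKey = privateDecrypt(privateKey, wrappedKey);
  const decipher = createDecipheriv("aes-256-gcm", fileKey, iv);
  decipher.setAuthTag(authTag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);

  const tmpPath = join(tmpdir(), `playback-${Date.now()}.mp3`);
  await writeFile(tmpPath, plaintext);
  try {
    await play(tmpPath);
  } finally {
    await rm(tmpPath, { force: true }); // purge right away
  }
}
```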
PROPOSAL / TODO: Use Tahoe-LAFS for copyrighted content. As far as I know, files / directories can be shared (with read / write permissions) with others, where the Tahoe-LAFS application takes care of the encryption automatically, and each party is responsible for keeping their end secure.
Some questions:
No issues here, and most of the workings are explained above.
Don't know about the specifics yet, but it can be done:
TODO: There is also diffuse (or the online version, diffuse.sh) that streams ... directly from IPFS? Look into this.
Nothing prevents volunteers from installing an IPFS node and sharing their recordings, but this should be coordinated (or even discouraged) so that they don't accidentally commit copyright violations; I haven't heard of any such precedents on IPFS, but still: no one should get into trouble because they are trying to help us.
Therefore, there should be clear guidelines on how to encrypt copyrighted files before sharing, but this already raises a lot of questions:
See all the questions / concerns at section "2.3 Optional encryption" above.
TODO: How can such a submission fit into the audio content data set? Can someone share files into someone else's directory? (I think the answer is yes, but will have to try it out.)
Submissions will need to be added to the pin-list(s) / pinset(s) to be coordinated. (That is, when people publicly share their recordings, we / other services need to be notified so that the content can be pinned and doesn't get lost when the user decides to delete the file or stop volunteering.)
TODO: Can directories be pinned? It would make things much simpler.
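From what I can tell, yes: adding files under a common path yields a single root directory CID, and pinning is recursive by default, so pinning that root keeps the whole directory out of GC's reach. A minimal sketch (file names made up, same `ipfs-http-client` assumptions as above):

```typescript
import { create } from "ipfs-http-client";
import { readFile } from "node:fs/promises";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// Adding files under a common path (plus wrapWithDirectory) produces a
// root directory CID in addition to the per-file CIDs.
let rootCid;
for await (const entry of ipfs.addAll(
  [
    { path: "shows/episode-001.mp3", content: await readFile("episode-001.mp3") },
    { path: "shows/episode-002.mp3", content: await readFile("episode-002.mp3") },
  ],
  { wrapWithDirectory: true },
)) {
  console.log(entry.path, entry.cid.toString());
  if (entry.path === "") rootCid = entry.cid; // the wrapping directory itself
}

// Pinning is recursive by default, so this keeps every file in the
// directory (and the directory structure itself) from being collected.
if (rootCid) await ipfs.pin.add(rootCid);
```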
This means sticking to the original idea of creating web, mobile, and desktop applications for volunteers to submit their recordings, and we'll take care of sharing the recordings (and optional encryption) on the back-end (or on the client; e.g., there is JS-IPFS to be used in the browser, but also embeddable implementations for other purposes).
Access News content is currently scattered all over: OneDrive, Google Drive, internal server, TR2 production server. My OneDrive account is already running out of space, and the organization of files leaves much to be desired everywhere.
Requirements for a solution to be considered:
reduce costs There are "cold storage" tiers offered by most major cloud vendors that make storage costs very low, but a downside is that the data is not readily accessible, and it would also be hard to share with other organizations.
vendor-agnostic If we need to migrate to another cloud vendor, then terabytes of data will need to be copied, but this is also an issue when it comes to:
fault tolerance Probably every cloud vendor has disaster recovery solutions in place, but setting them up means extra costs and would exacerbate future migrations to other vendors if necessary. Also, Access News media is neither highly sensitive information, nor is it a disaster if it gets lost, although it would be quite an inconvenience, so it would be nice to plan to keep it intact. The only criterion is:
optional encryption A large portion of the content is freely available (e.g., old-time radio shows) or should be (e.g., store sale ads, free newspaper and magazine articles), and thus these could be disseminated openly, even to folks who are not subscribers. Copyrighted material, on the other hand, needs to be encrypted, and access should only be provided to individuals with an eligible disability. (Legal foundations of reading services in the US: short, long.)
crowd-source-able As the majority of the content is in the public domain (and if optional encryption is possible), it should be possible for folks to volunteer as "storage donors".
human- and machine-friendly access to media in the content catalog One should be able to browse the catalog directly (like a file system, through a web interface, etc.) and play the media in any way (online stream in audio players that support it, from a website, on the command line, etc.). (Copyrighted content would of course need authentication.) This covers the human aspect, but the catalog should also be accessible through frontends, such as TR2, the website, mobile apps, etc., so items should (1) be streamable through the network and (2) have an access method that is widely supported (e.g., HTTP/S links are supported by FreeSWITCH; see the gateway sketch after this list).
handle volunteer submissions Can IPFS or Tahoe-LAFS (see below) handle this, either directly or indirectly?
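Regarding requirement 6: any public IPFS content is also reachable over plain HTTP/S through a gateway, either a node's own (port 8080 by default) or a public one, which would cover clients that only speak HTTP, such as FreeSWITCH. A sketch with a placeholder CID:

```typescript
// Any IPFS gateway can serve content by CID over plain HTTP/S.
const cid = "QmSomeAudioFileCid"; // placeholder for a real content identifier

const urls = [
  `http://127.0.0.1:8080/ipfs/${cid}`, // a local node's own gateway
  `https://ipfs.io/ipfs/${cid}`,       // a public gateway
];

// URLs like these are what could be handed to FreeSWITCH, a website's
// <audio> tag, mobile apps, command-line players, etc.
const response = await fetch(urls[1]);
console.log(response.status, response.headers.get("content-type"));
```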
Currently investigating IPFS and Tahoe-LAFS, as they both appear to satisfy the criteria above, and they can even be used together. (A somewhat off-topic thread on working with copyrighted content.)