NOTE: I have read about 80% of the documentation and related materials at this point, so I feel confident that my understanding of the basics is sound, but I will update this if needed.
IPFS stands for InterPlanetary File System, and it is a distributed peer-to-peer file sharing protocol. One of the project's goals is to provide a decentralized alternative to today's web.1
[1]: Originally, the internet was envisioned "as a patchwork of decentralized networks"; instead, the traffic and the majority of the infrastructure are now in the hands of a handful of companies.
The IPFS website has a very good introduction, but the (super oversimplified) gist is that content (mostly files) shared on IPFS networks gets cut up into smaller chunks ("blocks" in IPFS lingo), which are then shared indiscriminately with every participating node (i.e., with every person who has installed the IPFS desktop application or uses an IPFS implementation in some other way).
The quick "score card" based on the criteria in the issue description:
| # | criterion | verdict |
|---|---|---|
| 1 | reduce costs | yes, but with extra effort |
| 2 | vendor-agnostic | yes |
| 3 | fault tolerance | yes, but with extra effort |
| 4 | optional encryption | yes, but with extra effort |
| 5 | crowd-sourceable | yes |
| 6 | media access | yes (but with extra effort?) |
| 7 | handle volunteer submissions | maybe |
There is a great summary in this Hacker News post, but to elaborate:
Even though the IPFS model works in a BitTorrent-like manner, caching blocks on each node that a user's request goes through, the requested content still has to be stored somewhere, and that is usually the local storage of the user who originally shared it.
When the "sharing user" adds a file to the IPFS network, it gets split into blocks, and the addresses of each block gets advertised to the network. When this file is reguested remotely, these blocks may travel on completely different paths, probably over multiple nodes (this part is still fuzzy), to the requester. Nodes cache every block that goes through them, so next time a user needs this file, the blocks may be retrieved from nodes much closer to them.
To prevent nodes from caching blocks indefinitely (and consuming all of a user's storage), there is a garbage-collection process that periodically (every 90 minutes, by default) deletes blocks that haven't been referenced by any request for a certain amount of time. If someone would like to prevent this (e.g., because they need the files later), files can be pinned.
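Pinning itself is a one-liner over the same HTTP API; pinned blocks are simply skipped by the garbage collector (same assumptions as the sketch above, and the CID is a placeholder):

```typescript
import { create } from "ipfs-http-client";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// Placeholder: in practice, this would be the CID printed by the previous sketch.
const cid = "QmSomePreviouslyAddedContent";

// Pinning tells the local garbage collector to never delete these blocks.
await ipfs.pin.add(cid);

// Unpinning makes them collectable again on the next GC run:
// await ipfs.pin.rm(cid);
```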
The corollary of block caching and user pinning is that when a previously shared file is deleted from the origin node, it is not necessarily truly gone: if its blocks are still cached in the network when the file is requested, or if it is pinned on at least one node, it can still be retrieved.
This requires coordinating pinning among "storage donors" (i.e., volunteers, participating audio reading services and organizations, etc.), because:
The data set is quite sizable, and we can't expect everyone to host terabytes of audio.
Not everyone (audio reading services or individuals) will be interested in all of the content. By "individuals", I mean subscribers, but also members of the public who elect to listen to content directly from IPFS, as the majority of the content is not copyrighted.
(As for listening to IPFS streams, it would probably be possible with diffuse or the online version, diffuse.sh; TODO: research.)
What happens if a "storage donor" quits?
By the nature of volunteer commitments, this is absolutely fine, but what if they were the only one pinning a particular set of audio content? And while we are here:
How to know how many "storage donors" are pinning what files?
IPFS doesn't have a native way of knowing who pins what; it is a collection of low-level protocols on top of which services can be built. I asked for clarification in this post on the IPFS forum.
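What a node can do is enumerate its own pins; a network-wide "who pins what" view would have to be built as a service on top. A minimal sketch of the local query, with the same `ipfs-http-client` assumptions as above:

```typescript
import { create } from "ipfs-http-client";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// A node only knows about its *own* pins; there is no built-in
// network-wide query. Each "storage donor" would have to report
// this list to whatever coordination layer we come up with.
for await (const pin of ipfs.pin.ls()) {
  console.log(pin.type, pin.cid.toString()); // e.g., "recursive QmSome..."
}
```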
"Storage donors" may choose to pin (a subset of) the dataset on their computers, which may not be turned on all the time
This scenario is fine from the redundancy standpoint (the data is there), but if they are the only ones pinning that particular content, then it may be unavailable for folks who want to listen to it
"uncoordinated" pins will always be encouraged but cannot be relied on
TODO: This coordination could be done manually (something we should probably dismiss out of hand) or by building an application layer over IPFS. Again, I asked about this in this post on the IPFS forum.
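One (purely hypothetical) shape such an application layer could take: the coordinating service publishes a manifest of CIDs, and each "storage donor" runs a small sync script against their own node. The manifest format and URL below are made up:

```typescript
import { create } from "ipfs-http-client";

// Hypothetical: a published list of CIDs that donors agreed to keep pinned.
const MANIFEST_URL = "https://example.org/access-news/pinset.json";

const ipfs = create({ url: "http://127.0.0.1:5001" });

const manifest: { cids: string[] } = await (await fetch(MANIFEST_URL)).json();

for (const cid of manifest.cids) {
  // Re-pinning an already-pinned CID is harmless, so this can run on a timer.
  await ipfs.pin.add(cid);
  console.log("pinned", cid);
}
```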
It doesn't matter where IPFS nodes are running (home computer, cloud, etc.) or where they store the service's audio content (local filesystem, database, etc.), as long as "storage donors" are pinning whatever they agreed to pin.
See the last paragraph in this section of the IPFS documentation. The gist is that anything one shares in the IPFS network is publicly available, but files can be encrypted before sharing.
There is a warning in the article above that encryption gets weaker over time, both through the invention of newer methods of breaking it and through improvements in computer hardware; this is not a concern for us, though, as we are not sharing sensitive data and are only doing this to comply with 17 U.S. Code §121 (see this and this).
Because:
A scheme will need to be drawn up for how to use public key encryption between participating organizations (and tech-savvy subscribers).
This is similar to public key encryption when securing email communications, but with the added twist of encrypting a "message" to be sent to multiple recipients. This is not a novel problem (see this or this thread; there is also a sketch after this list), but I only know the basics of cryptography.
There will be a technical overhead for eligible parties (again, organizations and adventurous subscribers) to decrypt the files.
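The textbook approach to the multiple-recipients twist is hybrid encryption: encrypt the file once with a random symmetric key, then encrypt that small key separately for each recipient's public key. A hedged sketch with Node's built-in `crypto` module (the recipient list and function name are made up):

```typescript
import { createCipheriv, publicEncrypt, randomBytes, KeyObject } from "node:crypto";

// Hybrid encryption: one symmetric pass over the (large) audio file, plus
// one small public-key encryption of the file key per recipient.
function encryptForRecipients(
  plaintext: Buffer,
  recipients: Map<string, KeyObject>, // name -> RSA public key
) {
  const fileKey = randomBytes(32); // AES-256 key, used only for this file
  const iv = randomBytes(12);      // GCM nonce

  const cipher = createCipheriv("aes-256-gcm", fileKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  const authTag = cipher.getAuthTag();

  // Each eligible party gets the *file key* wrapped with their public key;
  // the bulky ciphertext itself only needs to be shared once (e.g., over IPFS).
  const wrappedKeys = new Map<string, Buffer>();
  for (const [name, publicKey] of recipients) {
    wrappedKeys.set(name, publicEncrypt(publicKey, fileKey));
  }

  return { ciphertext, iv, authTag, wrappedKeys };
}
```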
Some decryption methods that come to mind:
on request, right before playback, the file(s) are decrypted into temporary storage, which gets purged after the file has finished playing (either periodically or right away)
all the copyrighted files are decrypted onto private storage and played back from there
Both methods increase complexity significantly, and without ready-made tools (either by us or by someone else), most of this content-sharing idea is stillborn. Also, the second method is perhaps simpler, but it also means that twice as much storage space is needed (each encrypted file will have to be kept to continue sharing it over IPFS, and the decrypted files will have to be stored in a separate place).
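For the first method, the decryption side could mirror the encryption sketch above; a simplified sketch (the playback hook is hypothetical):

```typescript
import { createDecipheriv, privateDecrypt, KeyObject } from "node:crypto";
import { writeFile, rm } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// First decryption method from the list above: decrypt into temporary
// storage right before playback, then purge the plaintext afterwards.
async function playDecrypted(
  ciphertext: Buffer,
  iv: Buffer,
  authTag: Buffer,
  wrappedKey: Buffer,    // this recipient's wrapped copy of the file key
  privateKey: KeyObject, // this recipient's RSA private key
  play: (path: string) => Promise<void>, // hypothetical playback hook
) {
  // Unwrap the file key, then decrypt and authenticate the audio file.
  const fileKey = privateDecrypt(privateKey, wrappedKey);
  const decipher = createDecipheriv("aes-256-gcm", fileKey, iv);
  decipher.setAuthTag(authTag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);

  const tmpPath = join(tmpdir(), `playback-${Date.now()}.mp3`);
  await writeFile(tmpPath, plaintext);
  try {
    await play(tmpPath);
  } finally {
    await rm(tmpPath, { force: true }); // purge right away
  }
}
```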
PROPOSAL / TODO: Use Tahoe-LAFS for copyrighted content. As far as I know, files / directories can be shared (with read / write permissions) with others, where the Tahoe-LAFS application takes care of the encryption automatically, and each party is responsible for keeping their end secure.
Some questions:
No issues here, and most of the workings are explained above.
Don't know about the specifics yet, but it can be done:
TODO: There is also diffuse (or the online version, diffuse.sh) that streams ... directly from IPFS? Look into this.
Nothing prevents volunteers from installing an IPFS node and sharing their recordings, but this should be coordinated (or even discouraged) so that they don't accidentally commit copyright violations; I haven't heard of any such precedents on IPFS, but still: no one should get into trouble because they are trying to help us.
Therefore, there should be clear guidelines on how to encrypt copyrighted files before sharing, but this already raises a lot of questions:
See all the questions / concerns at section "2.3 Optional encryption" above.
TODO: How can such a submission fit into the audio content data set? Can someone share files into someone else's directory? (I think the answer is yes, but will have to try it out.)
Submissions will need to be added to the pin-list(s) / pinset(s) to be coordinated. (That is, when people publicly share their recordings, we / other services need to be notified so that the content can be pinned and doesn't get lost when the user decides to delete the file or stop volunteering.)
TODO: Can directories be pinned? It would make things much simpler.
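From what I can tell, yes: adding files under a common path yields a single root directory CID, and pinning is recursive by default, so pinning that root keeps the whole directory out of GC's reach. A minimal sketch (file names made up, same `ipfs-http-client` assumptions as above):

```typescript
import { create } from "ipfs-http-client";
import { readFile } from "node:fs/promises";

const ipfs = create({ url: "http://127.0.0.1:5001" });

// Adding files under a common path (plus wrapWithDirectory) produces a
// root directory CID in addition to the per-file CIDs.
let rootCid;
for await (const entry of ipfs.addAll(
  [
    { path: "shows/episode-001.mp3", content: await readFile("episode-001.mp3") },
    { path: "shows/episode-002.mp3", content: await readFile("episode-002.mp3") },
  ],
  { wrapWithDirectory: true },
)) {
  console.log(entry.path, entry.cid.toString());
  if (entry.path === "") rootCid = entry.cid; // the wrapping directory itself
}

// Pinning is recursive by default, so this keeps every file in the
// directory (and the directory structure itself) from being collected.
if (rootCid) await ipfs.pin.add(rootCid);
```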
This means sticking to the original idea of creating web, mobile, and desktop applications for volunteers to submit their recordings, and we'll take care of sharing the recordings (and optional encryption) on the back-end (or on the client; e.g., there is JS-IPFS to be used in the browser, but also embeddable implementations for other purposes).
Access News content is currently scattered all over: OneDrive, Google Drive, internal server, TR2 production server. My OneDrive account is already running out of space, and the organization of files leaves much to be desired everywhere.
Requirements for a solution to be considered:
reduce costs There are "cold storage" tiers offered by most major cloud vendors that make storage costs very low, but a downside is that the data is not readily accessible, and it would also be hard to share with other organizations.
vendor-agnostic If we need to migrate to another cloud vendor, then terabytes of data will need to be copied, but this is also an issue when it comes to:
fault tolerance Probably every cloud vendor has disaster recovery solutions in place, but setting them up means extra costs and would exacerbate future migrations to other vendors if necessary. Also, Access News media is neither highly sensitive information, nor is it a disaster if it gets lost, although it would be quite an inconvenience, so it would be nice to plan to keep it intact. The only criterion is:
optional encryption A large portion of the content is freely available (e.g., old-time radio shows) or should be (e.g., store sale ads, free newspaper and magazine articles), and thus these could be disseminated openly, even to folks who are not subscribers. Copyrighted material, on the other hand, needs to be encrypted, and access should only be provided to individuals with an eligible disability. (Legal foundations of reading services in the US: short, long.)
crowd-source-able As the majority of the content is in the public domain (and if optional encryption is possible), it should be possible for folks to volunteer as "storage donors".
human- and machine-friendly access to media in the content catalog One should be able to browse the catalog directly (like a file system, through a web interface, etc.) and play the media in any way (online stream in audio players that support it, from a website, on the command line, etc.). (Copyrighted content would of course need authentication.) This covers the human aspect, but the catalog should also be accessible through frontends, such as TR2, the website, mobile apps, etc., so items should (1) be streamable through the network and (2) have an access method that is widely supported (e.g., HTTP/S links are supported by FreeSWITCH; see the gateway sketch after this list).
handle volunteer submissions Can IPFS or Tahoe-LAFS (see below) handle this, either directly or indirectly?
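Regarding requirement 6: any public IPFS content is also reachable over plain HTTP/S through a gateway, either a node's own (port 8080 by default) or a public one, which would cover clients that only speak HTTP, such as FreeSWITCH. A sketch with a placeholder CID:

```typescript
// Any IPFS gateway can serve content by CID over plain HTTP/S.
const cid = "QmSomeAudioFileCid"; // placeholder for a real content identifier

const urls = [
  `http://127.0.0.1:8080/ipfs/${cid}`, // a local node's own gateway
  `https://ipfs.io/ipfs/${cid}`,       // a public gateway
];

// URLs like these are what could be handed to FreeSWITCH, a website's
// <audio> tag, mobile apps, command-line players, etc.
const response = await fetch(urls[1]);
console.log(response.status, response.headers.get("content-type"));
```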
Currently investigating IPFS and Tahoe-LAFS, as they both appear to satisfy the criteria above, and they can even be used together. (A somewhat off-topic thread on working with copyrighted content.)