jackmerrill opened 1 year ago
Data stored on a per-user basis should be encrypted. This is law and I'll make that more clear in the fedlex soon (dictating how user data should be handled by all persons hosting services and all that; a GDPR-esque policy, if you will.)
Data stored by users in things like media libraries can be encrypted, but doesn't need to be; it isn't user-identifiable by any means.
@cyckl Can you make a mockup for a basic interface for:
Going off that, do we want a basic API and/or website to control everything, as well as a CLI? Also, once the IdP is set up, I can implement that too.
The basic API and website can be held off for now; others can help implement those. The server administration stuff should still have CLI "manage.py"-esque settings. As long as those call functions, building APIs for that shouldn't be too hard.
IdP is/will be done via OIDC/OAuth2, so you should be able to have dummy logic for either. LDAP is also an option since we'll likely keep it for all intents and purposes, but it's best to use OAuth scopes and use the SSO service.
Okay cool.
I'll authenticate with SSO/OAuth/whatever in the style that Tailscale does it, if it's possible. I'll likely split the project into two, a daemon and a CLI.
Suggestions are appreciated.
Pretend Tailscale doesn't exist for all intents and purposes of development. Neither IdP nor SSO will ever be handled through it. At most, we'll move from GitHub to our SSO solution for authenticating with Tailscale. This is something I've considered for a while and have had scattered talks with the folks over there about enabling.
The split is likely necessary and sensible. You could also have those settings configurable with flags to the daemon, requiring restarts and downtime, which is a fine "issue".
I only considered a daemon for detached processes (i.e. `docker run -d`), but honestly we could be just fine without it. I'll make sure the program is dockerized, in case we want to use that.
Dockerised is fine. Container plans are scattered at best right now; I'll formulate some concrete plans to propose sometime in the month. For ref., see #4.
Alright, I'll start looking into this today. I'll update this issue with any findings.
After some research, here's what I'm likely going to end up implementing:
`server1.example.com` joins the mesh at `mesh.example.com`. `mesh.example.com` then tells all other clients of `server1.example.com`.
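A minimal sketch of that join-and-announce flow, assuming a central coordinator; the `MeshRegistry` class and its method names are purely illustrative, not existing code:

```python
# Hypothetical sketch of the join/announce flow: a server joins the mesh
# coordinator (standing in for mesh.example.com), which then tells every
# existing member about the newcomer. Names here are illustrative only.

class MeshRegistry:
    """Stands in for mesh.example.com, the mesh coordinator."""

    def __init__(self):
        self.members = []          # hostnames currently in the mesh
        self.announcements = []    # (recipient, new_member) pairs "sent"

    def join(self, hostname):
        # Announce the newcomer to every existing member...
        for member in self.members:
            self.announcements.append((member, hostname))
        # ...then record it as part of the mesh.
        self.members.append(hostname)

mesh = MeshRegistry()
mesh.join("server1.example.com")
mesh.join("server2.example.com")
# server1 is told about server2; server2 joined last, so nobody announces to it
```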
It would be worthwhile to have some sort of versioning system, allowing quicker rollbacks both per-user and media-wise.
Checking in to see how development is going @jackmerrill.
@cyckl has worked on sites here and there; he'll likely put a Figma file here soon.
so RClone does literally what we want: https://github.com/rclone/rclone
with this discovery i've come to the current conclusion:
thoughts? @doamatto @cyckl
What types of data are we mirroring, exactly? Is this for our media libraries? Something like a CDN? Static mirrors for general downloads?
The reason I ask is because it's pretty important in order to build some type of web UI and it'll influence how well RClone serves our needs.
If it's something like our full media libraries, then we'd essentially be duplicating terabytes of data for every mirror.
@cyckl I'd assume it's for any kind of data, so media included.
Worst case I can make a wrapper or something for RClone, but there is also this: https://rclone.org/gui/
We can also control RClone via an API: https://rclone.org/rc/
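As a sketch of what driving that API could look like: rclone's remote control is a JSON-over-HTTP interface served by `rclone rcd`, where each command is a POST to `/<command>` (per https://rclone.org/rc/). The helper below only builds the request; the default address and the choice of `core/stats` are just examples, not settled choices.

```python
import json
import urllib.request

def rc_request(command, params=None, addr="http://localhost:5572"):
    """Build a POST request for rclone's remote-control API.

    Assumes a daemon started with `rclone rcd`; every rc command is a
    JSON POST to http://host:port/<command>.
    """
    body = json.dumps(params or {}).encode()
    return urllib.request.Request(
        f"{addr}/{command}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Against a live daemon this would be sent with urllib.request.urlopen(...)
req = rc_request("core/stats")
```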
Started experimenting with Rclone; it's really easy to set up, especially with the web interface.
TL;DR we want to create however many remotes we want, then pipe those into a union remote, then pipe that into an encryption remote. For existing files, we can just copy the files from the remote to the encrypted remote.
Encrypted files are only visible to those that have the right keys. I'd assume sharing those keys would allow for multiple users to see the same files.
> What types of data are we mirroring, exactly? Is this for our media libraries? Something like a CDN? Static mirrors for general downloads?

In layman's terms, any files needed to run services or hosted by a user. That would include our media libraries, The Libraries, the Fedlex, et al.
A web UI is pertinent, as well as access through conventional file sharing protocols (Samba, FTPS, AFP, to name a few contenders). The web UI is the only way a user could reliably configure their data settings, bar sec.gouv.fa. These settings should be kept with the fileserver, however.
RClone, to my knowledge, is meant for very small-scale solutions like syncing a Google Drive folder offline, backing up a NAS, et al. As far as I'm aware, it isn't intended to be used as the file server, the mirror, and the interface middleware. I'm not really sure it can fit the needs accurately and fully.
> We can also control RClone via an API

It seems this only supports basic authentication (username and password), not OIDC/token-based auth or a directory like AD or LDAP. It wouldn't be sufficient.
> Worst case I can make a wrapper or something for RClone, but there is also this: https://rclone.org/gui/

Going off the images I'm seeing of it online, it looks like it's intended for single-user use and for what I alluded to before: lots of small tasks meant for individuals, far from scale.
> Encrypted files are only visible to those that have the right keys. I'd assume sharing those keys would allow for multiple users to see the same files.

We'll set up 1Password or an in-house [Vault]warden to store important tokens and keys for staff in due time; this is a non-issue, in theory.
@doamatto Does this help at all? https://forum.rclone.org/t/rclone-on-large-scale-linux-desktops-azuread-auth/18301
I should have been more specific about the authentication issue via the API: it should be token-based (i.e. OIDC) rather than relying on expanding the scope of an already complicated OAuth server. Ideally, we shouldn't need to develop additions for LDAP or AD authentication. We also shouldn't be using OneDrive at any step in this, unless I missed something.
> We also shouldn't be using OneDrive at any step in this, unless I missed something.

Was only an example.
> it should be token-based

I'll look into this.
RClone WebDAV has OIDC authentication. Also I found this: https://github.com/HarryKodden/encrypted-mounting
OR could we have a main server that everyone connects to, with whatever method we inevitably choose, and that main server connects to all the other servers via RClone?
Are there any mounting methods (I forget the term; file sharing protocols like Samba, SMB, WebDAV, etc.) that support OIDC/LDAP/OAuth/et al. authentication?
No file sharing protocol to my knowledge, bar proprietary ones like "Google Drive" or "OneDrive", supports anything other than basic authentication. It's the web interface where basic authentication is insufficient. Every official service, frankly any service in Farer, should use SSO, not basic authentication. This is especially the case where we have a say in it, like a website or a web UI. An API can use SSO, since a client ID and client secret can be used to obtain a token for token-based authentication backed by SSO.
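For the API case, here's a sketch of the standard OAuth2 client-credentials exchange (RFC 6749 §4.4) that would yield such a token. The token URL, client ID, and secret below are placeholders; our SSO server's actual endpoints may differ.

```python
import base64
import urllib.parse
import urllib.request

def token_request(token_url, client_id, client_secret):
    """Build a standard OAuth2 client-credentials token request.

    The client authenticates with HTTP Basic auth and asks the token
    endpoint for an access token (RFC 6749 section 4.4). The endpoint
    URL is whatever the SSO server exposes; nothing here is specific
    to any one implementation.
    """
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {creds}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )

# Placeholder endpoint and credentials, for illustration only
req = token_request("https://sso.example.fa/oauth2/token", "my-client", "my-secret")
```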
> No file sharing protocol to my knowledge, bar proprietary ones like "Google Drive" or "OneDrive", supports anything other than basic authentication.

Well, as I stated before, WebDAV has OIDC support apparently.
Regardless, how idiot-proof do we want / need to make this? Could a local web UI work, or does it have to be like https://drive.fa? Would a dummy username/password system work, where a user signs into the web UI, then gets some generated credentials to use for authentication with whatever their OS supports?
Any other ideas?
Logging into WebDAV does for Rsync. As for actual clients and servers, I don't think I've ever seen such a thing for any part of the WebDAV standard (including Cal and Card). App-based passwords are fine, and likely necessary.
A local web UI = an Electron app, which doesn't make sense when we can make things native ourselves.
I'm just failing to see how Rsync can be used as the web UI, the server, and the syncing method. Perhaps a flowchart is in order?
- Web UI: for setup and management.
- Server: single source. For example, we only set up one (or a few) main node[s] that connects to the storage servers, then mounts that to something all users can connect to (NFS, etc.).
- Syncing: see above.
So this is meant for just mirroring data, not the hosting of or access to data.
In which case, if it's only to be used by staff, Rclone is probably what we want.
If this is something user facing, we can still use Rclone but we'd need to work around some stuff.
It wouldn't be exclusive to staff; individuals can help expand the storage we have and build resilience as they wish.
Would users be required to contact a staff member to add to the network? Would users be able to use their added storage, along with the entire storage network?
For example, if we did the Server / single source method, we'd just update the config(s) with the new server.
I'm really thinking the single source method is the best bet to have the auth method we want.
Users wouldn't need to, they would just follow the setup instructions and be done. Pooling storage that is encrypted and not having them hold any of the decryption keys means that there is virtually no security worry, per se.
A "single-source" is necessary for any kind of traditional replication.
So is this for pooling storage (i.e. 20gb on Server A and 40gb on Server B turns into 60gb on Server C), or for mirroring data around the world?
Both, ideally and realistically. If there's more storage that can be used by users, it should be made available. And, naturally, mirroring is a general form of backup that is necessary.
So how's this: we have two "folders", `mirror` and `pool`; anything put in `mirror` will be mirrored across all available (storage-wise, primarily) servers, and anything put in `pool` is pooled across all available servers.
But how would a server know if it can mirror to other servers? Or if there are conflicts between mirrored data from another server and pooled data from the current server? It seems fragile and out of scope for rsync. If you think a POC is viable, then go right ahead; I just don't see it personally.
> But how would a server know if it can mirror to other servers?

It'll find out the hard way, I guess. If it can't mirror (storage, down, etc.) then it'll simply fail.
> Or if there are conflicts between mirrored data from another server and pooled data from the current server?

Maybe we only have one R/W mirror as the authoritative server? Each server will have two directories in their data share, `/mirror` and `/pool`, so there shouldn't be any conflicts between mirrored data and pooled data. All pooled data will be in a user's own folder, so no conflicts should arise between other users' data. Of course, all data, at least pooled data, will be encrypted.
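A toy sketch of that two-namespace layout, just to illustrate why mirrored and pooled data can't collide; the function and the exact directory layout are hypothetical:

```python
# Hypothetical sketch: every path in a server's data share lives under
# either /mirror or /pool/<user>, so mirrored data and pooled data occupy
# disjoint namespaces and can never conflict with each other.

def classify(path):
    """Return ('mirror', rest) or ('pool', user, rest) for a share path."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("empty path")
    if parts[0] == "mirror":
        return ("mirror", "/".join(parts[1:]))
    if parts[0] == "pool":
        # Pooled data always sits inside a user's own folder.
        if len(parts) < 2:
            raise ValueError("pooled data must live under a user folder")
        return ("pool", parts[1], "/".join(parts[2:]))
    raise ValueError(f"path outside /mirror and /pool: {path}")

assert classify("/mirror/media/film.mkv") == ("mirror", "media/film.mkv")
assert classify("/pool/jack/docs/id.pdf") == ("pool", "jack", "docs/id.pdf")
```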
I'll see if I can work out a PoC.
> then it'll simply fail.

This is VERY dangerous and not acceptable unless it fails gracefully. If I'm moving a copy of a vital document, say a nationality document, but it only copies in part, with both servers crapping the toilet, there is a significant process to re-obtain that digital copy of the document. Naturally an extreme scenario, but all data should be treated with the care we would give the most vital documents.
https://github.com/LLEB-ME/rsync-poc
Can I get a general timeframe for a PoC? And, as a side note, please use Ansible for the "build" process so that it is simple to deploy internally and relatively easy for end-users to deploy to their own servers, as well.
I think there was a miscommunication: I'm using rclone, not rsync.
Anyways here's the PoC with rclone:
```ini
[test-on-vps]
type = sftp
host = ...
pass = ...
user = root
shell_type = unix
md5sum_command = md5sum
sha1sum_command = sha1sum

[drive-test]
type = drive
client_id = ...
client_secret = ...
token = ...
team_drive =

[mirror]
type = local

[pool]
type = union
upstreams = drive-test:/ test-on-vps:/

[test]
type = combine
upstreams = mirror=mirror:/ pool=pool-enc:/

[pool-enc]
type = crypt
password = xxxxxx
remote = pool:/rclone/jack
```
`mirror` would be synced with `rclone sync` on a cron job.
Regarding the failing thing, I really have no idea how rclone works for that, so some more research is due for that.
We've had some issues with Ceph and, by extension, Rook in the past, but it's an avenue most everyone is comfortable with. Do you think you can make a POC and see which avenue is the best for moving forward?
I have no idea how I'd make a POC with something like this. It might just be a case of "try and see".
Important notes and takeaways from discussion in #int-general the other day:
We shouldn't use a chatbot for ideas
i was testing okay
IPFS is a generally okay solution
Right, I wasn't wanting to use IPFS if possible. It seems too heavy, hence why Ceph, SeaFile, et al. are good options.
If we're gonna use Ceph: https://www.marksei.com/how-to-install-ceph-with-ceph-ansible/
In the inevitable case that we expand to more users around the country (and world), we'd want to utilize donated storage to have download mirrors closer to end-users.
I am proposing a data mirroring script that syncs (to the best ability) all data between storage servers.
What to consider:
Repository: https://github.com/LLEB-ME/fileserver