RFC: Data Mirroring Script

jackmerrill commented 1 year ago

In the inevitable case that we expand to more users around the country (and world), we'd want to utilize donated storage to have download mirrors closer to end-users.

I am proposing a data mirroring script that syncs (to the best ability) all data between storage servers.

What to consider:

Users will have differing storage amounts
- To mitigate this, allow users to set the max storage for their mirror
Users may want to change where the data is stored
- To mitigate this, allow users to set where the mirror data is stored
How would data be synced? Is there a "master" copy, or can all users modify the data?
Will data be encrypted, or is all data available to everyone?

Repository: https://github.com/LLEB-ME/fileserver

doamatto commented 1 year ago

Data stored on a per-user basis should be encrypted. This is law and I'll make that more clear in the fedlex soon (dictating how user data should be handled by all persons hosting services and all that; a GDPR-esque policy, if you will.)

Data stored by users into things like media libraries can be encrypted, but do not need to be. They are not user identifiable by any means.

doamatto commented 1 year ago

@cyckl Can you make a mockup for a basic interface for:

simple file ops (seeing whats stored in the media folder, seeing what you have stored, downloading per file online)
managing the server, if accessed by proper user (establishing region based on future policy— see Notion, setting max mirror size, other basic settings)
managing user settings (establishing home region for data, optting-out of storage regions/servers, et al.; account settings are saved for sso.gouv.fa or similar)

jackmerrill commented 1 year ago

Going off that, do we want a basic API and/or website to control everything, as well as a CLI? Also, once the IdP is set up, I can implement that too.

doamatto commented 1 year ago

The basic API and website can held off for now, others can help implement those; the server administration stuff should still have CLI "manage.py"-esque settings. As long as those call functions, building APIs for that shouldn't be too hard.

IdP is/will be done via OIDC/OAUTH2, so you should be able to have dummy logic for either or. LDAP is also an option since we'll likely keep it for all intensive purposes, but it's best to use OAUTH scopes and use the SSO service.

jackmerrill commented 1 year ago

Okay cool.

I'll authenticate with SSO/OAuth/whatever in the style that Tailscale does it, if it's possible. I'll likely split the project into two, a daemon and a CLI.

Suggestions are appreciated.

doamatto commented 1 year ago

Pretend Tailscale doesn't exist for all intensive purposes of development. IdP, nor SSO, will ever be handled through it. At most, we'll move from GitHub to our SSO solution for authenticating with Tailscale. This is something I've considered for awhile and have had scattered talks with the folks over there about enabling.

The split is likely necessary and sensible. You could also have those settings configurable with flags to the daemon, requiring restarts and downtime, which is a fine "issue".

jackmerrill commented 1 year ago

I only considered a daemon for detached processes (i.e. docker run -d) but honestly we could be just fine without it. I'll make sure the program is dockerized, in case we want to use that.

doamatto commented 1 year ago

Dockerised is fine. Container plans are scattered at best right now; I'll formulate some concrete plans to propose sometime in the month. For ref., see #4.

jackmerrill commented 1 year ago

Alright, I'll start looking into this today. I'll update this issue with any findings.

jackmerrill commented 1 year ago

After some research, here's what I'm likely going to end up implementing:

rsync transfer layer
gin backend / api server
some custom client
each instance is both a server and client, with the primary instance being authoritative.
- this creates a sort of "mesh"
- with this, maybe all the servers could pass around their details?
- example, server1.example.com joins the mesh at mesh.example.com
- mesh.example.com then tells all other clients of server1.example.com
- other clients could initialize rsync connections?

doamatto commented 1 year ago

It is notable to have some sort of versioning system. Quicker rollbacks for both per-user and media-wise.

doamatto commented 1 year ago

Checking in to see how development is going @jackmerrill.

@cyckl has worked on sites here and there; he'll likely put a Figma file here soon.

jackmerrill commented 1 year ago

so RClone does literally what we want: https://github.com/rclone/rclone

jackmerrill commented 1 year ago

with this discovery i've come to the current conclusion:

we use RClone to manage everything
share instructions on how to set up RClone
backends:
- Crypt (encryption)
- Union (distributed)
- FUSE (virutal drive, optional)
- Compress (compression, optional)
multiple storage backends with RClone
Docker only? Or however storage can be limited

thoughts? @doamatto @cyckl

cyckl commented 1 year ago

What types of data are we mirroring, exactly? Is this for our media libraries? Something like a CDN? Static mirrors for general downloads?

The reason I ask is because it's pretty important in order to build some type of web UI and it'll influence how well RClone serves our needs.

If it's something like our full media libraries, then we'd essentially be duplicating terabytes of data for every mirror.

jackmerrill commented 1 year ago

@cyckl I'd assume its for any kind of data, so media included.

Worst case I can make a wrapper or something for RClone, but there is also this: https://rclone.org/gui/

jackmerrill commented 1 year ago

We can also control RClone via an API: https://rclone.org/rc/

jackmerrill commented 1 year ago

Started experimenting with Rclone, its really easy to set up, especially with the web interface.

TL;DR we want to create however many remotes we want, then pipe those into a union remote, then pipe that into an encryption remote. For existing files, we can just copy the files from the remote to the encrypted remote.

Encrypted files are only visible to those that have the right keys. I'd assume sharing those keys would allow for multiple users to see the same files.

doamatto commented 1 year ago

What types of data are we mirroring, exactly? Is this for our media libraries? Something like a CDN? Static mirrors for general downloads? In lamest terms, any files needed to run services or hosted by a user. That would include our media libraries, The Libraries, the Fedlex, et al.

A web UI is pertinent, as well as access through conventional file sharing protocols (Samba, FTPS, AFP, to name a few contenders). The web UI is the only way a user could reliably configure their data settings, bar sec.gouv.fa. These settings should be kept with the fileserver, however.

RClone, to my knowledge, is meant for very small scale solutions like syncing a Google Drive folder offline, backing up a NAS, et al. As far as I'm aware, it it isn't intended to be used as the file server, the mirror, and the interface middleware. I'm not really sure if it can fit the needs accurately and fully.

We can also control RClone via an API

It seems this is only basic authentication (username and password), not OIDC/token-based or a database like AD or LDAP. It wouldn't be sufficient.

Worst case I can make a wrapper or something for RClone, but there is also this: https://rclone.org/gui/

Going off the images I'm seeing of it online, it looks like it's intended as a single-user mode and is intended for what I eluded to before: lots of small tasks meant for individuals and far from scale.

doamatto commented 1 year ago

Encrypted files are only visible to those that have the right keys. I'd assume sharing those keys would allow for multiple users to see the same files.

We'll set up 1Password or an in-house [Vault]warden to store important tokens and keys for staff in due time; this is a non-issue, in theory.

jackmerrill commented 1 year ago

@doamatto Does this help at all? https://forum.rclone.org/t/rclone-on-large-scale-linux-desktops-azuread-auth/18301

doamatto commented 1 year ago

I should have been more specific in terms of the authentication issue via the API: it should be token-based (ie. OIDC) before we rely on having to expand the scope of an already complicated OAUTH server. Ideally, we shouldn't need to develop additions for LDAP or AD authentication. We also shouldn't be using OneDrive at any step in this, unless I missed something.

jackmerrill commented 1 year ago

We also shouldn't be using OneDrive at any step in this, unless I missed something.

Was only an example.

it should be token-based

I'll look into this.

jackmerrill commented 1 year ago

RClone WebDAV has OIDC authentication. Also I found this: https://github.com/HarryKodden/encrypted-mounting

OR could we have a main server that everyone connects to, with whatever method we inevitably choose, and that main server connects to all the other servers via RClone?

Are there any mounting methods (i forgot the term, but like Samba, SMB, WebDAV, etc) that support OIDC/LDAP/OAuth/et al. authentication?

doamatto commented 1 year ago

No file sharing protocol to my knowledge, bar proprietary ones like "Google Drive" or "OneDrive", support anything other than basic authentication. It's the web interface where basic authentication is insufficient. Every official service, frankly any service in Farer, should use SSO, not basic authentication. This is especially the case where we have a say in it, like a website, like a web UI. An API can use SSO since a client ID and client secret can be used to gather a token for token-based authentication based on SSO.

jackmerrill commented 1 year ago

No file sharing protocol to my knowledge, bar proprietary ones like "Google Drive" or "OneDrive", support anything other than basic authentication.

Well, as I stated before, WebDAV has OIDC support apparently.

Regardless, how idiot-proof do we want / need to make this? Could a local Web UI work, or does it have to be like https://drive.fa? Would a dummy username/password system work? Where a user signs into the web ui, then they can get some generated credentials to use for authentication with whatever their OS supports?

Any other ideas?

doamatto commented 1 year ago

Logging into WebDAV does for Rsync. As for actual clients and servers, I don't think I've ever seen such a thing for any part of the WebDAV standard (including Cal and Card). App-based passwords are fine, and likely necessary.

A local web UI = an Electron app, which doesn't make sense when we can make things native ourselves.

I'm just failing to see how Rsync can be used as the web UI, the server, and the syncing method. Perhaps a flowchart is in order?

jackmerrill commented 1 year ago

Web UI: For setup and management Server: Single source. For example, we only set up one (or a few) main node[s], and connects to the storage servers, then it mounts that to something all users can connect to (NFS, etc) Syncing: See above.

doamatto commented 1 year ago

So this is meant for just mirroring data, not the hosting of or access to data.

jackmerrill commented 1 year ago

Which in that case, if it's only to be used by staff, Rclone is probably what we want.

If this is something user facing, we can still use Rclone but we'd need to work around some stuff.

doamatto commented 1 year ago

It wouldn't be exclusive to staff— individuals can help expand the storage we have and build resilience at their want.

jackmerrill commented 1 year ago

Would users be required to contact a staff member to add to the network? Would users be able to use their added storage, along with the entire storage network?

For example, if we did the Server / single source method, we'd just update the config(s) with the new server.

I'm really thinking the single source method is the best bet to have the auth method we want.

doamatto commented 1 year ago

Users wouldn't need to, they would just follow the setup instructions and be done. Pooling storage that is encrypted and not having them hold any of the decryption keys means that there is virtually no security worry, per se.

A "single-source" is necessary for any kind of traditional replication.

jackmerrill commented 1 year ago

So is this for pooling storage (i.e. 20gb on Server A and 40gb on Server B turns into 60gb on Server C), or for mirroring data around the world?

doamatto commented 1 year ago

Both, ideally and realistically. If there's more storage that can be used by users, it should be made available. And, naturally, mirroring is a general form of backup that is necessary.

jackmerrill commented 1 year ago

So hows this:

We have two "folders", mirror and pool, anything put in mirror will be mirrored across all available (storage wise, primarily) servers, and anything put in pool is pooled across all available servers.

doamatto commented 1 year ago

But how would a server know if it can mirror to other services? Or if there are conflicts between mirrored data from another server and pooled data from the current server? It seems fragile and out of scope of rsync. If you think a POC is viable, then go right ahead— I just don't see it personally.

jackmerrill commented 1 year ago

But how would a server know if it can mirror to other services

It'll find out the hard way I guess. If it can't mirror (storage, down, etc) then it'll simply fail.

Or if there are conflicts between mirrored data from another server and pooled data from the current server?

Maybe we only have one R/W mirror as the authoritative server? Each server will have two directories in their data share- /mirror and /pool, so there shouldn't be any conflicts between mirrored data and pooled data. All pooled data will be in a user's own folder, so no conflicts should arise between other users' data. Of course, all data, at least pooled data, will be encrypted.

I'll see if I can work out a PoC.

doamatto commented 1 year ago

then it'll simply fail.

This is VERY dangerous and not something acceptable unless it does so gracefully. If I'm moving a copy of a vital document, say a nationality document, but it only copies in part, with both servers crapping the toilet, there is a significant process to re-obtain that digital copy of the document. Naturally an extreme scenario, but all data should be treated with the care that we would for utmost vital documents.

doamatto commented 1 year ago

https://github.com/LLEB-ME/rsync-poc

Can I get a general timeframe for a PoC? And, as a side note, please use Ansible for the "build" process so that it is simple to deploy internally and relatively easy for end-users to deploy to their own servers, as well.

jackmerrill commented 1 year ago

I think there was a miscommunication- I'm using rclone not rsync.

Anyways here's the PoC with rclone:

[test-on-vps]
type = sftp
host = ...
pass = ...
user = root
shell_type = unix
md5sum_command = md5sum
sha1sum_command = sha1sum

[drive-test]
type = drive
client_id = ...
client_secret = ...
token = ...
team_drive = 

[mirror]
type = local

[pool]
type = union
upstreams = drive-test:/ test-on-vps:/

[test]
type = combine
upstreams = mirror=mirror:/ pool=pool-enc:/

[pool-enc]
type = crypt
password = xxxxxx
remote = pool:/rclone/jack

mirror would be synced with rclone sync on a cron job.

Regarding the failing thing, I really have no idea how rclone works for that, so some more research is due for that.

jackmerrill commented 1 year ago

After lots of internal conversations, we have a few options:

I am aiming towards Ceph, as it has LDAP/SAML2.0 authentication built-in and can do everything we want.

doamatto commented 1 year ago

We've had some issues with Ceph and, by extension, Rook in the past, but it's an avenue that most everyone is comfortable with. Do you think you can a POC and see which avenue is the best for moving forward?

jackmerrill commented 1 year ago

I have no idea how I'd make a POC with something like this? It might just be a case of "try and see".

doamatto commented 1 year ago

Important notes and takeaways from discussion in #int-general the other day:

We shouldn't use a chatbot for ideas (cough cough Jack)
IPFS is a generally okay solution, but its pitfalls haven't been explored enough to be able to confidently use it

jackmerrill commented 1 year ago

We shouldn't use a chatbot for ideas

i was testing okay

IPFS is a generally okay solution

Right, I wasn't wanting to use IPFS if possible. Seems to heavy, hence why Ceph, SeaFile, et. al are good options.

jackmerrill commented 1 year ago

If we're gonna use Ceph: https://www.marksei.com/how-to-install-ceph-with-ceph-ansible/

LLEB-ME / gouv.fa

RFC: Data Mirroring Script #26