backube / scribe

Asynchronous data replication for Kubernetes CSI storage
https://scribe-replication.readthedocs.io
GNU Affero General Public License v3.0

Demo: sync data into kube #65

Closed JohnStrunk closed 3 years ago

JohnStrunk commented 3 years ago

Describe the feature you'd like to have.
We should put together a demo showing how to use Scribe to sync data into a kube cluster.

What is the value to the end user? (why is it a priority?)

How will we know we have a good solution? (acceptance criteria)

Additional context

cooktheryan commented 3 years ago

With this, are we expecting the external storage to be a StorageClass within a cluster or completely external? @JohnStrunk @screeley44

screeley44 commented 3 years ago

I think external, to show how Scribe can help get your data into a kube environment

JohnStrunk commented 3 years ago

Completely external.

Consider: The IT department has a project to move application X from their legacy infrastructure into their shiny new Kubernetes environment.

We should be able to run periodic syncs (for staging/testing) prior to the final switchover. (I'm thinking of a cron entry on the external infra to drive this)
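
As a rough illustration of that cron-driven approach (the script name, path, and schedule below are placeholders, not part of Scribe), the entry on the external host might look like:

# Hypothetical crontab entry on the legacy host: stage a sync nightly at 02:00
0 2 * * *  /usr/local/bin/sync-appX-to-kube.sh >> /var/log/appX-sync.log 2>&1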

cooktheryan commented 3 years ago

Potential spec I am thinking of, @backube/scribemaintainers:

---
apiVersion: scribe.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-source
  namespace: source
spec:
  trigger:
    schedule: "*/3 * * * *"
  external:
    address: my.host.com
    sshKey: scribe-rsync-dest-src-database-destination
    storageSecret: secret  # optional TLS values
    sourceType: gluster
    path: /brick1
    storageAddress: xxx.xxx.xxx.xxx

---
apiVersion: scribe.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-source
  namespace: source
spec:
  sourcePVC: mysql-pv-claim
  trigger:
    schedule: "*/3 * * * *"
  external:
    address: my.host.com
    sshKey: scribe-rsync-dest-src-database-destination
    storageSecret: secret  # ceph.conf + keyring
    sourceType: cephrbd  # or cephfs
    storageAddress: xxx.xxx.xxx.xxx
    path: /cephrbd

# Stretch Goal
---
apiVersion: scribe.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-source
  namespace: source
spec:
  trigger:
    schedule: "*/3 * * * *"
  external:
    sshKeys: scribe-rsync-dest-src-database-destination
    storageAddress: xxx.xxx.xxx.xxx
    sourceType: SSH
    path: /var/www/html
    address: my.host.com

cooktheryan commented 3 years ago

For the destination, we most likely could get away with almost all of the same parameters that rsync operates with today.

I initially thought about having the source perform all of the work, but I worry it gets away from our current source and destination models and would potentially require some code rework.
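
For reference, a destination using today's rsync parameters might look something like this (a rough sketch based on the existing rsync fields; names, namespace, and capacity are placeholders):

---
apiVersion: scribe.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: database-destination
  namespace: dest
spec:
  rsync:
    serviceType: LoadBalancer   # expose an address the external host can reach
    copyMethod: Snapshot        # keep each completed sync as a point-in-time copy
    capacity: 10Gi
    accessModes: [ReadWriteOnce]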

JohnStrunk commented 3 years ago

I was hoping this wouldn't require any changes to the CR or operator... We could have a script/binary that runs on the external infrastructure and plays the role of the Source or Destination. Since it's just rsync over an ssh connection, both are commonly available across most platforms. Imagine a script like:

./scribe-source --source /my/local/data --destination elb.cluster.com:22 --local-key my-ssh-key --remote-key other-key.pub

... that would connect to the Service created by a ReplicationDestination. This script could be triggered via at or cron to create periodic syncs. The local/remote keys would need to agree with the secretRef in the corresponding ReplicationDestination (either autogenerated or manually created would both work).
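
Under the hood, such a wrapper would presumably boil down to an rsync invocation over ssh. A minimal sketch, assuming the hypothetical flags above and that the destination side accepts rsync into /data (paths, user, and port are placeholders):

#!/bin/sh
# Hypothetical body of a scribe-source wrapper: push local data to the
# address exposed by the ReplicationDestination's Service.
SRC=/my/local/data
DEST_HOST=elb.cluster.com
DEST_PORT=22
KEY=my-ssh-key

rsync -aAXhH --delete \
  -e "ssh -i ${KEY} -p ${DEST_PORT} -o StrictHostKeyChecking=no" \
  "${SRC}/" "root@${DEST_HOST}:/data/"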


Moving data out may be even easier given the ReplicationSource for rsync. Assuming the external system has an ssh server:

apiVersion: scribe.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: replicationsource-sample
spec:
  sourcePVC: pvcname
  trigger:
    schedule: "0 * * * *"  # hourly
  rsync:
    copyMethod: Clone
    sshKeys: secretRef
    address: my.external-system.com  # the external host that mounts the storage
    port: 22  # port that runs sshd
    sshUser: myusername  # username to use when connecting to the remote system

The above is generic, so we don't need to care what storage we're migrating from/to.
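
On the external side, the only prerequisite would presumably be a reachable sshd (plus rsync installed) with the source's public key authorized for myusername, along the lines of:

# On my.external-system.com, as myusername (source.pub is a placeholder;
# the actual public key would correspond to the sshKeys Secret above)
cat source.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys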

cooktheryan commented 3 years ago

The binary would be interesting.

@JohnStrunk with the YAML above, though, how would we handle landing the data and retaining it, given the current way items are cleaned up after the source run?

JohnStrunk commented 3 years ago

> @JohnStrunk with the YAML above, though, how would we handle landing the data and retaining it, given the current way items are cleaned up after the source run?

I'm not sure I understand... The Source would work just like it always does... clone => rsync => delete. The next iteration would do the same. Rsync has no problem diff-ing even though it's a different source PV.

If you're referring to the lack of snapshot ability on external systems, I think there's room for some improvement there.

cooktheryan commented 3 years ago

With the storage being external, a sourcePVC wouldn't exist, which would stop us from using Clone or Snapshot on the ReplicationSource side.

I do like the strategy of the binary, but I don't know how we can use the ReplicationSource, or whether we actually need it at all, since the ReplicationDestination would give us our ELB and create the PVC or Snapshot.