ferd / ReVault

ReVault is a peer-to-peer self-hosted file synchronization project.
GNU Lesser General Public License v3.0

Explore S3 Storage #36

Closed: ferd closed this issue 1 year ago

ferd commented 1 year ago

See https://github.com/ferd/ReVault/issues/35

Also fixes a weird performance issue: the directory diffing loaded deleted files on the first load, which got to be very slow on S3.

ferd commented 1 year ago

S3 Standard costs, as used in the estimates below: $0.005 per LIST request, $0.0004 per HEAD request.

Gotchas:

So with a directory of, say, 100 files, we would expect to have to do, for each sync:

  1. a scan
  2. a sync of all changed or updated files
  3. a final scan

The scan is likely the costliest part because it will be repeated every time and scales with the directory size.

This implies doing:

  1. a list of the directory ($0.005)
  2. a HEAD request per file to get the hash (100 × $0.0004 = $0.04)

By doing it twice, we're expecting roughly 10 cents per sync (2 × $0.045 = $0.09) even if it's a no-op.
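To make the arithmetic concrete, here is a back-of-the-envelope sketch using the per-request prices quoted above; the module and function names are purely illustrative:

```erlang
%% Illustrative only: estimate the request cost of a no-op sync,
%% using the per-request prices quoted above.
-module(s3_cost).
-export([noop_sync/1]).

-define(LIST_COST, 0.005).   %% one LIST of the directory
-define(HEAD_COST, 0.0004).  %% one HEAD per file to fetch the hash

%% A sync does two scans: one before and one after.
noop_sync(NumFiles) ->
    ScanCost = ?LIST_COST + NumFiles * ?HEAD_COST,
    2 * ScanCost.
```

Calling `s3_cost:noop_sync(100)` yields `0.09`, i.e. roughly 10 cents even when nothing changed.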

Specific optimizations could be done, but they would alter the workflow.

The most promising of these could further cut costs almost tenfold (for 100 files; it's less for fewer files and more for more files), but it requires restructuring the file abstractions used for the scan, which is part of the core tracking mechanism.

Without it, storing files on S3 is unlikely to be economically worthwhile, so I'm adding it to the list.
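As a sketch of where a roughly tenfold saving could come from (an assumption on my part, not necessarily the exact approach meant above): if the per-file hash can be recovered from the LIST response itself, e.g. via the ETag, the scan needs a single LIST and no HEADs, i.e. $0.005 instead of $0.045 for 100 files. erlcloud is shown as a stand-in client; ReVault's actual S3 module may differ.

```erlang
%% Sketch: derive per-file hashes from a single LIST response instead of
%% one HEAD per file. The ETag is the MD5 of the content for single-part
%% uploads, so it can stand in for a content hash under that assumption.
-module(s3_scan).
-export([scan/1]).

scan(Bucket) ->
    Listing = erlcloud_s3:list_objects(Bucket),
    Contents = proplists:get_value(contents, Listing, []),
    %% one $0.005 LIST replaces $0.005 + N * $0.0004 (LIST + N HEADs)
    [{proplists:get_value(key, Obj), proplists:get_value(etag, Obj)}
     || Obj <- Contents].
```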

Finally, S3 lets us avoid the "write to a tmp file and move" step: the temporary file can stay local no matter what, so the s3 module abstraction could hide the rename operation from a temporary dir to the final location as a single upload operation.
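A minimal sketch of what that could look like; the module and callback names are hypothetical, not ReVault's actual backend interface:

```erlang
%% Sketch: the "rename from tmp to final location" becomes a single upload;
%% the temporary file never leaves the local disk. A real version would
%% stream or multipart-upload large files instead of reading them whole.
-module(s3_backend).
-export([rename/3]).

rename(Bucket, TmpPath, FinalKey) ->
    {ok, Data} = file:read_file(TmpPath),
    _ = erlcloud_s3:put_object(Bucket, FinalKey, Data),
    file:delete(TmpPath).
```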

ferd commented 1 year ago

Trip report:

With bigger directories, the bigger problem is that the data flow is fully asynchronous when sending from the fsm to the tls server handler, and from the tls client to the fsm.

This is generally not a problem when syncing, because in normal mode the local disk tends to be faster than the network over large file transfers. However, in S3 mode as I tested it (with two nodes on the same host), the slowest link suddenly becomes the network between the s3-handling node and AWS S3 itself.

```
[disk] --> [local A] ---ext network----> [local B] --> [disk]
                           |
                           '- bottleneck

[disk] --> [local A] ---local network--> [local B] --> [s3]
                                                    |
                                        bottleneck -'
```

This appears to cause all the reading from the local disk to be done at once and transferred over the loopback interface to the s3-handling node, which piles it all up in memory until it sort of freezes over and the node becomes unresponsive.

It's gonna be time to start scheduling file transfers and handling flow control better.
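One possible direction, shown as a bare-bones illustration rather than ReVault's actual fsm protocol: cap the number of unacknowledged files in flight, so the disk-reading side blocks once the S3-handling side falls behind.

```erlang
%% Credit/window-based flow control sketch: never have more than ?WINDOW
%% files in flight; the receiver acks each file once it is safely stored.
-module(xfer_flow).
-export([send_files/2]).

-define(WINDOW, 4). %% assumed window size

send_files(Peer, Files) ->
    loop(Peer, Files, 0).

loop(_Peer, [], 0) ->
    done;
loop(Peer, Files, InFlight) when Files =:= []; InFlight >= ?WINDOW ->
    %% window full (or draining): block until the peer confirms a write
    receive {Peer, ack} -> loop(Peer, Files, InFlight - 1) end;
loop(Peer, [F | Rest], InFlight) ->
    {ok, Data} = file:read_file(F),
    Peer ! {file, F, Data}, %% the peer replies {PeerPid, ack} when stored
    loop(Peer, Rest, InFlight + 1).
```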

ferd commented 1 year ago

Other gotcha to figure out: how to deal with file names that file systems accept but that make invalid S3 object keys: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
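One possible policy, sketched here purely as an assumption (nothing in this issue settles on it): percent-encode any byte outside a conservative safe set, so every file-system name maps onto a key AWS documents as safe.

```erlang
%% Sketch: map arbitrary file names onto S3-safe object keys by
%% percent-encoding every byte outside a conservative subset of the
%% characters AWS documents as safe (this even encodes some allowed
%% characters, e.g. '!' and '*', to stay simple and reversible).
-module(s3_keys).
-export([to_key/1]).

to_key(Name) ->
    Bin = unicode:characters_to_binary(Name),
    << <<(encode_byte(B))/binary>> || <<B>> <= Bin >>.

encode_byte(B) when B >= $a, B =< $z; B >= $A, B =< $Z; B >= $0, B =< $9 ->
    <<B>>;
encode_byte(B) when B =:= $-; B =:= $_; B =:= $.; B =:= $/ ->
    <<B>>;
encode_byte(B) ->
    list_to_binary(io_lib:format("%~2.16.0B", [B])).
```

For example, `s3_keys:to_key("a file?.txt")` returns `<<"a%20file%3F.txt">>`.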

ferd commented 1 year ago

Currently getting some awfully long times on manifest diffing when the S3 node is the one doing the work.


This happens only on the first sync, so there's a costly activity somewhere.
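One way to pin down that one-time cost, sketched with OTP's stock eprof profiler; `DiffPid` is an assumption standing in for whichever process runs the manifest diff:

```erlang
%% In a shell on the S3-handling node, before kicking off the first sync.
%% DiffPid is assumed to be the process doing the manifest diffing.
eprof:start(),
eprof:start_profiling([DiffPid]),
%% ... trigger the first sync here ...
eprof:stop_profiling(),
eprof:analyze(total). %% per-function call counts and time
```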