ferd / ReVault

ReVault is a peer-to-peer self-hosted file synchronization project.
GNU Lesser General Public License v3.0

Explore S3 Storage #36

Closed: ferd closed this issue 1 year ago

ferd commented 1 year ago

See https://github.com/ferd/ReVault/issues/35

Also fixes a weird performance issue: the directory diffing loaded deleted files on the first load, which got to be very slow on S3.

ferd commented 1 year ago

S3 Standard costs, as used in the estimates below: $0.005 per LIST request, $0.0004 per HEAD request.

Gotchas:

So with a directory of, say, 100 files, we would expect to have to do, for each sync:

  1. a scan
  2. a sync of all changed or updated files
  3. a final scan

The scan is likely the costliest part because it will be repeated every time and scales with the directory size.

This implies doing:

  1. a list of the directory ($0.005)
  2. a HEAD request per file to get the hash (100 × $0.0004 = $0.04)

By doing it twice, we're expecting roughly 10 cents per sync (2 × $0.045 = $0.09) even if it's a no-op.
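To make the arithmetic concrete, here is a back-of-the-envelope sketch using the per-request prices quoted above; the module and function names are purely illustrative:

```erlang
%% Illustrative only: estimate the request cost of a no-op sync,
%% using the per-request prices quoted above.
-module(s3_cost).
-export([noop_sync/1]).

-define(LIST_COST, 0.005).   %% one LIST of the directory
-define(HEAD_COST, 0.0004).  %% one HEAD per file to fetch the hash

%% A sync does two scans: one before and one after.
noop_sync(NumFiles) ->
    ScanCost = ?LIST_COST + NumFiles * ?HEAD_COST,
    2 * ScanCost.
```

Calling `s3_cost:noop_sync(100)` yields `0.09`, i.e. roughly 10 cents even when nothing changed.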

Specific optimizations could be done, but they would alter the workflow.

The most promising of these could further cut costs almost tenfold (for 100 files; it's less for fewer files and more for more files), but it requires restructuring the file abstractions used for the scan, which is part of the core tracking mechanism.

Without it, storing files on S3 is unlikely to be economically worthwhile, so I'm adding it to the list.
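As a sketch of where a roughly tenfold saving could come from (an assumption on my part, not necessarily the exact approach meant above): if the per-file hash can be recovered from the LIST response itself, e.g. via the ETag, the scan needs a single LIST and no HEADs, i.e. $0.005 instead of $0.045 for 100 files. erlcloud is shown as a stand-in client; ReVault's actual S3 module may differ.

```erlang
%% Sketch: derive per-file hashes from a single LIST response instead of
%% one HEAD per file. The ETag is the MD5 of the content for single-part
%% uploads, so it can stand in for a content hash under that assumption.
-module(s3_scan).
-export([scan/1]).

scan(Bucket) ->
    Listing = erlcloud_s3:list_objects(Bucket),
    Contents = proplists:get_value(contents, Listing, []),
    %% one $0.005 LIST replaces $0.005 + N * $0.0004 (LIST + N HEADs)
    [{proplists:get_value(key, Obj), proplists:get_value(etag, Obj)}
     || Obj <- Contents].
```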

Finally, S3 lets us avoid the "write to a tmp file and move" step: the temporary file can stay local no matter what, so the s3 module abstraction could hide the rename operation from a temporary dir to the final location as a single upload operation.
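A minimal sketch of what that could look like; the module and callback names are hypothetical, not ReVault's actual backend interface:

```erlang
%% Sketch: the "rename from tmp to final location" becomes a single upload;
%% the temporary file never leaves the local disk. A real version would
%% stream or multipart-upload large files instead of reading them whole.
-module(s3_backend).
-export([rename/3]).

rename(Bucket, TmpPath, FinalKey) ->
    {ok, Data} = file:read_file(TmpPath),
    _ = erlcloud_s3:put_object(Bucket, FinalKey, Data),
    file:delete(TmpPath).
```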

ferd commented 1 year ago

Trip report:

With bigger directories, the bigger problem is that the data flow is fully asynchronous when sending from the fsm to the tls server handler, and from the tls client to the fsm.

This is generally not a problem when syncing, because in normal mode the local disk tends to be faster than the network over large file transfers. However, in S3 mode as I tested it (with two nodes on the same host), the slowest link suddenly becomes the network between the s3-handling node and AWS S3 itself.

```
[disk] --> [local A] ---ext network----> [local B] --> [disk]
                           |
                           '- bottleneck

[disk] --> [local A] ---local network--> [local B] --> [s3]
                                                    |
                                        bottleneck -'
```

This appears to cause all the reading from the local disk to be done at once and transferred over the loopback interface to the s3-handling node, which piles it all up in memory until it sort of freezes over and the node becomes unresponsive.

It's gonna be time to start scheduling file transfers and handling flow control better.
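One possible direction, shown as a bare-bones illustration rather than ReVault's actual fsm protocol: cap the number of unacknowledged files in flight, so the disk-reading side blocks once the S3-handling side falls behind.

```erlang
%% Credit/window-based flow control sketch: never have more than ?WINDOW
%% files in flight; the receiver acks each file once it is safely stored.
-module(xfer_flow).
-export([send_files/2]).

-define(WINDOW, 4). %% assumed window size

send_files(Peer, Files) ->
    loop(Peer, Files, 0).

loop(_Peer, [], 0) ->
    done;
loop(Peer, Files, InFlight) when Files =:= []; InFlight >= ?WINDOW ->
    %% window full (or draining): block until the peer confirms a write
    receive {Peer, ack} -> loop(Peer, Files, InFlight - 1) end;
loop(Peer, [F | Rest], InFlight) ->
    {ok, Data} = file:read_file(F),
    Peer ! {file, F, Data}, %% the peer replies {PeerPid, ack} when stored
    loop(Peer, Rest, InFlight + 1).
```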

ferd commented 1 year ago

Other gotcha to figure out: how to deal with file names that file systems accept but that make invalid S3 object keys: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html
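One possible policy, sketched here purely as an assumption (nothing in this issue settles on it): percent-encode any byte outside a conservative safe set, so every file-system name maps onto a key AWS documents as safe.

```erlang
%% Sketch: map arbitrary file names onto S3-safe object keys by
%% percent-encoding every byte outside a conservative subset of the
%% characters AWS documents as safe (this even encodes some allowed
%% characters, e.g. '!' and '*', to stay simple and reversible).
-module(s3_keys).
-export([to_key/1]).

to_key(Name) ->
    Bin = unicode:characters_to_binary(Name),
    << <<(encode_byte(B))/binary>> || <<B>> <= Bin >>.

encode_byte(B) when B >= $a, B =< $z; B >= $A, B =< $Z; B >= $0, B =< $9 ->
    <<B>>;
encode_byte(B) when B =:= $-; B =:= $_; B =:= $.; B =:= $/ ->
    <<B>>;
encode_byte(B) ->
    list_to_binary(io_lib:format("%~2.16.0B", [B])).
```

For example, `s3_keys:to_key("a file?.txt")` returns `<<"a%20file%3F.txt">>`.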

ferd commented 1 year ago

Currently getting some awfully long times on manifest diffing when the S3 node is the one doing the work.


This happens only on the first sync, so there's a costly activity somewhere.
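One way to pin down that one-time cost, sketched with OTP's stock eprof profiler; `DiffPid` is an assumption standing in for whichever process runs the manifest diff:

```erlang
%% In a shell on the S3-handling node, before kicking off the first sync.
%% DiffPid is assumed to be the process doing the manifest diffing.
eprof:start(),
eprof:start_profiling([DiffPid]),
%% ... trigger the first sync here ...
eprof:stop_profiling(),
eprof:analyze(total). %% per-function call counts and time
```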