DanEngelbrecht / golongtail

Command line front end for longtail synchronization tool
MIT License

Retry or repair failed put of chunks? #257

Open AOOSG opened 5 months ago

AOOSG commented 5 months ago

Hi, I've recently been getting quite frequent failures while uploading a UE5.4 build to an AWS S3 bucket. It seemed to start happening without any changes to my system. The errors are similar to this:

ERRO putStoredBlock: failed to put stored block at `chunks/9f85/0x9f854a10bfd90a9d.lsb` in `s3://<my-aws-bucket>/store/`: s3BlobObject.Write(): operation error S3: PutObject, https response error StatusCode: 0, RequestID: , HostID: , request send failed, Put "https://<my-aws-bucket>.s3.eu-west-2.amazonaws.com/store/chunks/9f85/0x9f854a10bfd90a9d.lsb?x-id=PutObject": dial tcp: lookup <my-aws-bucket>.s3.eu-west-2.amazonaws.com: no such host  blobClient="s3://<my-aws-bucket>/store/" blockHash=11494675059731663517 fname=putStoredBlock key=chunks/9f85/0x9f854a10bfd90a9d.lsb s="s3://<my-aws-bucket>/store/"

I almost always get two failures during an upload, for different chunks. Once I had 6 chunks fail to upload.

Is there a known fix for "no such host" errors, or can I retry failed blocks manually somehow? I'd like to just retry much more persistently to see if it succeeds eventually.

The worst part is that the bucket seems corrupted once this error occurs, and I have to delete all objects in the bucket before trying to upload a new engine version again.

There are a couple of things I've tried:

  1. Searched the web for a fix: no obvious solutions (e.g. link and link)
  2. Used a Google Cloud bucket instead: similar errors occur, usually a couple per upload.
  3. Tried a different DNS server (e.g. 8.8.8.8), but it didn't seem to help.
  4. Updated to the latest golongtail preview release.

Any other suggestions, or ways to repair failed uploads of chunks? Thanks!

AOOSG commented 5 months ago

Some more things I tried last night:

I can keep retrying the put of the engine build. Each retry uploads the chunks that aren't already uploaded, while I still get errors on new chunks. Eventually the upload succeeds.

This sounds promising, but doing a get of the same engine build shows lots of warnings and errors about missing chunks. It seems the store is corrupted, and the only option is to delete all the files in the bucket.

Is there a way to repair the store? I fear that any get of future uploaded builds will also fail, because chunks may be shared between builds (and those chunks may be missing).

AOOSG commented 5 months ago

Further testing, I think I know what's going on. I managed to put an engine build successfully.

It looks like golongtail is so efficient at maximizing upload bandwidth that the DNS resolution requests from the golongtail app sometimes time out!

On the router I limited the machine's upload speed to 90 Mbit/s (it's a ~120 Mbit/s upload internet connection), and I managed to successfully upload an engine build without errors.

So it seems solved for now, two questions/comments:

DanEngelbrecht commented 5 months ago

You can indirectly reduce the number of threads doing network jobs via the --worker-count option. It defaults to the number of CPU cores you have, so if your machine has lots of cores it might be helpful to reduce it.

For the corrupted upsync: it is designed so it should not write the version index if it fails to upload blocks, but there might be something wrong there.

You have the option to add --validate, which will check that the uploaded index is correct, but it won't help if the blocks get corrupted on the way up to your storage medium...
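Putting the two suggestions together, an invocation might look like the sketch below. Only --worker-count and --validate are flags named in this thread; the put path flags, bucket name, and paths are placeholder assumptions, so check `longtail put --help` for the exact interface.

```shell
# Sketch: upload with fewer concurrent network workers (default is the CPU
# core count) and validate the uploaded index afterwards.
# Bucket name and paths are placeholders.
longtail put \
  --source-path "EngineBuild/" \
  --target-path "s3://my-bucket/store/builds/engine-5.4.json" \
  --worker-count 4 \
  --validate
```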

DanEngelbrecht commented 5 months ago

You might also be able to do some form of repair with either --clone-store or --prune-store but there is no outright repair command.
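A repair-by-copy along those lines could look like the following sketch. The flag names here are assumptions, not confirmed by this thread; verify with `longtail clone-store --help` before running anything against a real store.

```shell
# Sketch: clone the intact blocks of a damaged store into a fresh store,
# then point future gets/puts at the new location.
# Flag names and bucket paths are assumptions.
longtail clone-store \
  --source-storage-uri "s3://my-bucket/store/" \
  --target-storage-uri "s3://my-bucket/store-repaired/"
```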

AOOSG commented 5 months ago

OK thanks, feel free to close this issue.

I'll give these commands a try next time. The upload issue should be fixed by reducing the worker count.

I should be able to run --prune-store-blocks if a put fails, to remove all blocks that aren't in the index (which should happen if chunks fail to upload).

Two other things I'm thinking of as well:

  1. Have golongtail put write only to a local store on the server, and use rsync to the bucket to ensure the bucket is up to date. Downside: keeps a local copy of the whole store on the server.
  2. If there's an error uploading a chunk, change the store location in the bucket (i.e. gs://bucket/store<index>/). index is bumped each time an engine build fails to upload one or more chunks. Downside: loses all block re-use when bumping the index.
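Idea 1 above can be sketched as a two-step pipeline: a local put (no network, so no DNS failures), followed by an incremental one-way sync. `aws s3 sync` only transfers new or changed files; the longtail put flags, paths, and bucket name are placeholder assumptions.

```shell
# 1) Upload the build into a local filesystem store on the server.
longtail put \
  --source-path "EngineBuild/" \
  --target-path "/srv/longtail/store/builds/engine-5.4.json"

# 2) Mirror the local store to the bucket; only changed or new objects
#    are transferred, and a failed sync can simply be re-run.
aws s3 sync /srv/longtail/store s3://my-bucket/store
```

Because the sync is idempotent, a failed transfer never leaves the bucket referencing blocks that were never uploaded; the index only goes up once its blocks are already local.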

DanEngelbrecht commented 4 months ago

@AOOSG Out of curiosity - what kind of computer and NIC do you have?

AOOSG commented 4 months ago

From the two machines I've tried (on the same network, 1 Gbps down & 120 Mbps up):

Both had DNS resolution timing out before I bandwidth-limited them to 90 Mbps upload.

DanEngelbrecht commented 4 months ago

Hi, could you try out https://github.com/DanEngelbrecht/golongtail/releases/tag/v0.4.4-pre1 without limiting the number of workers? 1) Does it create corrupted stores? 2) Does performance still look ok?

AOOSG commented 3 months ago

Thanks, I've given it a try now and it's worked so far.

  1. Not that I've seen, from the couple of times I've run it.
  2. Yep, I used the default 8 threads and it maxed out 120 Mbps.