Open · AOOSG opened this issue 5 months ago
Some more things I tried last night:
I can keep retrying `put` of the engine build. It keeps uploading the chunks that aren't already uploaded whilst I still get errors on new chunks. Eventually the upload succeeds.
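That retry can be scripted; here is a minimal sketch that just re-runs the same `put` until it exits successfully (the `--source-path`/`--target-path` values are placeholders, and the exact flags may differ from your invocation):

```sh
# Brute-force retry: keep re-running the same put until it succeeds.
# The source folder, bucket and target path below are placeholders for illustration.
until longtail put \
      --source-path "Engine/" \
      --target-path "s3://my-bucket/builds/engine-5.4.json"; do
  echo "put failed, retrying in 30 seconds..." >&2
  sleep 30
done
```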
This sounds promising, but doing a `get` on the same engine build shows lots of warnings and errors about missing chunks. It seems the store is corrupted, and the only option is to delete all the files in the bucket.
Is there a way to repair the store? I fear any future `get` of builds uploaded later will also fail, because chunks may be shared between builds (and those chunks may be missing).
Further testing: I think I know what's going on. I managed to `put` an engine build successfully.
It looks like golongtail is so efficient at maximizing the upload bandwidth that sometimes the DNS resolve requests from the golongtail app time out!
On the router I limited the machine's upload speed to 90 Mbit/s (it's a ~120 Mbit/s upload internet connection), and I managed to successfully upload an engine build without errors.
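If you can't shape traffic on the router, a similar cap can be applied on the build machine itself. A minimal sketch, assuming a Linux host with interface eth0 (on Windows you would use something like the built-in QoS policies instead):

```sh
# Throttle outbound traffic on eth0 to roughly 90 Mbit/s with a token bucket filter.
# The interface name and rate are assumptions; adjust them for your machine.
sudo tc qdisc add dev eth0 root tbf rate 90mbit burst 32kbit latency 400ms

# Remove the limit again when you're done:
sudo tc qdisc del dev eth0 root
```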
So it seems solved for now. Two questions/comments:

1. Is there a way to limit `put` upload bandwidth usage (for example by allowing me to limit the number of threads used)?
2. After a failed `put` it seems there's no guarantee future engine builds won't be corrupted due to possible missing chunks that are shared. Is this true, and how can I fix or mitigate it?

You can indirectly reduce the number of threads doing network jobs via the `--worker-count` option. It defaults to the number of CPU cores you have, so if your machine has lots of cores it might be helpful to reduce the number.
For a corrupted upsync: it is designed so that it should not write the version index if it fails to upload blocks, but there might be something wrong there.
You have the option to add `--validate`, which will check that the uploaded index is correct, but it won't help if the blocks get corrupted on the way up to your storage medium...
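Put together, that could look roughly like this (a sketch only; the source and target paths are placeholders, and flag spellings may vary between versions, so check `longtail put --help`):

```sh
# Upload with fewer parallel workers and validate the uploaded index afterwards.
# Paths and bucket names below are placeholders.
longtail put \
    --source-path "Engine/" \
    --target-path "s3://my-bucket/builds/engine-5.4.json" \
    --worker-count 8 \
    --validate
```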
You might also be able to do some form of repair with either `--clone-store` or `--prune-store`, but there is no outright `repair` command.
OK thanks, feel free to close this issue.
I'll give these commands a try next time. The upload issue should be fixed by reducing the worker count.
I should be able to run `--prune-store-blocks` if a `put` fails, to remove all blocks that aren't in the index (which is what should happen when chunks fail to upload).
Two other things I'm thinking of as well (see the sync sketch after this list):

1. Have `golongtail put` only write to a local store on the server, and use rsync to the bucket to ensure the bucket is up to date. Downside: keeps around a local copy of the whole store on the server.
2. Use a separate store path per index (e.g. `gs://bucket/store<index>/`), where `index` is bumped each time there's an engine build that fails to upload one or more chunks. Downside: lose all block re-use when bumping the index.
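For the first idea, the sync step could look roughly like this (a sketch; the local store path and bucket names are placeholders, and whether you use the AWS CLI or gsutil depends on where the bucket lives):

```sh
# Push the locally written store up to the bucket after a successful local put.
# The local path and bucket names are placeholders.
aws s3 sync /srv/longtail-store s3://my-bucket/store

# For a Google Cloud Storage bucket, gsutil has an equivalent:
# gsutil -m rsync -r /srv/longtail-store gs://my-bucket/store
```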
@AOOSG Out of curiosity - what kind of computer and NIC do you have?
From the two machines I've tried (on the same network, 1 Gbps down & 120 Mbps up): both were having DNS resolves time out before I bandwidth-limited them to 90 Mbps upload.
Hi, could you try out https://github.com/DanEngelbrecht/golongtail/releases/tag/v0.4.4-pre1 without limiting the number of workers? 1) Does it create corrupted stores? 2) Does performance still look ok?
Thanks, I've given it a try now and it has worked so far.
Hi, I've recently been getting quite frequent failures while uploading a UE5.4 build to an AWS S3 bucket. It seemed to start happening without any changes to my system. The errors are similar to this:
I almost always get two failures during an upload, for different chunks. I had 6 chunks failing to upload once.
Is there a known fix for "no such host" errors, or can I retry failed blocks manually somehow? I'd like to just retry much more persistently to see if it succeeds eventually.
The worst part is that the bucket seems corrupted once this error occurs and I have to delete all objects in the bucket before trying to upload a new engine version again.
There are a couple of things I've tried:
Any other suggestions, or ways to repair failed uploads of chunks? Thanks!