Closed · sijin-dm closed this issue 1 year ago
So to be clear, things are working for a long while, and then suddenly they fail?
Have you seen this happen just one time, or does it happen every time?
You are right, things are fine in the very beginning epochs.
It happens every time we use the TorchWriter to save scalars and images, but the experiment runs normally when we do not use it. And the same code works fine in 0.18.4.
Created a ticket to track this internally. We'll get somebody assigned and get to the bottom of this.
Is there any update? @rb-determined-ai :)
I am looking at it now.
So far, I was not able to reproduce this, but I will keep trying. Can you check the size of the object on S3 involved in the upload:
ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0
?
Also, if possible:
- What is the size of the image tensor or grid being saved?
- How many times does the download succeed before it fails?
- Is versioning enabled on the bucket ml-checkpoint?
Also, I assume that in the snippet above you are actually calling self.logger.writer.add_image(name, image, global_step) and not self.logger.add_image(name, image, global_step).
The tfevents files failed to upload, so no path ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729 was created in S3. I found another failed experiment; its object files are shown in the following picture. I think the object sizes are similar.
Moreover:
- The image tensor size is 576x768.
- I think you mean the upload succeeding rather than the download? I have no idea how to get the count.
- I think versioning is enabled on the bucket ml-checkpoint; there are a lot of different versions under it. By the way, we use MinIO as the S3 object storage.
You are right, it is self.logger.writer.add_image. :)
I was not able to reproduce your issue with RELEASE.2022-09-25T15-44-53Z of MinIO. What is your version? On the positive side, assuming that the tfevents file size is the problem (not 100% sure it is, since I don't have a reproduction), we may have a hotfix for you in the next day or two.
Please let us know which version of MinIO you are using.
Until we come up with a more permanent solution, would you mind trying the following workaround:
Close the writer every time after adding the scalar (or scalars) and the image (images).
# Save to torchwriter every 200 iterations.
logger.writer.add_image(name, image, global_step)
logger.writer.add_scalar(name, value, global_step)
logger.writer.close()
Note: the PyTorch SummaryWriter (that is what logger.writer is) can be reused after it is closed; it will automatically reopen all underlying file writers on the next write.
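To illustrate why the periodic-close workaround is safe, here is a minimal sketch with a stand-in writer class (FileWriterStub is hypothetical, not the real SummaryWriter): closing flushes all buffered events so the file on disk is complete and safe to upload, and the next add_* call simply reopens the writer.

```python
class FileWriterStub:
    """Hypothetical stand-in for torch.utils.tensorboard.SummaryWriter,
    used only to illustrate the close-and-reuse pattern."""

    def __init__(self):
        self.closed = True
        self.flushed = []   # events already persisted ("on disk")
        self.pending = []   # events still buffered in memory

    def add_scalar(self, name, value, step):
        self.closed = False  # the real writer reopens its files on the next write
        self.pending.append((name, value, step))

    def close(self):
        # Closing flushes everything buffered, leaving a complete event file.
        self.flushed.extend(self.pending)
        self.pending.clear()
        self.closed = True


writer = FileWriterStub()
for step in (200, 400):
    writer.add_scalar("loss", 0.1, step)
    writer.close()           # reused on the next iteration without issue

print(len(writer.flushed))   # prints 2
```

The point of the sketch: calling close after every reporting interval never loses events, it only forces them to disk earlier.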
Also, please add a startup-hook.sh to your experiment directory with the following content (or append it to an existing startup-hook.sh):
file="$(python3 -c 'import determined.tensorboard.metric_writers.pytorch as x; print(x.__file__)')"
sed -i -e '/^ if "flush" in dir/d' "$file"
sed -i -e 's/ self\.writer\.flush/self.writer.close/g' "$file"
cat "$file"
This would patch our internal writer to do the same thing (close instead of flush).
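For reference, the effect of the two sed commands can be sketched in Python on a hypothetical snippet (the string below merely stands in for the targeted source; it is not copied from Determined):

```python
import re

# A hypothetical two-line snippet resembling what the sed commands target.
src = '        if "flush" in dir(self.writer):\n            self.writer.flush()\n'

# Step 1, like the first sed command: delete the guard line entirely.
kept = "".join(
    ln for ln in src.splitlines(keepends=True) if 'if "flush" in dir' not in ln
)

# Step 2, like sed 's/ self\.writer\.flush/self.writer.close/g':
# rewrite the flush call into a close call.
patched = re.sub(r" self\.writer\.flush", "self.writer.close", kept)

print(patched.strip())  # prints: self.writer.close()
```

The net result matches the workaround above: the internal writer closes (and therefore fully flushes) its event file on every report instead of merely flushing it.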
We use minio 8.0.10, installed via Helm chart:
- name: minio
version: 8.0.10
repository: https://helm.min.io/
condition: minio.install
We will try your workaround, thanks!
Did you get a chance to try the workaround?
We started trying it about 3 days ago, and so far so good. By the way, the same error had been happening sporadically in every experiment (different tasks and different repos). We are now using this workaround for every experiment, even ones that may not use TorchWriter.
We want to test it longer; I will update the test results here in one or two weeks if everything is still alright.
After testing the workaround for more than one week on 4 different tasks, it works well. Thank you @mpkouznetsov! So will you merge the fix to master and include it in a future release?
@sijin-dm, thank you for reporting this. Yes, the partial fix is on master and has been released this week (closing the instance of TorchWriter used by our Trial API to report metrics). However, we cannot close the instances of TorchWriter that you create yourself, so the workaround I suggested is the only solution for now.
We will revisit our TensorBoard support in the near future but have not scheduled this work yet.
Describe the bug
After upgrading to 0.19.4, TensorBoard event files fail to upload to the S3 checkpoint bucket after several epochs; the error is attached in the screenshot section.
Reproduction Steps
Save to torchwriter every 200 iterations.
logger.add_image(name, image, global_step)
logger.add_scalar(name, value, global_step)