determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🐛[bug] Tensorboard event file upload failed. #5156

Closed · sijin-dm closed this issue 1 year ago

sijin-dm commented 1 year ago

### Describe the bug

After upgrading to 0.19.4, TensorBoard event files fail to upload to the S3 checkpoint bucket after several epochs; the error is attached in the screenshot section.

### Reproduction Steps

  1. Upgrade determined from 0.18.4 to 0.19.4.
  2. Save images and scalars with TorchWriter every 200 iterations:

     from determined.tensorboard.metric_writers.pytorch import TorchWriter
     logger = TorchWriter()

     # Save to TorchWriter every 200 iterations.
     logger.add_image(name, image, global_step)
     logger.add_scalar(name, value, global_step)



### Expected Behavior

No error.

### Screenshot

[2022-09-29 16:42:27] [c193c38a] [rank=0] Traceback (most recent call last):
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return _run_code(code, main_globals, None,
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2022-09-29 16:42:27] [c193c38a] [rank=0]     exec(code, run_globals)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 132, in <module>
[2022-09-29 16:42:27] [c193c38a] [rank=0]     sys.exit(main(args.train_entrypoint))
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 123, in main
[2022-09-29 16:42:27] [c193c38a] [rank=0]     controller.run()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 274, in run
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self._run()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 342, in _run
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.upload_tb_files()
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/_trial_controller.py", line 117, in upload_tb_files
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.context._core.train.upload_tensorboard_files(
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 127, in upload_tensorboard_files
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self._tensorboard_manager.sync(selector, mangler, self._distributed.rank)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/util.py", line 80, in wrapped
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return fn(*arg, **kwarg)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/tensorboard/s3.py", line 64, in sync
[2022-09-29 16:42:27] [c193c38a] [rank=0]     self.client.upload_file(str(path), self.bucket, key_name)
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/site-packages/boto3/s3/inject.py", line 131, in upload_file
[2022-09-29 16:42:27] [c193c38a] [rank=0]     return transfer.upload_file(
[2022-09-29 16:42:27] [c193c38a] [rank=0]   File "/opt/conda/lib/python3.8/site-packages/boto3/s3/transfer.py", line 293, in upload_file
[2022-09-29 16:42:27] [c193c38a] [rank=0]     raise S3UploadFailedError(
[2022-09-29 16:42:27] [c193c38a] [rank=0] boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tensorboard-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6-0/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0 to ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0: An error occurred (NoSuchUpload) when calling the UploadPart operation: The specified multipart upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
[2022-09-29 16:42:29] [c193c38a] Process 0 exit with status code 1.
[2022-09-29 16:42:29] [c193c38a] Terminating remaining workers after failure of Process 0.
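
For reference, here is a minimal boto3 sketch that lists any multipart uploads the object store still considers in progress on this bucket, which is one way to dig into a NoSuchUpload error like the one above. The endpoint URL is a placeholder, and it assumes S3 credentials are already configured in the environment.

# Debugging sketch only: list in-progress multipart uploads on the bucket.
# The endpoint URL is a placeholder for the actual MinIO/S3 endpoint.
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio.example.local:9000")
resp = s3.list_multipart_uploads(Bucket="ml-checkpoint")
for upload in resp.get("Uploads", []):
    print(upload["Initiated"], upload["UploadId"], upload["Key"])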

### Environment

- Device or hardware: Nvidia A100 * 10
- Environment: Kubernetes 

### Additional Context

_No response_
rb-determined-ai commented 1 year ago

So to be clear, things are working for a long while, and then suddenly they fail?

Have you seen this happen just one time, or does it happen every time?

sijin-dm commented 1 year ago

> So to be clear, things are working for a long while, and then suddenly they fail?
>
> Have you seen this happen just one time, or does it happen every time?

You are right, things are fine in the very beginning epochs.

It happens every time we use the TorchWriter to save scalars and images, but the experiment runs normally when we do not use it. The same code works fine in 0.18.4.

rb-determined-ai commented 1 year ago

Created a ticket to track this internally. We'll get somebody assigned and get to the bottom of this.

sijin-dm commented 1 year ago

Is there any update? @rb-determined-ai :)

mpkouznetsov commented 1 year ago

I am looking at it now.

mpkouznetsov commented 1 year ago

So far I have not been able to reproduce this, but I will keep trying. Can you check the size of the object on S3 involved in the upload: ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0? (A small boto3 sketch for checking this follows the list below.)

Also, if possible:

  • what is the size of the image tensor or grid being saved?
  • how many times does the download succeed before it fails?
  • is versioning enabled on the bucket ml-checkpoint?
  • Also, I assume that in the snippet above you are actually calling self.logger.writer.add_image(name, image, global_step) and not self.logger.add_image(name, image, global_step).
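
Something like this boto3 sketch could be used to check the object size. The endpoint URL below is a placeholder for your MinIO/S3 endpoint, and it assumes credentials are configured in the environment; the bucket and key are taken from the error message above.

# Check the size of the S3/MinIO object from the traceback.
import boto3

s3 = boto3.client("s3", endpoint_url="http://minio.example.local:9000")
resp = s3.head_object(
    Bucket="ml-checkpoint",
    Key=(
        "51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/"
        "trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729."
        "d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0"
    ),
)
print(resp["ContentLength"], "bytes")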

sijin-dm commented 1 year ago

> So far I have not been able to reproduce this, but I will keep trying. Can you check the size of the object on S3 involved in the upload: ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729/trial/725/events.out.tfevents.1664468248.exp-729-trial-725-0-729.d7a76451-81d9-49e4-b2b2-61d46293cf29.6.390.0?
>
> Also, if possible:
>
> • what is the size of the image tensor or grid being saved?
> • how many times does the download succeed before it fails?
> • is versioning enabled on the bucket ml-checkpoint?
> • Also, I assume that in the snippet above you are actually calling self.logger.writer.add_image(name, image, global_step) and not self.logger.add_image(name, image, global_step).

The tfevent files failed to upload, so there is no path ml-checkpoint/51085318-5dd5-45c2-81fd-d3ad495f541c/tensorboard/experiment/729 being created in S3. I found another failed experiment; its object files are shown in the following picture. I think the objects' sizes are similar. [image]

Moreover,

  1. The image tensor size is 576x768. [image]

  2. I think you mean the upload succeeds instead of the download? I have no idea how to get that count.

  3. I think versioning is enabled on the bucket ml-checkpoint; there are a lot of different versions under it. BTW, we use minio as the S3 object storage.

  4. You are right, it is self.logger.writer.add_image. : )

mpkouznetsov commented 1 year ago

  1. Yes, I meant upload.

I was not able to reproduce your issue with RELEASE.2022-09-25T15-44-53Z of minio. What is your version? On the positive side, assuming that the tfevents file size is the problem (not 100% sure it is, since I don't have a reproduction), we may have a hotfix for you in the next day or two.
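
If it helps narrow this down, here is a small sketch that prints the sizes of the locally staged event files inside the trial container. The glob pattern mirrors the /tmp/tensorboard-* temp path from the traceback and may differ on other setups.

# Debugging sketch: print sizes of the locally staged tfevents files.
import glob
import os

for path in glob.glob("/tmp/tensorboard-*/events.out.tfevents.*"):
    print(os.path.getsize(path), path)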

mpkouznetsov commented 1 year ago

Please let us know which version of minio you are using.

Until we come up with a more permanent solution, would you mind trying the following workaround:

Close the writer every time after adding the scalar (or scalars) and the image (images).

# Save to torchwriter every 200 iterations.
logger.writer.add_image(name, image, global_step)
logger.writer.add_scalar(name, value, global_step)
logger.writer.close()

Note: PyTorch SummaryWriter (that is what logger.writer is) can be reused after it is closed. It will automatically reopen all underlying file writers on the next write.
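
To illustrate, here is a standalone sketch using plain PyTorch with a throwaway log directory:

# Demonstrates that a SummaryWriter keeps working after close(): the next
# add_* call transparently reopens the underlying event-file writers.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="/tmp/tb-close-demo")  # throwaway path
writer.add_scalar("demo/value", 1.0, global_step=0)
writer.close()                                        # finalizes the current event file

writer.add_scalar("demo/value", 2.0, global_step=1)   # reopens automatically
writer.close()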

Also, please add a startup-hook.sh to your experiment directory with the following content (or append this content to an existing startup-hook.sh):

file="$(python3 -c 'import determined.tensorboard.metric_writers.pytorch as x; print(x.__file__)')"
sed -i -e '/^        if "flush" in dir/d' "$file"
sed -i -e 's/    self\.writer\.flush/self.writer.close/g' "$file"
cat $file

This would patch our internal writer to do the same thing (close instead of flush).

sijin-dm commented 1 year ago

> Please let us know which version of minio you are using.
>
> Until we come up with a more permanent solution, would you mind trying the following workaround:
>
> Close the writer every time after adding the scalar (or scalars) and the image (images).
>
> # Save to torchwriter every 200 iterations.
> logger.writer.add_image(name, image, global_step)
> logger.writer.add_scalar(name, value, global_step)
> logger.writer.close()
>
> Note: PyTorch SummaryWriter (that is what logger.writer is) can be reused after it is closed. It will automatically reopen all underlying file writers on the next write.
>
> Also, please add a startup-hook.sh to your experiment directory with the following content (or append this content to an existing startup-hook.sh):
>
> file="$(python3 -c 'import determined.tensorboard.metric_writers.pytorch as x; print(x.__file__)')"
> sed -i -e '/^        if "flush" in dir/d' "$file"
> sed -i -e 's/    self\.writer\.flush/self.writer.close/g' "$file"
> cat $file
>
> This would patch our internal writer to do the same thing (close instead of flush).

We use minio 8.0.10, installed via the Helm chart:

  - name: minio
    version: 8.0.10
    repository: https://helm.min.io/
    condition: minio.install

We will try your workaround, thanks!

mpkouznetsov commented 1 year ago

Did you get a chance to try the workaround?

sijin-dm commented 1 year ago

> Did you get a chance to try the workaround?

We started trying it about 3 days ago, and so far so good. By the way, the same error had been happening occasionally in every experiment (different tasks and different repos). We are using this workaround for every experiment now, even ones that may not use TorchWriter. We want to test it longer, and I will update the test results here in maybe one or two weeks if everything is alright.

sijin-dm commented 1 year ago

After testing the workaround for more than one week with 4 different tasks, it works well. Thank you @mpkouznetsov! So will you merge a fix to master and include it in a future release?

mpkouznetsov commented 1 year ago

@sijin-dm, thank you for reporting this. Yes, the partial fix is on master, and it has been released this week (closing the instance of TorchWriter used by our Trial API to report metrics). However, we cannot close the instances of TorchWriter that you create yourself, so the workaround I suggested is the only solution for now.
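
If it is useful, here is a sketch of a small user-side helper that applies the same workaround to a TorchWriter you create yourself. The helper name is illustrative and not part of Determined's API.

# Illustrative helper, not part of Determined: close your own TorchWriter's
# underlying SummaryWriter after every logging burst, per the workaround above.
from contextlib import contextmanager

from determined.tensorboard.metric_writers.pytorch import TorchWriter


@contextmanager
def closing_writer(torch_writer: TorchWriter):
    """Yield the underlying SummaryWriter and close it when the block exits."""
    try:
        yield torch_writer.writer
    finally:
        torch_writer.writer.close()

# Usage, e.g. every 200 iterations:
#   with closing_writer(logger) as w:
#       w.add_image(name, image, global_step)
#       w.add_scalar(name, value, global_step)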

We will revisit our TensorBoard support in the near future but have not scheduled this work yet.