Full error log:
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/parsing.py:199: Attribute 'loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss'])`.
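This first warning is informational: Lightning notices that `loss` is an `nn.Module` stored as a hyperparameter even though its state is already captured by the checkpoint. If you define the `LightningModule` yourself (if it is constructed by a library you don't control, this is illustrative only), the suggested fix looks roughly like the sketch below; `MyForecastModule` and its arguments are hypothetical:

```python
import torch.nn as nn
import pytorch_lightning as pl


class MyForecastModule(pl.LightningModule):
    """Hypothetical module showing the warning's recommendation."""

    def __init__(self, loss: nn.Module, hidden_size: int = 128):
        super().__init__()
        # The loss is an nn.Module whose weights (if any) already live in the
        # checkpoint's state_dict, so exclude it from the saved hyperparameters.
        self.save_hyperparameters(ignore=["loss"])
        self.loss = loss
```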
/usr/lib/python3.10/subprocess.py:1796: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = _posixsubprocess.fork_exec(
2024-05-15 18:47:14,081 INFO worker.py:1749 -- Started a local Ray instance.
2024-05-15 18:47:15,647 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
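The INFO line just means Ray initialized itself with defaults. To control resources or attach to a cluster, call `ray.init(...)` before constructing the `Tuner`, as the message suggests. A minimal sketch; the resource counts are placeholders and `_train_tune` is the training function from this log, defined elsewhere:

```python
import ray
from ray import tune

# Explicit initialization before building the Tuner, per the INFO message.
# num_cpus/num_gpus are placeholders; size them to the actual machine.
ray.init(num_cpus=2, num_gpus=1)

tuner = tune.Tuner(
    _train_tune,  # training function from the log, defined elsewhere
    tune_config=tune.TuneConfig(num_samples=20),  # 20 trials, as configured
)
results = tuner.fit()
```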
+--------------------------------------------------------------------+
| Configuration for experiment     _train_tune_2024-05-15_18-47-11   |
+--------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator             |
| Scheduler                        FIFOScheduler                     |
| Number of trials                 20                                |
+--------------------------------------------------------------------+
View detailed results here: /root/ray_results/_train_tune_2024-05-15_18-47-11
To visualize your results with TensorBoard, run: tensorboard --logdir /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts
(_train_tune pid=1995) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: `ray.tune.integration.pytorch_lightning.TuneReportCallback` is deprecated. Use `ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback` instead.
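The deprecation warning has a direct replacement. If the callback is built in your own code (some tuning wrappers construct it internally, in which case upgrading that wrapper is the fix), swapping it in looks roughly like this; the `"val_loss"` metric name is an assumption, use whatever the module actually logs:

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCheckpointCallback

# Drop-in successor to the deprecated TuneReportCallback: reports metrics to
# Tune and writes a checkpoint at the same time.
callback = TuneReportCheckpointCallback(
    metrics={"loss": "val_loss"},  # Tune-side name -> Lightning-logged name
    on="validation_end",
)
trainer = pl.Trainer(max_steps=100, callbacks=[callback])
```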
(_train_tune pid=1995) Seed set to 1
(_train_tune pid=1995) GPU available: True (cuda), used: True
(_train_tune pid=1995) TPU available: False, using: 0 TPU cores
(_train_tune pid=1995) IPU available: False, using: 0 IPUs
(_train_tune pid=1995) HPU available: False, using: 0 HPUs
(_train_tune pid=1995) `Trainer(val_check_interval=1)` was configured so validation will run after every batch.
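This also explains why validation fires so often: an integer `val_check_interval` counts training batches, so `1` validates after every single batch. If that is unintentional, the once-per-epoch form is the float variant (a sketch; `max_steps=500` is a placeholder):

```python
import pytorch_lightning as pl

# val_check_interval=1 (int)   -> validate every training batch.
# val_check_interval=1.0 (float) -> validate once per epoch (100% of batches).
trainer = pl.Trainer(max_steps=500, val_check_interval=1.0)
```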
(_train_tune pid=1995) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00000_0_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=1995) 2024-05-15 18:47:26.258578: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:26.258632: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:26.393991: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=1995) 2024-05-15 18:47:27.873538: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=1995) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=1995)
(_train_tune pid=1995) | Name | Type | Params
(_train_tune pid=1995) --------------------------------------------------
(_train_tune pid=1995) 0 | loss | MAE | 0
(_train_tune pid=1995) 1 | padder | ConstantPad1d | 0
(_train_tune pid=1995) 2 | scaler | TemporalNorm | 0
(_train_tune pid=1995) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=1995) 4 | context_adapter | Linear | 733 K
(_train_tune pid=1995) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=1995) --------------------------------------------------
(_train_tune pid=1995) 1.2 M Trainable params
(_train_tune pid=1995) 0 Non-trainable params
(_train_tune pid=1995) 1.2 M Total params
(_train_tune pid=1995) 4.880 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:47:30,505 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00000
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=1995, ip=172.28.0.12, actor_id=5c789dcf4a6d166909d4404d01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00000 errored after 0 iterations at 2024-05-15 18:47:30. Total running time: 14s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00000_0_2024-05-15_18-47-16/error.txt
(_train_tune pid=2110) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2110) Seed set to 1
(_train_tune pid=2110) GPU available: True (cuda), used: True
(_train_tune pid=2110) TPU available: False, using: 0 TPU cores
(_train_tune pid=2110) IPU available: False, using: 0 IPUs
(_train_tune pid=2110) HPU available: False, using: 0 HPUs
(_train_tune pid=2110) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2110) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00001_1_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2110) 2024-05-15 18:47:39.421817: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2110) 2024-05-15 18:47:39.421885: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2110) 2024-05-15 18:47:39.423953: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2110) 2024-05-15 18:47:41.021553: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2110) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2110)
(_train_tune pid=2110) | Name | Type | Params
(_train_tune pid=2110) --------------------------------------------------
(_train_tune pid=2110) 0 | loss | MAE | 0
(_train_tune pid=2110) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2110) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2110) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2110) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2110) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2110) --------------------------------------------------
(_train_tune pid=2110) 1.2 M Trainable params
(_train_tune pid=2110) 0 Non-trainable params
(_train_tune pid=2110) 1.2 M Total params
(_train_tune pid=2110) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:47:43,063 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00001
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2110, ip=172.28.0.12, actor_id=2838f0110175210a4229689701000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00001 errored after 0 iterations at 2024-05-15 18:47:43. Total running time: 27s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00001_1_2024-05-15_18-47-16/error.txt
(_train_tune pid=2198) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2198) Seed set to 1
(_train_tune pid=2198) GPU available: True (cuda), used: True
(_train_tune pid=2198) TPU available: False, using: 0 TPU cores
(_train_tune pid=2198) IPU available: False, using: 0 IPUs
(_train_tune pid=2198) HPU available: False, using: 0 HPUs
(_train_tune pid=2198) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2198) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00002_2_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2198) 2024-05-15 18:47:50.035350: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2198) 2024-05-15 18:47:50.035407: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2198) 2024-05-15 18:47:50.037270: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2198) 2024-05-15 18:47:51.939885: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2198) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2198)
(_train_tune pid=2198) | Name | Type | Params
(_train_tune pid=2198) --------------------------------------------------
(_train_tune pid=2198) 0 | loss | MAE | 0
(_train_tune pid=2198) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2198) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2198) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2198) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2198) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2198) --------------------------------------------------
(_train_tune pid=2198) 1.2 M Trainable params
(_train_tune pid=2198) 0 Non-trainable params
(_train_tune pid=2198) 1.2 M Total params
(_train_tune pid=2198) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:47:54,727 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00002
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2198, ip=172.28.0.12, actor_id=c714b39740eeafb5a49fc4ae01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00002 errored after 0 iterations at 2024-05-15 18:47:54. Total running time: 38s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00002_2_2024-05-15_18-47-16/error.txt
(_train_tune pid=2288) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2288) Seed set to 1
(_train_tune pid=2288) GPU available: True (cuda), used: True
(_train_tune pid=2288) TPU available: False, using: 0 TPU cores
(_train_tune pid=2288) IPU available: False, using: 0 IPUs
(_train_tune pid=2288) HPU available: False, using: 0 HPUs
(_train_tune pid=2288) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2288) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00003_3_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2288) 2024-05-15 18:48:03.697816: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2288) 2024-05-15 18:48:03.697868: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2288) 2024-05-15 18:48:03.699360: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2288) 2024-05-15 18:48:05.019658: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2288) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2288)
(_train_tune pid=2288) | Name | Type | Params
(_train_tune pid=2288) --------------------------------------------------
(_train_tune pid=2288) 0 | loss | MAE | 0
(_train_tune pid=2288) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2288) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2288) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2288) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2288) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2288) --------------------------------------------------
(_train_tune pid=2288) 1.2 M Trainable params
(_train_tune pid=2288) 0 Non-trainable params
(_train_tune pid=2288) 1.2 M Total params
(_train_tune pid=2288) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:48:07,684 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00003
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2288, ip=172.28.0.12, actor_id=1d441b0c09369e41859fc92f01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00003 errored after 0 iterations at 2024-05-15 18:48:07. Total running time: 51s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00003_3_2024-05-15_18-47-16/error.txt
(_train_tune pid=2381) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2381) Seed set to 1
(_train_tune pid=2381) GPU available: True (cuda), used: True
(_train_tune pid=2381) TPU available: False, using: 0 TPU cores
(_train_tune pid=2381) IPU available: False, using: 0 IPUs
(_train_tune pid=2381) HPU available: False, using: 0 HPUs
(_train_tune pid=2381) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2381) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00004_4_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2381) 2024-05-15 18:48:16.700941: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2381) 2024-05-15 18:48:16.701004: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2381) 2024-05-15 18:48:16.702355: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2381) 2024-05-15 18:48:18.058275: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2381) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2381)
(_train_tune pid=2381) | Name | Type | Params
(_train_tune pid=2381) --------------------------------------------------
(_train_tune pid=2381) 0 | loss | MAE | 0
(_train_tune pid=2381) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2381) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2381) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2381) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2381) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2381) --------------------------------------------------
(_train_tune pid=2381) 1.2 M Trainable params
(_train_tune pid=2381) 0 Non-trainable params
(_train_tune pid=2381) 1.2 M Total params
(_train_tune pid=2381) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:48:20,080 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00004
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2381, ip=172.28.0.12, actor_id=5404bc2a6c8eaf14c899cd8b01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00004 errored after 0 iterations at 2024-05-15 18:48:20. Total running time: 1min 4s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00004_4_2024-05-15_18-47-16/error.txt
(_train_tune pid=2469) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2469) Seed set to 1
(_train_tune pid=2469) GPU available: True (cuda), used: True
(_train_tune pid=2469) TPU available: False, using: 0 TPU cores
(_train_tune pid=2469) IPU available: False, using: 0 IPUs
(_train_tune pid=2469) HPU available: False, using: 0 HPUs
(_train_tune pid=2469) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2469) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00005_5_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2469) 2024-05-15 18:48:28.901342: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2469) 2024-05-15 18:48:28.901397: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2469) 2024-05-15 18:48:28.902753: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2469) 2024-05-15 18:48:30.243292: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2469) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2469)
(_train_tune pid=2469) | Name | Type | Params
(_train_tune pid=2469) --------------------------------------------------
(_train_tune pid=2469) 0 | loss | MAE | 0
(_train_tune pid=2469) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2469) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2469) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2469) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2469) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2469) --------------------------------------------------
(_train_tune pid=2469) 1.2 M Trainable params
(_train_tune pid=2469) 0 Non-trainable params
(_train_tune pid=2469) 1.2 M Total params
(_train_tune pid=2469) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:48:32,276 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00005
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2469, ip=172.28.0.12, actor_id=94b4daf5bcf5e4bcb020ff4201000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00005 errored after 0 iterations at 2024-05-15 18:48:32. Total running time: 1min 16s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00005_5_2024-05-15_18-47-16/error.txt
(_train_tune pid=2555) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2555) Seed set to 1
(_train_tune pid=2555) GPU available: True (cuda), used: True
(_train_tune pid=2555) TPU available: False, using: 0 TPU cores
(_train_tune pid=2555) IPU available: False, using: 0 IPUs
(_train_tune pid=2555) HPU available: False, using: 0 HPUs
(_train_tune pid=2555) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2555) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00006_6_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2555) 2024-05-15 18:48:40.138108: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2555) 2024-05-15 18:48:40.138193: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2555) 2024-05-15 18:48:40.140225: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2555) 2024-05-15 18:48:42.122800: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2555) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2555)
(_train_tune pid=2555) | Name | Type | Params
(_train_tune pid=2555) --------------------------------------------------
(_train_tune pid=2555) 0 | loss | MAE | 0
(_train_tune pid=2555) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2555) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2555) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2555) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2555) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2555) --------------------------------------------------
(_train_tune pid=2555) 1.2 M Trainable params
(_train_tune pid=2555) 0 Non-trainable params
(_train_tune pid=2555) 1.2 M Total params
(_train_tune pid=2555) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:48:44,144 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00006
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2555, ip=172.28.0.12, actor_id=7dc8acbdaa622f3d1410253f01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00006 errored after 0 iterations at 2024-05-15 18:48:44. Total running time: 1min 28s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00006_6_2024-05-15_18-47-16/error.txt
(_train_tune pid=2641) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2641) Seed set to 1
(_train_tune pid=2641) GPU available: True (cuda), used: True
(_train_tune pid=2641) TPU available: False, using: 0 TPU cores
(_train_tune pid=2641) IPU available: False, using: 0 IPUs
(_train_tune pid=2641) HPU available: False, using: 0 HPUs
(_train_tune pid=2641) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2641) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00007_7_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2641) 2024-05-15 18:48:50.935377: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2641) 2024-05-15 18:48:50.935427: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2641) 2024-05-15 18:48:50.936953: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2641) 2024-05-15 18:48:52.507542: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2641) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2641)
(_train_tune pid=2641) | Name | Type | Params
(_train_tune pid=2641) --------------------------------------------------
(_train_tune pid=2641) 0 | loss | MAE | 0
(_train_tune pid=2641) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2641) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2641) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2641) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2641) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2641) --------------------------------------------------
(_train_tune pid=2641) 1.2 M Trainable params
(_train_tune pid=2641) 0 Non-trainable params
(_train_tune pid=2641) 1.2 M Total params
(_train_tune pid=2641) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:48:55,324 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00007
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2641, ip=172.28.0.12, actor_id=cd65c6f74b72c63491d398b001000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00007 errored after 0 iterations at 2024-05-15 18:48:55. Total running time: 1min 39s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00007_7_2024-05-15_18-47-16/error.txt
(_train_tune pid=2725) Seed set to 1
(_train_tune pid=2725) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2725) GPU available: True (cuda), used: True
(_train_tune pid=2725) TPU available: False, using: 0 TPU cores
(_train_tune pid=2725) IPU available: False, using: 0 IPUs
(_train_tune pid=2725) HPU available: False, using: 0 HPUs
(_train_tune pid=2725) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2725) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00008_8_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2725) 2024-05-15 18:49:03.635740: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2725) 2024-05-15 18:49:03.635794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2725) 2024-05-15 18:49:03.637208: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2725) 2024-05-15 18:49:04.955429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2725) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2725)
(_train_tune pid=2725) | Name | Type | Params
(_train_tune pid=2725) --------------------------------------------------
(_train_tune pid=2725) 0 | loss | MAE | 0
(_train_tune pid=2725) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2725) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2725) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2725) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2725) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2725) --------------------------------------------------
(_train_tune pid=2725) 1.2 M Trainable params
(_train_tune pid=2725) 0 Non-trainable params
(_train_tune pid=2725) 1.2 M Total params
(_train_tune pid=2725) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:49:06,998 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00008
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2725, ip=172.28.0.12, actor_id=fabdb55dfad9439e37e6f6ee01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00008 errored after 0 iterations at 2024-05-15 18:49:07. Total running time: 1min 51s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00008_8_2024-05-15_18-47-16/error.txt
(_train_tune pid=2812) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2812) Seed set to 1
(_train_tune pid=2812) GPU available: True (cuda), used: True
(_train_tune pid=2812) TPU available: False, using: 0 TPU cores
(_train_tune pid=2812) IPU available: False, using: 0 IPUs
(_train_tune pid=2812) HPU available: False, using: 0 HPUs
(_train_tune pid=2812) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2812) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00009_9_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2812) 2024-05-15 18:49:17.170487: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2812) 2024-05-15 18:49:17.170542: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2812) 2024-05-15 18:49:17.171997: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2812) 2024-05-15 18:49:18.485404: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2812) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2812)
(_train_tune pid=2812) | Name | Type | Params
(_train_tune pid=2812) --------------------------------------------------
(_train_tune pid=2812) 0 | loss | MAE | 0
(_train_tune pid=2812) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2812) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2812) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2812) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2812) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2812) --------------------------------------------------
(_train_tune pid=2812) 1.2 M Trainable params
(_train_tune pid=2812) 0 Non-trainable params
(_train_tune pid=2812) 1.2 M Total params
(_train_tune pid=2812) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:49:20,526 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00009
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2812, ip=172.28.0.12, actor_id=3c373755967d0aafb25b94ec01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00009 errored after 0 iterations at 2024-05-15 18:49:20. Total running time: 2min 4s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00009_9_2024-05-15_18-47-16/error.txt
(_train_tune pid=2908) Seed set to 1
(_train_tune pid=2908) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2908) GPU available: True (cuda), used: True
(_train_tune pid=2908) TPU available: False, using: 0 TPU cores
(_train_tune pid=2908) IPU available: False, using: 0 IPUs
(_train_tune pid=2908) HPU available: False, using: 0 HPUs
(_train_tune pid=2908) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2908) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00010_10_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2908) 2024-05-15 18:49:29.842498: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2908) 2024-05-15 18:49:29.842555: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2908) 2024-05-15 18:49:29.844072: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2908) 2024-05-15 18:49:31.184552: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2908) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2908)
(_train_tune pid=2908) | Name | Type | Params
(_train_tune pid=2908) --------------------------------------------------
(_train_tune pid=2908) 0 | loss | MAE | 0
(_train_tune pid=2908) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2908) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2908) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2908) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2908) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2908) --------------------------------------------------
(_train_tune pid=2908) 1.2 M Trainable params
(_train_tune pid=2908) 0 Non-trainable params
(_train_tune pid=2908) 1.2 M Total params
(_train_tune pid=2908) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:49:33,208 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00010
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2908, ip=172.28.0.12, actor_id=75e6cd94f8d65d7463d4795301000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00010 errored after 0 iterations at 2024-05-15 18:49:33. Total running time: 2min 17s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00010_10_2024-05-15_18-47-16/error.txt
(_train_tune pid=2993) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=2993) Seed set to 1
(_train_tune pid=2993) GPU available: True (cuda), used: True
(_train_tune pid=2993) TPU available: False, using: 0 TPU cores
(_train_tune pid=2993) IPU available: False, using: 0 IPUs
(_train_tune pid=2993) HPU available: False, using: 0 HPUs
(_train_tune pid=2993) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=2993) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00011_11_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=2993) 2024-05-15 18:49:40.330171: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=2993) 2024-05-15 18:49:40.330237: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=2993) 2024-05-15 18:49:40.333261: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=2993) 2024-05-15 18:49:42.299129: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=2993) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=2993)
(_train_tune pid=2993) | Name | Type | Params
(_train_tune pid=2993) --------------------------------------------------
(_train_tune pid=2993) 0 | loss | MAE | 0
(_train_tune pid=2993) 1 | padder | ConstantPad1d | 0
(_train_tune pid=2993) 2 | scaler | TemporalNorm | 0
(_train_tune pid=2993) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=2993) 4 | context_adapter | Linear | 733 K
(_train_tune pid=2993) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=2993) --------------------------------------------------
(_train_tune pid=2993) 1.2 M Trainable params
(_train_tune pid=2993) 0 Non-trainable params
(_train_tune pid=2993) 1.2 M Total params
(_train_tune pid=2993) 4.880 Total estimated model params size (MB)
Sanity Checking: | | 0/? [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:49:45,101 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00011
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=2993, ip=172.28.0.12, actor_id=4ea9d8dd080bbcba516e06d601000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(self._args, self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00011 errored after 0 iterations at 2024-05-15 18:49:45. Total running time: 2min 29s
Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00011_11_2024-05-15_18-47-16/error.txt
(_train_tune pid=3081) Seed set to 1
(_train_tune pid=3081) /usr/local/lib/python3.10/dist-packages/ray/tune/integration/pytorch_lightning.py:198: ray.tune.integration.pytorch_lightning.TuneReportCallback
is deprecated. Use ray.tune.integration.pytorch_lightning.TuneReportCheckpointCallback
instead.
(_train_tune pid=3081) GPU available: True (cuda), used: True
(_train_tune pid=3081) TPU available: False, using: 0 TPU cores
(_train_tune pid=3081) IPU available: False, using: 0 IPUs
(_train_tune pid=3081) HPU available: False, using: 0 HPUs
(_train_tune pid=3081) Trainer(val_check_interval=1)
was configured so validation will run after every batch.
(_train_tune pid=3081) Missing logger folder: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/working_dirs/_train_tune_4003e_00012_12_2024-05-15_18-47-16/lightning_logs
(_train_tune pid=3081) 2024-05-15 18:49:52.043224: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(_train_tune pid=3081) 2024-05-15 18:49:52.043281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(_train_tune pid=3081) 2024-05-15 18:49:52.044725: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(_train_tune pid=3081) 2024-05-15 18:49:53.422987: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
(_train_tune pid=3081) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(_train_tune pid=3081)
(_train_tune pid=3081) | Name | Type | Params
(_train_tune pid=3081) --------------------------------------------------
(_train_tune pid=3081) 0 | loss | MAE | 0
(_train_tune pid=3081) 1 | padder | ConstantPad1d | 0
(_train_tune pid=3081) 2 | scaler | TemporalNorm | 0
(_train_tune pid=3081) 3 | hist_encoder | LSTM | 484 K
(_train_tune pid=3081) 4 | context_adapter | Linear | 733 K
(_train_tune pid=3081) 5 | mlp_decoder | MLP | 2.4 K
(_train_tune pid=3081) --------------------------------------------------
(_train_tune pid=3081) 1.2 M Trainable params
(_train_tune pid=3081) 0 Non-trainable params
(_train_tune pid=3081) 1.2 M Total params
(_train_tune pid=3081) 4.880 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
2024-05-15 18:49:56,077 ERROR tune_controller.py:1331 -- Trial task failed for trial _train_tune_4003e_00012
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, *kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2623, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfMemoryError): ray::ImplicitFunc.train() (pid=3081, ip=172.28.0.12, actor_id=6c3cdc79fa3f106c5a87198b01000000, repr=_train_tune)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "/usr/local/lib/python3.10/dist-packages/ray/air/_internal/util.py", line 98, in run
self._ret = self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/tune/trainable/function_trainable.py", line 45, in
Trial _train_tune_4003e_00012 errored after 0 iterations at 2024-05-15 18:49:56. Total running time: 2min 40s Error file: /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/_train_tune_4003e_00012_12_2024-05-15_18-47-16/error.txt
[... trials _train_tune_4003e_00013 through _train_tune_4003e_00019 repeat the identical sequence: each worker prints the same Lightning setup and 1.2 M-parameter model summary, stalls at "Sanity Checking DataLoader 0: 0%", and fails after 0 iterations with ray.exceptions.RayTaskError(OutOfMemoryError). The final trial, _train_tune_4003e_00019, errored at 2024-05-15 18:51:25 after a total running time of 4min 9s; each trial's full traceback is written to its own error.txt under /tmp/ray/session_2024-05-15_18-47-11_316201_352/artifacts/2024-05-15_18-47-15/_train_tune_2024-05-15_18-47-11/driver_artifacts/ ...]
RuntimeError Traceback (most recent call last)
It seems to give a GPU OOM error, so your GPU doesn't have enough memory to run this task. This would also explain why it does run on weekly/monthly, as those granularities have less data than daily.
The solution is to buy a GPU with more RAM, or run a less compute-intensive experiment (e.g. use less data, or a less frequent granularity). A minimal sketch of the "less frequent" option is below.
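A minimal sketch of that resampling, assuming df_daily is the long-format frame with unique_id, ds, and y columns used elsewhere in this thread:

```python
import pandas as pd

# Sketch only: aggregate the daily long-format frame to weekly means,
# which this thread reports does fit in GPU memory.
df_daily['ds'] = pd.to_datetime(df_daily['ds'])
df_weekly = (
    df_daily.set_index('ds')
            .groupby('unique_id')['y']
            .resample('W')
            .mean()
            .reset_index()
)
```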
But I am running it in MS Azure on a GPU compute instance with 112 GB of RAM.
The error relates to GPU RAM, not regular (CPU) RAM, so the GPU in your Azure machine does not have enough memory. Choose an instance with an A100, for example; that will give you more GPU RAM.
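A quick way to check what the GPU itself has, as opposed to the 112 GB of host RAM the VM reports (a minimal sketch assuming PyTorch and a single visible CUDA device):

```python
import torch

# Report the GPU's own memory, which is the limit that matters for this OOM;
# host RAM on the Azure VM is not what the trials are exhausting.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB of GPU memory")
else:
    print("No CUDA device visible")
```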
Funny enough, it worked after I ran a loop over each unique_id with the same resources, like:

all_predictions = []
unique_ids = df_daily['unique_id'].unique()
for unique_id in unique_ids:
    df_sub = df_daily[df_daily['unique_id'] == unique_id]
    nf.fit(df=df_sub, val_size=365, sort_df=True, verbose=True)
    Y_hat_df = nf.predict()
    Y_hat_df['unique_id'] = unique_id  # add the unique_id back to the predictions
    all_predictions.append(Y_hat_df.reset_index())
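For completeness, the per-series forecasts collected above still need to be stitched back together; a hypothetical continuation of that loop (the output filename is made up):

```python
import pandas as pd

# Combine the per-series forecasts gathered in the loop into one frame.
df_predictions = pd.concat(all_predictions, ignore_index=True)
df_predictions.to_csv('./daily_neuralforecast_per_series.csv', index=False)
```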
What happened + What you expected to happen
I am trying to run this code on a daily granularity of 240 time series, but I am getting an error saying that no optimal metric was found and they are all zero. Nonetheless, it worked fine for the same dataset when I transformed it to monthly and weekly.
Versions / Dependencies
%pip install neuralforecast "torch<2.0.0"
Reproduction script
%pip install "flaml[automl]"
from flaml import AutoML import pandas as pd from sklearn.model_selection import TimeSeriesSplit
class EnsembleModelTrainer: def init(self, dataframe, unique_id_column, target_column, horizon): self.df = dataframe self.unique_id_column = unique_id_column self.target_column = target_column self.horizon = horizon self.models = {}
horizon = 365 config = dict(max_steps=100, val_check_steps=1, input_size=-1)
Configure models
models = [AutoLSTM(h=horizon,config=config, num_samples=20), AutoRNN(h=horizon,config=config, num_samples=20) ]
Initialize NeuralForecast
nf = NeuralForecast(models=models, freq='D' )
Fit model
nf.fit(df=df_daily, val_size=365, sort_df=True, verbose=True) Y_hat_df = nf.predict() Y_hat_df = Y_hat_df.reset_index() Y_hat_df.to_csv('./daily_neuralforecast.csv')
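If per-series looping is not an option, the other lever is shrinking each trial's memory footprint. A hedged variant of the config above: input_size=-1 asks the recurrent models to consume the full history of every series, which is a plausible contributor to the OOM on daily data, and treating batch_size as an accepted Auto* config key is an assumption, not something verified here:

```python
# Sketch only: bound the encoder window instead of using the full history,
# and assume batch_size is an accepted key in the Auto* search config.
config = dict(
    max_steps=100,
    val_check_steps=1,
    input_size=64,   # fixed-length history window instead of -1 (full history)
    batch_size=8,    # smaller batches lower peak GPU memory, if supported
)
```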
Issue Severity
High: It blocks me from completing my task.