Closed vietnguyen012 closed 1 year ago
Exalate commented:
sara-tagger commented:
Thanks for submitting this feature request
Exalate commented:
vietnguyen012 commented:
Hello, can anyone help me?
➤ Maxime Verger commented:
:bulb: Heads up! We're moving issues to Jira: https://rasa-open-source.atlassian.net/browse/OSS.
From now on, this Jira board is the place where you can browse (without an account) and create issues (you'll need a free Jira account for that). This GitHub issue has already been migrated to Jira and will be closed on January 9th, 2023. Do not forget to subscribe to the corresponding Jira issue!
:arrow_right: More information in the forum: https://forum.rasa.com/t/migration-of-rasa-oss-issues-to-jira/56569.
What problem are you trying to solve?
I'm trying to use multiple GPUs to optimize the TED policy during training.
What's your suggested solution?
At first, I tried to add a mirrored strategy (tf.distribute.MirroredStrategy) to the training function of the TED policy. Below is run_training from ted_policy.py. It contains the old segment, which uses only one GPU, followed by the segment I added, which uses the mirrored strategy.
Examples (if relevant)
    def run_training(
        self, model_data: RasaModelData, label_ids: Optional[np.ndarray] = None
    ) -> None:
        """Feeds the featurized training data to the model.

        Args:
            model_data: Featurized training data.
            label_ids: Label ids corresponding to the data points in `model_data`.
                These may or may not be used by the function depending
                on how the policy is trained.
        """
        os.environ.pop('TF_CONFIG', None)
        tf_config = {
            'cluster': {
                'worker': ['localhost:12345', 'localhost:23456']
            },
            'task': {'type': 'worker', 'index': 0}
        }
        os.environ['TF_CONFIG'] = json.dumps(tf_config)
        tf_config = json.loads(os.environ['TF_CONFIG'])
        num_workers = len(tf_config['cluster']['worker'])

        # Old segment: training on a single GPU.
        if not self.finetune_mode:
            # This means the model wasn't loaded from a
            # previously trained model and hence needs
            # to be instantiated.
            self.model = self.model_class()(
                model_data.get_signature(),
                self.config,
                isinstance(self.featurizer, MaxHistoryTrackerFeaturizer),
                self._label_data,
                self._entity_tag_specs,
            )
            self.model.compile(
                optimizer=tf.keras.optimizers.Adam(self.config[LEARNING_RATE])
            )
        (
            data_generator,
            validation_data_generator,
        ) = rasa.utils.train_utils.create_data_generators(
            model_data,
            self.config[BATCH_SIZES],
            self.config[EPOCHS],
            self.config[BATCH_STRATEGY],
            self.config[EVAL_NUM_EXAMPLES],
            self.config[RANDOM_SEED],
        )
        callbacks = rasa.utils.train_utils.create_common_callbacks(
            self.config[EPOCHS],
            self.config[TENSORBOARD_LOG_DIR],
            self.config[TENSORBOARD_LOG_LEVEL],
            self.tmp_checkpoint_dir,
        )
        self.model.fit(
            data_generator,
            epochs=self.config[EPOCHS],
            validation_data=validation_data_generator,
            validation_freq=self.config[EVAL_NUM_EPOCHS],
            callbacks=callbacks,
            verbose=False,
            shuffle=False,  # we use custom shuffle inside data generator
        )

        # New segment: the same training, with the model built under a mirrored strategy.
        global_batch_size = self.config[BATCH_SIZES]*2
        tf.debugging.set_log_device_placement(True)
        gpus = tf.config.list_logical_devices('GPU')
        strategy = tf.distribute.MirroredStrategy(gpus)
        if not self.finetune_mode:
            # This means the model wasn't loaded from a
            # previously trained model and hence needs
            # to be instantiated.
            with strategy.scope():
                self.model = self.model_class()(
                    model_data.get_signature(),
                    self.config,
                    isinstance(self.featurizer, MaxHistoryTrackerFeaturizer),
                    self._label_data,
                    self._entity_tag_specs,
                )
                self.model.compile(
                    optimizer=tf.keras.optimizers.Adam(self.config[LEARNING_RATE])
                )
        (
            data_generator,
            validation_data_generator,
        ) = rasa.utils.train_utils.create_data_generators(
            model_data,
            global_batch_size,
            self.config[EPOCHS],
            self.config[BATCH_STRATEGY],
            self.config[EVAL_NUM_EXAMPLES],
            self.config[RANDOM_SEED],
        )
        callbacks = rasa.utils.train_utils.create_common_callbacks(
            self.config[EPOCHS],
            self.config[TENSORBOARD_LOG_DIR],
            self.config[TENSORBOARD_LOG_LEVEL],
            self.tmp_checkpoint_dir,
        )
        self.model.fit(
            data_generator,
            epochs=self.config[EPOCHS],
            validation_data=validation_data_generator,
            validation_freq=self.config[EVAL_NUM_EPOCHS],
            callbacks=callbacks,
            verbose=False,
            shuffle=False,  # we use custom shuffle inside data generator
        )
The first run, without the mirrored strategy, completes fine, but the run with the mirrored strategy fails with a conflict inside the TED model. I can't figure out what causes it (distributed training makes this very hard to debug).
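One detail that may be worth checking in the mirrored-strategy segment is `global_batch_size = self.config[BATCH_SIZES]*2`. If `BATCH_SIZES` holds a list such as `[64, 256]` (an assumption here, based on Rasa accepting a two-element list for a linearly increasing batch size), multiplying by 2 repeats the list instead of doubling each entry. The sketch below uses a hypothetical helper, `scale_batch_sizes`, which is not part of Rasa, to scale either an integer or a list by the number of replicas.

```python
from typing import List, Union

BatchSizes = Union[int, List[int]]


def scale_batch_sizes(batch_sizes: BatchSizes, num_replicas: int) -> BatchSizes:
    """Return the global batch size(s) for `num_replicas` replicas.

    Hypothetical helper for illustration only; not part of Rasa.
    """
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    if isinstance(batch_sizes, int):
        return batch_sizes * num_replicas
    # Scale each entry; `batch_sizes * num_replicas` would repeat the list.
    return [size * num_replicas for size in batch_sizes]


# List multiplication repeats the list rather than scaling it.
repeated = [64, 256] * 2
scaled = scale_batch_sizes([64, 256], 2)
print(repeated)  # [64, 256, 64, 256]
print(scaled)    # [128, 512]
```

With a strategy object in hand, `strategy.num_replicas_in_sync` would be a natural value to pass as `num_replicas`.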
Is anything blocking this from being implemented? (if relevant)
This is the log after running:
/root/rasa/rasa/shared/core/slot_mappings.py:216: UserWarning: Slot auto-fill has been removed in 3.0 and replaced with a new explicit mechanism to set slots. Please refer to https://rasa.com/docs/rasa/domain#slots to learn more.
  UserWarning,
/root/rasa/rasa/shared/core/slot_mappings.py:216: UserWarning: Slot auto-fill has been removed in 3.0 and replaced with a new explicit mechanism to set slots. Please refer to https://rasa.com/docs/rasa/domain#slots to learn more.
  UserWarning,
Processed story blocks: 100%|███| 13/13 00:00<00:00, 1271.09it/s, # trackers=1
Processed story blocks: 100%|███| 13/13 00:00<00:00, 148.73it/s, # trackers=12
Processed story blocks: 100%|████| 13/13 00:00<00:00, 21.72it/s, # trackers=50
Processed story blocks: 100%|████| 13/13 00:00<00:00, 26.74it/s, # trackers=50
Processed rules: 100%|███████████| 48/48 00:00<00:00, 252.89it/s, # trackers=1
/root/rasa/rasa/utils/train_utils.py:530: UserWarning: constrain_similarities is set to `False`. It is recommended to set it to `True` when using cross-entropy loss.
  category=UserWarning,
/root/rasa/rasa/shared/utils/io.py:99: UserWarning: 'evaluate_every_number_of_epochs=20' is greater than 'epochs=2'. No evaluation will occur.
Processed trackers: 100%|█████| 512/512 00:00<00:00, 965.76it/s, # action=1635
/root/rasa/rasa/utils/tensorflow/model_data_utils.py:384: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  np.array(values), number_of_dimensions=4
/root/rasa/rasa/utils/tensorflow/model_data_utils.py:400: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  MASK: [FeatureArray(np.array(attribute_masks), number_of_dimensions=3)]
2022-02-17 08:55:08.464804: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-17 08:55:09.599068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30652 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:0a:00.0, compute capability: 7.0
2022-02-17 08:55:09.601503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30652 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:85:00.0, compute capability: 7.0
/root/rasa/rasa/utils/tensorflow/model_data.py:750: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  np.concatenate(np.array(f)),
Epochs: 0%| | 0/2 [00:00
2022-02-17 08:55:10.764997: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_1_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_1_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_1_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_2_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_2_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_2_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
Epochs: 50%|███ | 1/2 [00:33<00:33, 33.25s/it, t_loss=6, loss=5.72, acc=0.518]
/root/rasa/rasa/utils/tensorflow/model_data.py:750: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  np.concatenate(np.array(f)),
Epochs: 100%|██| 2/2 [00:53<00:00, 26.99s/it, t_loss=5.44, loss=4.94, acc=0.921]
Epochs: 0%| | 0/2 [00:00
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
2022-02-17 08:56:05.702741: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_2" op: "FlatMapDataset" input: "TensorDataset/_1" attr { key: "Targuments" value { list { } } } attr { key: "f" value { func { name: "__inference_Dataset_flat_map_flat_map_fn_21967" } } } attr { key: "output_shapes" value { list { shape { dim { size: -1 } dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } dim { size: -1 } } shape { dim { size: -1 } } shape { dim { size: -1 } } } } } attr { key: "output_types" value { list { type: DT_FLOAT type: DT_INT64 type: DT_FLOAT type: DT_INT64 type: DT_FLOAT type: DT_FLOAT type: DT_INT64 type: DT_FLOAT type: DT_INT64 type: DT_FLOAT type: DT_FLOAT type: DT_INT64 type: DT_FLOAT type: DT_INT64 } } } . Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_4_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_4_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_4_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_5_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_5_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_5_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/cond_6_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/cond_6_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/cond_6_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/replica_1/cond_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/replica_1/cond_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/replica_1/cond_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/replica_1/cond_1_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/replica_1/cond_1_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/replica_1/cond_1_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/framework/indexed_slices.py:449: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("gradients/replica_1/cond_2_grad/Identity_1:0", shape=(None,), dtype=int64), values=Tensor("gradients/replica_1/cond_2_grad/Identity:0", shape=(None,), dtype=float32), dense_shape=Tensor("gradients/replica_1/cond_2_grad/Identity_2:0", shape=(1,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "shape. This may consume a large amount of memory." % value)
Traceback (most recent call last):
  File "/root/rasa/rasa/engine/graph.py", line 467, in __call__
    output = self._fn(self._component, **run_kwargs)
  File "/root/rasa/rasa/core/policies/ted_policy.py", line 777, in train
    self.run_training(model_data, label_ids)
  File "/root/rasa/rasa/core/policies/ted_policy.py", line 740, in run_training
    shuffle=False, # we use custom shuffle inside data generator
  File "/root/rasa/rasa/utils/tensorflow/temp_keras_modules.py", line 190, in fit
    tmp_logs = train_function(iterator)
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/root/rasa/.venv/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 3 root error(s) found.
  (0) Invalid argument: Dimensions [0,1) of indices[shape=[17,2]] must match dimensions [0,1) of updates[shape=[24,50]]
    [[{{node cond_4/StatefulPartitionedCall/cond_4_20/then/_877/cond_4/ScatterNd}}]]
    [[div_no_nan_1/ReadVariableOp/_892]]
  (1) Invalid argument: Dimensions [0,1) of indices[shape=[17,2]] must match dimensions [0,1) of updates[shape=[24,50]]
    [[{{node cond_4/StatefulPartitionedCall/cond_4_20/then/_877/cond_4/ScatterNd}}]]
  (2) Invalid argument: Dimensions [0,1) of indices[shape=[17,2]] must match dimensions [0,1) of updates[shape=[24,50]]
    [[{{node cond_4/StatefulPartitionedCall/cond_4_20/then/_877/cond_4/ScatterNd}}]]
    [[update_0/AssignAddVariableOp/_845]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_52312]
Function call stack:
train_function -> train_function -> train_function

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run.py", line 54, in
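The `InvalidArgumentError` in the log is raised by a `ScatterNd` op. For `tf.scatter_nd(indices, updates, shape)`, `updates` must have the shape `indices.shape[:-1] + shape[indices.shape[-1]:]`, so the leading dimensions of `indices` and `updates` have to agree. The log reports 17 index rows against 24 update rows. The sketch below only checks that shape rule in plain Python. The helper name and the output shape `(24, 10, 50)` are assumptions made for illustration, and the sketch does not identify where in the TED model the two tensors diverge.

```python
from typing import Sequence, Tuple


def expected_updates_shape(
    indices_shape: Sequence[int], output_shape: Sequence[int]
) -> Tuple[int, ...]:
    """Shape that `updates` must have for a scatter_nd-style op.

    Hypothetical helper for illustration only; not a TensorFlow API.
    """
    index_depth = indices_shape[-1]
    if index_depth > len(output_shape):
        raise ValueError("index depth exceeds the rank of the output")
    return tuple(indices_shape[:-1]) + tuple(output_shape[index_depth:])


indices_shape = (17, 2)      # from the error message
updates_shape = (24, 50)     # from the error message
output_shape = (24, 10, 50)  # hypothetical, chosen only to fit the ranks

expected = expected_updates_shape(indices_shape, output_shape)
print(expected)                   # (17, 50)
print(expected == updates_shape)  # False
```

A mismatch like this would be consistent with the indices and the updates being derived from different slices of the batch once it is split across replicas, but that is a guess and would need to be confirmed against the model code.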