Resolves #98 after the recent tokenization-related changes.
I removed the hacky modifications in `test_e2e_training_run_wout_ckpt`, which injected their own dataset into the training config being used. This makes it explicitly visible which resources this e2e test uses and what needs to be considered when touching those resources.
I added the hf-tokenizer instance that was referenced before but missing from the repo. I assume it was located on one of the DGX boxes in the original author's own setup. I highly encourage running the e2e tests not only locally but also e.g. in the CI or in a separate virtualized setup (if GPUs need to be verified and we don't have those in our CI yet) to avoid these "works on my machine" scenarios.
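For reference, a minimal sanity check that the tokenizer resource actually loads from the repo; the path `data/tokenizer/hf_gpt2` and the use of `transformers.AutoTokenizer` are assumptions about the setup, not the actual names:

```python
from transformers import AutoTokenizer

# Hypothetical path to the checked-in tokenizer resource; adjust to the real location.
tokenizer = AutoTokenizer.from_pretrained("data/tokenizer/hf_gpt2")
assert tokenizer.encode("works on my machine") != []
```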
I marked the SentencePiece tokenizer unit test as skipped until a loadable tokenizer resource is actually present (see my previous point). In contrast to the hf-tokenizer, I don't have a pretrained SentencePiece tokenizer at hand, so I would ask the author of this test (@mali-git) to provide one and re-enable the test.
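The skip looks roughly like this (test name and reason string are illustrative):

```python
import pytest


@pytest.mark.skip(reason="No pretrained SentencePiece tokenizer resource in the repo yet; re-enable once provided.")
def test_sentence_piece_tokenizer():
    ...
```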
I ran into a weird problem caused by the field `scheduler.config.total_steps` in the training config. Apparently this value was too low: the system kept trying to step past the configured value and crashed when it did. I don't know this field and couldn't find any docstring on it. My assumption is that it defines the expected number of training steps over which the loss adaptation is spread. I would expect it to be derived dynamically from the number of training samples chosen for a run (probably `settings.training.global_num_seen_samples`; @le1nux, I found you as the author of this field in the lorem_ipsum training config, so you might know more here).
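For illustration, a minimal sketch assuming the scheduler behaves like PyTorch's `OneCycleLR`, which raises a `ValueError` exactly when stepped beyond `total_steps` (whether that is the scheduler actually in use, and the field/batch-size relation below, are assumptions):

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical derivation: if total_steps were computed from the planned
# sample count instead of being hardcoded, it could never be too low.
global_num_seen_samples = 128  # assumed meaning of settings.training.global_num_seen_samples
global_batch_size = 16         # assumed effective batch size
total_steps = global_num_seen_samples // global_batch_size

scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=total_steps)
for _ in range(total_steps):
    optimizer.step()
    scheduler.step()  # stepping a (total_steps + 1)-th time would raise ValueError
```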
I found that the e2e test of the "text-only" setup (`test_e2e_training_run_wout_ckpt`) and the one of the CoCa architecture (`test_e2e_coca_training_run_without_checkpoint`) collide when using global resources like the NCCL context and the `rich.live.Live` object. Since both are handled in a singleton manner, I introduced a cleanup for each. For the NCCL context the changes were already present in the `CudaEnv` context manager, but commented out. @le1nux, I found you to be the author there: why were they commented out? Are you concerned that the NCCL cleanup could cause problems during the last iteration of a training run, e.g. if it happens before a proper checkpoint could be written? @spravil, regarding the e2e tests: this is probably not the last time these two collide, so this could be interesting for you to know as well.
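A minimal sketch of the kind of cleanup involved; the function name is made up (the actual NCCL part lives in the `CudaEnv` context manager), but the two teardown calls are the standard APIs for releasing these singletons:

```python
from typing import Optional

import torch.distributed as dist
from rich.live import Live


def cleanup_global_resources(live: Optional[Live] = None) -> None:
    # Tear down the process group so the next test can re-initialize NCCL.
    if dist.is_initialized():
        dist.destroy_process_group()
    # Stop the live display so the next test can start its own rich.live.Live.
    if live is not None and live.is_started:
        live.stop()
```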