-
I am trying to continue training the pre-trained FullSubNet model provided by this repo:
[fullsubnet_best_model_58epochs.tar](https://github.com/haoxiangsnr/FullSubNet/releases/download/v0.2/fullsu…
-
environment:
CUDA 1.17
TensorFlow 2.14
code:
https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_keras_main.py
command:
python3 /LLM/models/official/recommendation/…
-
AdaNet doesn't currently support [`tf.distribute.Strategy`](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy). The current way to define distributed training is using a [`tf.estimator…
-
### Bug description
I followed [this](https://curiousily.com/posts/multi-label-text-classification-with-bert-and-pytorch-lightning/) tutorial to build a Lightning model for multi-label text classific…
-
Hi,
I've seen some strange behavior when training on a TPU (v3-8 from TFRC). After 600k steps (using the default parameters for a base model), training got stuck. I could see two different types of er…
-
### Feature Area
Federated learning is crucial for hemodialysis patient data analysis. Its benefits are twofold. First, it can help predict abrupt pressure drops, which are lethal during Hemodialys…
-
# Summary
There may be areas of expertise and learning where professional training could help our team grow their skills more quickly. We should have a process for team training that answers the fo…
-
**Your question**
I ran pretrain_gpt with the same architecture, data, training hyperparameters, and hardware, with and without megatron_core when building the model.
I notice clearly **worse wall clock time a…
-
**Describe the bug**
I'm trying to use DeepSpeed to fine-tune a BERT-based classification model, but when I launch multi-node training, all nodes, including localhost, get errno: 110 - Connection t…
-
Super cool and amazing work!
I am writing to ask for your assistance with an issue I am encountering while training a model using A6000 GPUs. I am using the following command to run my code:
```…