-
http://fyubang.com/2019/07/08/distributed-training/
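瓦砾 Because I have been using bert-large a lot recently, I ran into many pitfalls with distributed training, and after switching between TensorFlow and PyTorch I have become reasonably familiar with the distributed training interfaces of the various frameworks. Covering everything in one place would be confusing, so I plan to write three or four posts on distributed training for deep learning. This first post covers "types and algorithms of distributed training". Distributed training…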
-
I get this error as soon as the following log line appears:
INFO:root:Writing example 0 of 9067886
I am running the standard code on AWS SageMaker with PyTorch. Both the error stack and the code used are past…
-
**Describe the bug**
Model I am using UniLM:
I use the following code to load the model.
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("micro…
-
### Bug description
When we train an XLNet model with `CLM` masking, the model prints out its own evaluation metrics (ndcg@k, recall@k, etc.) from the `trainer.evaluate()` step. If we want to apply our o…
-
Hi, I'm following the guide, and everything seems to work except when I'm creating a predictor object I get:
File "bert.py", line 63, in
do_lower_case=False)
File "/home/w3pt/.local/lib/…
-
This could be a good first issue to contribute towards this repo. Comment below if you want to help and I will get you started. There is a long list to migrate so the more hands we have working on thi…
-
Hello,
I'm getting another error in the next step in the process, do you have an idea why this might be? I tried to debug a little bit myself, but I wasn't easily able to find where the error was b…
-
When I train a fastbert model and save it using save_and_reload(), the model's output is not consistent with its output before saving.
Code to reproduce:
```
from fast_bert import BertClas…
-
I have two questions.
First: running train_allennlp_local.sh fails with this error:
ModuleNotFoundError: No module named 'scibert'
YML@Spuer-HR:~/Jiaxin/scibert-master/scripts$ ./train_allennlp_lo…
-
## Environment info
- `transformers` version: `4.6.0.dev0`
- Platform: `CentOS Linux release 7.7.1908 (Core)`
- Python version: `3.8.5`
- PyTorch version: `1.8.1 + cuda 10.2`
- Tensorflow versi…