Closed by pqviet 2 years ago
Hi @pqviet,
I can access it just by clicking the URL you raised above. Please try again. Thanks.
Besides, 'bert-base-uncased' is the first one that you mentioned in the model name list.
Thank you for your replies. I found that, by default, access to subdirectories under the home directory was not permitted. I changed the cache folder and was able to get the pretrained BERT model. Training started, then stopped with a new error: "RuntimeError: DataLoader worker (pid 157) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit." I use 8 GPUs with 16 GB of memory each. Is that not enough for training?
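For anyone hitting the same permission problem, here is a minimal sketch of how a writable cache folder could be picked before loading the model. The helper name `first_writable_dir` and the candidate paths are hypothetical, not part of this repository; whether your BERT library accepts a `cache_dir` argument depends on its version, so treat that part as an assumption.

```python
import os
import tempfile
from pathlib import Path


def first_writable_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        try:
            path.mkdir(parents=True, exist_ok=True)
            # Create a real temporary file to prove the directory is writable.
            with tempfile.TemporaryFile(dir=path):
                pass
            return path
        except OSError:
            # Covers permission errors and invalid paths; try the next candidate.
            continue
    raise RuntimeError("no writable cache directory among the candidates")


# Hypothetical candidate locations; adjust them to your container layout.
cache_dir = first_writable_dir([
    "~/.pytorch_pretrained_bert",
    os.path.join(tempfile.gettempdir(), "bert_cache"),
])
print(f"Using cache directory: {cache_dir}")
# The chosen path could then be handed to the model loader, for example as a
# cache_dir argument if your BERT library version accepts one (assumption).
```

The helper writes a temporary file instead of only checking permission bits, since a permission check alone can be misleading inside containers with restricted mounts.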
Hi @pqviet,
Shared memory is host (CPU) memory, not GPU memory, so the 16 GB per GPU is not the limit here. When you start the Docker container, you can add --shm-size=32Gb (or more) to enlarge the shared memory.
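To confirm that the new limit took effect, the size of the shared-memory mount can be checked from inside the container. This is a minimal sketch assuming a Linux container where shared memory is mounted at `/dev/shm`; the function name `shared_memory_report` is hypothetical.

```python
import shutil
from pathlib import Path


def shared_memory_report(path="/dev/shm"):
    """Return (total, free) bytes for the shared-memory mount, or None if absent."""
    if not Path(path).is_dir():
        return None  # e.g. a non-Linux host without /dev/shm
    usage = shutil.disk_usage(path)
    return usage.total, usage.free


report = shared_memory_report()
if report is None:
    print("/dev/shm not found on this system")
else:
    total, free = report
    print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")
```

If the reported total is still only about 64 MiB, which is Docker's default, the container was most likely started without the --shm-size option.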
Thank you. I followed your setting and was able to train on RefCOCOg. Training on referit still did not work because the files were corrupted during unzipping.
I followed your Docker instructions for training and got the following error message:
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Do you have any ideas?