djiajunustc / TransVG


Pretrained BERT model not found #11

Closed: pqviet closed this issue 2 years ago

pqviet commented 2 years ago

I followed your Docker instructions for training and got the following error message:

Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.

Do you have any ideas?

djiajunustc commented 2 years ago

Hi @pqviet,

I can access it just by clicking the URL you raised above. Please try again. Thanks.

djiajunustc commented 2 years ago

Also, 'bert-base-uncased' is the first entry in the model name list you quoted.

pqviet commented 2 years ago

Thank you for your replies. I found that, by default, access to subdirectories under the home directory was not permitted. I changed the cache folder and was able to get the pretrained BERT model. Training started, then stopped with a new error: "RuntimeError: DataLoader worker (pid 157) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit." I use 8 GPUs with 16 GB of memory each. Is that not enough for training?
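
For readers who hit the same permission problem, here is a minimal sketch of picking a writable cache folder before the BERT weights are loaded. It assumes the loader reads the `PYTORCH_PRETRAINED_BERT_CACHE` environment variable, which may not hold for every library version; the candidate paths and the helper name `pick_writable_cache_dir` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def pick_writable_cache_dir(candidates):
    """Return the first candidate directory that can be created and written to."""
    for candidate in candidates:
        try:
            path = Path(candidate).expanduser()
            path.mkdir(parents=True, exist_ok=True)
            # Creating a temporary file proves that writes actually succeed.
            with tempfile.TemporaryFile(dir=path):
                pass
            return path
        except (OSError, RuntimeError):
            # Not creatable or not writable: try the next candidate.
            continue
    raise RuntimeError("no writable cache directory among the candidates")


# Hypothetical candidate locations; adjust them to the container's layout.
cache_dir = pick_writable_cache_dir([
    "~/.pytorch_pretrained_bert",
    os.path.join(tempfile.gettempdir(), "bert_cache"),
])

# Assumption: the BERT loader honours this variable. Set it before importing the library.
os.environ["PYTORCH_PRETRAINED_BERT_CACHE"] = str(cache_dir)
print(f"BERT cache directory: {cache_dir}")
```

If the environment variable is not honoured in a given setup, passing the chosen directory explicitly to the model loader is the other common route.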

djiajunustc commented 2 years ago

Hi @pqviet ,

Shared memory here means host (CPU) memory, not GPU memory. When you start the Docker container, you can add --shm-size=32Gb (or more) to enlarge the shared memory.
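
A quick way to confirm whether the container really is short on shared memory is to check the free space on `/dev/shm` from inside it. Docker gives a container 64 MB there by default unless `--shm-size` is set. The sketch below uses only the standard library; the helper name `shared_memory_free_gib` is hypothetical.

```python
import shutil
from pathlib import Path


def shared_memory_free_gib(path="/dev/shm"):
    """Return the free space at `path` in GiB, or None if the path does not exist."""
    if not Path(path).exists():
        return None
    return shutil.disk_usage(path).free / 2**30


free = shared_memory_free_gib()
if free is None:
    print("/dev/shm not found (probably not a Linux host)")
else:
    print(f"/dev/shm free: {free:.2f} GiB")
```

Running this before and after restarting the container with a larger `--shm-size` shows whether the flag took effect.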

pqviet commented 2 years ago

Thank you. I followed your setting and was able to train on RefCOCOg. Training on ReferIt still did not work because unzipping produced corrupted files.
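
One way to tell whether the download itself is damaged, rather than the extraction step, is to run the archive's CRC checks before unzipping. This is only a sketch and assumes the dataset ships as a zip archive; the helper name `first_bad_member` and the file name in the usage note are hypothetical.

```python
import zipfile


def first_bad_member(archive_path):
    """Check a zip archive's CRCs.

    Returns None if every member passes, the name of the first corrupt
    member otherwise, or the archive path itself if the file is not a
    readable zip at all.
    """
    try:
        with zipfile.ZipFile(archive_path) as archive:
            # testzip() reads every member and returns the first bad name, or None.
            return archive.testzip()
    except zipfile.BadZipFile:
        return str(archive_path)
```

For example, `first_bad_member("referit.zip")` returning anything other than `None` would suggest downloading the archive again before extracting it.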