BinWang28 / SBERT-WK-Sentence-Embedding

IEEE/ACM TASLP 2020: SBERT-WK: A Sentence Embedding Method By Dissecting BERT-based Word Models
Apache License 2.0

Missing file problem #2

Closed · None403 closed this issue 4 years ago

None403 commented 4 years ago

Hello, when I ran the code, it reported that three files could not be found. Could you please provide a download link? The three files are: added_tokens.json, special_tokens_map.json, tokenizer_config.json

BinWang28 commented 4 years ago

They should be downloaded automatically from the server by the following lines.

    from transformers import AutoConfig, AutoTokenizer, AutoModelWithLMHead

    # Download config, tokenizer, and weights from the server into ./cache
    config = AutoConfig.from_pretrained(params_senteval["model_type"], cache_dir='./cache')
    config.output_hidden_states = True  # expose all hidden layers, which SBERT-WK needs
    tokenizer = AutoTokenizer.from_pretrained(params_senteval["model_type"], cache_dir='./cache')
    model = AutoModelWithLMHead.from_pretrained(params_senteval["model_type"], config=config, cache_dir='./cache')

Can you paste your error here so I can debug? Have you created a folder named 'cache' in the main directory?
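
(If not, something like this will create it; just an illustration in Python, equivalent to mkdir cache:)

    import os

    # Create the cache folder in the main directory if it is missing
    os.makedirs('./cache', exist_ok=True)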

None403 commented 4 years ago

The script I ran is example2.sh. It seems other errors occurred. The full output follows:


None403 commented 4 years ago

    root@0cb791f65a07:~/SBERT-WK-Sentence-Embedding# ./example2.sh --model_type binwang/bert-base-nli --model_type binwang/bert-base-uncased --embed_method dissecting --max_seq_length 64 --batch_size 1 --context_window_size 2 --layer_start 4 --tasks sts

    2020-03-05 09:35:19,095 : Starting new HTTPS connection (1): s3.amazonaws.com:443
    2020-03-05 09:35:20,269 : https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/binwang/bert-base-uncased/config.json HTTP/1.1" 200 0
    2020-03-05 09:35:20,284 : loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/binwang/bert-base-uncased/config.json from cache at ./cache/19e969ebebc46506a7d80830232146353b99b1f30bff8aff6e115d2dcbcc4afd.913dd763a263b43d0803c5b4cd8e6810e129f390793e910ba19e547a266e6b6f
    2020-03-05 09:35:20,286 : Model config { "architectures": [ "BertForMaskedLM" ], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "do_sample": false, "eos_token_ids": 0, "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 3072, "is_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-12, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 12, "num_beams": 1, "num_hidden_layers": 12, "num_labels": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": true, "output_past": true, "pad_token_id": 0, "pruned_heads": {}, "repetition_penalty": 1.0, "temperature": 1.0, "top_k": 50, "top_p": 1.0, "torchscript": false, "type_vocab_size": 2, "use_bfloat16": false, "vocab_size": 30522 }
    2020-03-05 09:35:20,288 : Model name 'binwang/bert-base-uncased' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased). Assuming 'binwang/bert-base-uncased' is a path or url to a directory containing tokenizer files.
    2020-03-05 09:35:20,288 : Didn't find file binwang/bert-base-uncased/added_tokens.json. We won't load it.
    2020-03-05 09:35:20,288 : Didn't find file binwang/bert-base-uncased/special_tokens_map.json. We won't load it.
    2020-03-05 09:35:20,288 : Didn't find file binwang/bert-base-uncased/tokenizer_config.json. We won't load it.
    2020-03-05 09:35:20,290 : Starting new HTTPS connection (1): s3.amazonaws.com:443
    2020-03-05 09:35:21,521 : https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/binwang/bert-base-uncased/vocab.txt HTTP/1.1" 200 0
    2020-03-05 09:35:21,528 : loading file https://s3.amazonaws.com/models.huggingface.co/bert/binwang/bert-base-uncased/vocab.txt from cache at ./cache/2c727aa1d252b261a4f15e04ad1beec8403f40d2eab4fbe998f1ae804b522b06.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
    2020-03-05 09:35:21,530 : loading file None
    2020-03-05 09:35:21,530 : loading file None
    2020-03-05 09:35:21,530 : loading file None
    2020-03-05 09:35:21,801 : Starting new HTTPS connection (1): s3.amazonaws.com:443
    2020-03-05 09:35:22,963 : https://s3.amazonaws.com:443 "HEAD /models.huggingface.co/bert/binwang/bert-base-uncased/pytorch_model.bin HTTP/1.1" 200 0
    2020-03-05 09:35:22,971 : loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/binwang/bert-base-uncased/pytorch_model.bin from cache at ./cache/c1c9f3dafc802586d46285b1383200b5747305ab65e3c92b1a83c18ff82a1b37.e20ed098e5dc4a7be382b8fb2b1438a2271c71d5328590f86d29b64f2c0b23ac
    2020-03-05 09:35:29,571 : Weights from pretrained model not used in BertForMaskedLM: ['cls.predictions.decoder.bias']
    Traceback (most recent call last):
      File "sen_emb.py", line 67, in <module>
        model.to(params['device'])
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 426, in to
        return self._apply(convert)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 202, in _apply
        module._apply(fn)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in _apply
        param_applied = fn(param)
      File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 424, in convert
        return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    RuntimeError: CUDA error: out of memory

None403 commented 4 years ago

Hello, my friend! These three files download successfully: pytorch_model.bin, config.json, and vocab.txt. However, the following three files seem to have failed to download; if convenient, could you upload them again? Thanks!! (#^.^#): added_tokens.json, special_tokens_map.json, tokenizer_config.json

BinWang28 commented 4 years ago

Hi @None403, thanks for posting the results here.

Your error comes from running out of GPU memory. I would recommend using a GPU with at least 6 GB of memory, or switching to CPU. BERT is not a small model, so it requires a fair amount of GPU memory.

If you want it to run on CPU, which may be slow, you can change calls like XX.to("cuda") to XX.to("cpu"), or simply remove them.
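
For example, one way to fall back to CPU automatically is to pick the device with torch.cuda.is_available(). This is only a sketch, not the repo's exact code (sen_emb.py reads the device from params['device'], as the traceback shows):

    import torch
    from transformers import AutoModelWithLMHead

    # Use the GPU when one is available, otherwise fall back to CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # 'binwang/bert-base-uncased' is the model from the log above
    model = AutoModelWithLMHead.from_pretrained('binwang/bert-base-uncased', cache_dir='./cache')
    model.to(device)  # never raises the CUDA out-of-memory error on a CPU device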

BinWang28 commented 4 years ago

Solved.