Missing config_gen yml files

Hi, I tried to run Posterior distillation without M-BERT finetuning and got the following error:

Traceback (most recent call last):
  File "train_with_teacher.py", line 104, in <module>
    teachers=teacher_func()
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 235, in create_teachers_list
    config=Params.from_file(filename)
  File "/home/mlej8/projects/MultilangStructureKD/flair/utils/params.py", line 102, in from_file
    with open(params_file, encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'config_gen/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_fast_nodev_ner12.yaml'

I looked around the repo and did not find any config_gen directory. Is it possible that these files weren't uploaded?
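Before launching a run, it can help to check that every teacher config the parser will try to open actually exists. The following is a minimal sketch; `missing_configs` is a hypothetical helper and not part of the repository, and the list of paths is assumed to be collected by hand.

```python
from pathlib import Path


def missing_configs(paths):
    """Return the subset of `paths` that do not exist as files on disk."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Path copied from the FileNotFoundError above.
    wanted = [
        "config_gen/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_"
        "monolingual_crf_sentloss_10patience_baseline_fast_nodev_ner12.yaml",
    ]
    for path in missing_configs(wanted):
        print("missing:", path)
```

Running it from the repository root lists any config that would trigger the same FileNotFoundError.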

Hi, sorry about that. I have fixed the config files and uploaded some of the teacher-model config files to config.

Note that you need to train the teacher models first (for example, config/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_fast_nodev_ner12.yaml). Please check the guide to training teacher models in README.md.

Please contact me if there is still any problem.

Hi Xinyu,

Thanks for your prompt response.

The requirements.txt file also caused conflicts when installing dependencies. To reproduce the errors:

conda create --name name_of_env python=3.6.12
conda activate name_of_env
pip install -r requirements.txt

I believe the following pins should be updated: urllib3==1.25.10, mxnet==1.5.0, numpy==1.16.1.

The old version of numpy (1.14.6) is not compatible with mxnet==1.4.1, which requires numpy > 1.15.0.
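The version constraint quoted above can be checked with a few lines of Python. This is only a sketch: `parse_version` and `newer_than` are hypothetical helpers, they handle purely numeric dotted versions, and the bound `numpy > 1.15.0` is taken from the message above rather than from mxnet's own metadata.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.16.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def newer_than(installed, bound):
    """True when `installed` is strictly greater than `bound`."""
    return parse_version(installed) > parse_version(bound)


# Bound as reported in this thread: numpy must be newer than 1.15.0.
print(newer_than("1.14.6", "1.15.0"))  # old pin
print(newer_than("1.16.1", "1.15.0"))  # proposed pin
```

The first call prints False and the second prints True, which matches the conflict described.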

As for KD, I am using the teacher models provided in the Google Drive as mentioned in the README.md.

Also while training Posterior distillation without M-BERT finetuning I received the following error:

[2021-02-05 10:28:44,691 INFO] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin from cache at /home/mlej8/.cache/torch/pytorch_transformers/5b5b80054cd2c95a946a8e0ce0b93f56326dff9fbda6a6c3e02de3c91c918342.7131dcb754361639a7d5526985f880879c9bfd144b65a0bf50590bddb7de9059
Traceback (most recent call last):
  File "train_with_teacher.py", line 104, in <module>
    teachers=teacher_func()
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 237, in create_teachers_list
    teacher_model=self.create_model(config, pretrained=True)
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 165, in create_model
    embeddings, word_map, char_map=self.create_embeddings(config['embeddings'])
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 146, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/mlej8/projects/MultilangStructureKD/flair/embeddings.py", line 2201, in __init__
    model = cached_path(base_path, cache_dir=cache_dir)
  File "/home/mlej8/projects/MultilangStructureKD/flair/file_utils.py", line 88, in cached_path
    return get_from_cache(url_or_filename, dataset_cache)
  File "/home/mlej8/projects/MultilangStructureKD/flair/file_utils.py", line 164, in get_from_cache
    f"HEAD request failed for url {url} with status code {response.status_code}."
OSError: HEAD request failed for url https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-mix-german-forward-v0.2rc.pt with status code 301.

This tells us that the flair library is requesting a file at https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/embeddings/lm-mix-german-forward-v0.2rc.pt that has been permanently moved (HTTP 301). I went on flair's GitHub repo and saw the following:

Therefore, I think flair might need to be updated to solve this issue.

Looking at the source code of the flair library:

https://github.com/flairNLP/flair/blob/3fd32f7f3ea32df82bc569a706ab55550aac7338/flair/embeddings/token.py#L373

The FlairEmbeddings class has indeed been updated to point to the new server:

While the old FlairEmbeddings in your codebase is still pointing to the old amazon server:

I'll be happy to make a pull request if needed, but I think it would be best if you could update the flair library in your source code, thank you :)

After updating the FlairEmbeddings class, I am encountering the following issue:

  File "/home/mlej8/projects/MultilangStructureKD/train_with_teacher.py", line 104, in <module>
    teachers=teacher_func()
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 237, in create_teachers_list
    teacher_model=self.create_model(config, pretrained=True)
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 165, in create_model
    embeddings, word_map, char_map=self.create_embeddings(config['embeddings'])
  File "/home/mlej8/projects/MultilangStructureKD/flair/config_parser.py", line 146, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/mlej8/projects/MultilangStructureKD/flair/embeddings.py", line 2285, in __init__
    embedded_dummy = self.embed(dummy_sentence)
  File "/home/mlej8/projects/MultilangStructureKD/flair/embeddings.py", line 103, in embed
    self._add_embeddings_internal(sentences)
  File "/home/mlej8/projects/MultilangStructureKD/flair/embeddings.py", line 2332, in _add_embeddings_internal
    text_sentences, start_marker, end_marker, self.chars_per_chunk
TypeError: get_representation() takes from 2 to 3 positional arguments but 5 were given

After further inspection, I found that within your language_model class in flair/models/language_model.py, the get_representation function declaration is https://github.com/Alibaba-NLP/MultilangStructureKD/blob/017b65fd12a34f64f6d0894be31f86df08694724/flair/models/language_model.py#L102

But the call to get the hidden states from the language model is

Therefore, I think this method's declaration needs to be updated.
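The TypeError can be reproduced in isolation. In the sketch below both classes are hypothetical stand-ins, not the repository's code: the "old" one has one required and one optional parameter besides self (which is what "takes from 2 to 3 positional arguments" implies), and the "new" one accepts the four values that the call in the traceback passes. The name `option` is invented for illustration.

```python
class OldLanguageModel:
    # Hypothetical stand-in: self + one required + one optional argument.
    def get_representation(self, strings, option=None):
        return strings


class NewLanguageModel:
    # Hypothetical stand-in accepting the four values the caller passes.
    def get_representation(self, strings, start_marker, end_marker, chars_per_chunk=512):
        return [start_marker + s + end_marker for s in strings]


# Same argument shape as the call in the traceback:
# (text_sentences, start_marker, end_marker, self.chars_per_chunk)
args = (["ein Satz"], "\n", " ", 512)

try:
    OldLanguageModel().get_representation(*args)
except TypeError as err:
    print("old signature:", err)

print("new signature:", NewLanguageModel().get_representation(*args))
```

The caller and the method have to be updated together, which is why changing only one part of the class leads to this error.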

I downloaded the monolingual models from the Google Drive provided in the README, and it seems that the folder names don't match the one referenced above.

"multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0" is the folder name on the Google Drive

I think you may only update the link to the code server but not the whole class

You need to change the 'teachers' entry in the config file so that it points to the config file of 'multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0', which is 'config/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0.yaml'.

Hi, thanks for your feedback. I have updated the download links in FlairEmbeddings and FastWordEmbeddings, and updated requirements.txt.

Hi Xin Yu,

Thanks for the updates. Please note that WordEmbeddings also needs new paths:

I will update you on the status of my reproduction :)

Yes, the update contains the new path for WordEmbeddings.

Thank you for the update, Xin Yu. I am a bit confused: in the README, the command for training the Multilingual Model without M-BERT finetuning for Top-K is the same as the command for Top-WK. Is it possible that one of them is using the wrong config file?

Hi, sorry for the mistake. I have uploaded the correct config file for Top-WK. Please check README.md

There seems to be an issue in the SequenceTagger class in sequence_tagger_model.py. When I run experiments with Posterior KD without M-BERT finetuning, I get:

Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 294, in train
    train_data=self.assign_pretrained_teacher_targets(coupled_train_data,self.teachers,best_k=best_k)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 756, in assign_pretrained_teacher_targets
    backward_var = teacher._backward_alg(logits, lengths1)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 1034, in _backward_alg
    if self.enhanced_crf:
  File "/home/michael1441/.conda/envs/KD/lib/python3.6/site-packages/torch/nn/modules/module.py", line 576, in __getattr__
    type(self).__name__, name))
AttributeError: 'SequenceTagger' object has no attribute 'enhanced_crf'

Looking at the SequenceTagger class, we can see that this attribute has not been defined.
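One common way to tolerate an attribute that some model objects lack is a `getattr` call with a default. The sketch below uses a plain hypothetical class, not the real SequenceTagger, and whether the maintainer's eventual fix takes this form is not shown in the thread.

```python
class Tagger:
    """Hypothetical stand-in for a model object created without `enhanced_crf`."""

    def __init__(self, use_crf=True):
        self.use_crf = use_crf

    def uses_enhanced_crf(self):
        # Fall back to False when the attribute was never set on this instance.
        return getattr(self, "enhanced_crf", False)


tagger = Tagger()
print(tagger.uses_enhanced_crf())

tagger.enhanced_crf = True
print(tagger.uses_enhanced_crf())
```

Because `getattr` with a default swallows AttributeError, the same pattern would also apply to an object whose `__getattr__` raises AttributeError, as in the traceback above.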

I keep encountering this error while running the experiments:

Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 294, in train
    train_data=self.assign_pretrained_teacher_targets(coupled_train_data,self.teachers,best_k=best_k)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 750, in assign_pretrained_teacher_targets
    logits=teacher.forward(teacher_input)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/embeddings.py", line 169, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/tmp/MultilangStructureKD/flair/embeddings.py", line 3636, in _add_embeddings_internal
    mean = torch.mean(torch.cat(embeddings, dim=0), dim=0)
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat.  This usually means that this function requires a non-empty list of Tensors.  Available functions are [CUDATensorId, CPUTensorId, VariableTensorId]

Basically, the problem is in the function _add_embeddings_internal(sentences) of BertEmbeddings, within the following loop:

# get the current sentence object
token_idx = 0
for posidx, token in enumerate(sentence):
    # add concatenated embedding to sentence
    token_idx += 1

    if self.pooling_operation == "first":
        # use first subword embedding if pooling operation is 'first'
        token.set_embedding(self.name, subtoken_embeddings[token_idx])
    else:
        # otherwise, do a mean over all subwords in token
        embeddings = subtoken_embeddings[
            token_idx : token_idx
            + feature.token_subtoken_count[token.idx]
        ]
        embeddings = [
            embedding.unsqueeze(0) for embedding in embeddings
        ]
        mean = torch.mean(torch.cat(embeddings, dim=0), dim=0)
        token.set_embedding(self.name, mean)

    token_idx += feature.token_subtoken_count[token.idx] - 1

When token_idx exceeds 512, the slice of subtoken_embeddings is empty, so embeddings = []. The empty list is then passed to torch.cat([], dim=0), which raises this error.
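The mechanism can be shown without torch. In this sketch a plain Python list stands in for the per-subtoken vectors, and the values of `token_idx` and `subtoken_count` are hypothetical, chosen only to land past the end of a 512-entry sequence.

```python
# Stand-in for the per-subtoken vectors of one sentence, capped at 512 entries.
subtoken_embeddings = list(range(512))

token_idx = 513       # index has run past the end of the sequence
subtoken_count = 2    # hypothetical number of subtokens for the current token

window = subtoken_embeddings[token_idx : token_idx + subtoken_count]
print(window)         # slicing past the end yields an empty list, not an error
```

The slice itself fails silently, so the first visible symptom is the later concatenation over an empty list.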

A possible solution is to manually split the sentences in the training set that have more than 512 subtokens. Alternatively, you may use TransformerWordEmbeddings instead of BertEmbeddings.
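The manual split could look roughly like the sketch below. `split_by_subtoken_budget` is a hypothetical helper, not part of the repository; it assumes the subtoken count of each token is already known (for example from the tokenizer), and the default budget of 510 assumes two positions are reserved for special tokens.

```python
def split_by_subtoken_budget(tokens, subtoken_counts, max_subtokens=510):
    """Greedily cut `tokens` into chunks whose subtoken totals fit the budget.

    `subtoken_counts[i]` is the number of subword pieces for `tokens[i]`.
    """
    chunks, current, used = [], [], 0
    for token, count in zip(tokens, subtoken_counts):
        if count > max_subtokens:
            raise ValueError(f"token {token!r} alone exceeds the budget")
        if current and used + count > max_subtokens:
            # Close the current chunk before it would overflow.
            chunks.append(current)
            current, used = [], 0
        current.append(token)
        used += count
    if current:
        chunks.append(current)
    return chunks


# Toy example with a budget of 5 subtokens per chunk.
print(split_by_subtoken_budget(["a", "b", "c", "d"], [2, 3, 4, 1], max_subtokens=5))
```

A greedy cut like this may land inside an entity span, so for NER data it is worth checking where the boundaries fall.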

I have fixed the enhanced_crf issue; please check the code.

Thank you for your prompt response, I will try using TransformerWordEmbeddings. I believe I can change the embeddings in the config files:

Can I simply change it to:

embeddings:
  TransformerWordEmbeddings:
    bert_model_or_path: bert-base-multilingual-cased
    layers: '-1'
    pooling_operation: mean

Could you please confirm this is correct?

You also need to change bert_model_or_path to model.

If I use TransformerWordEmbeddings, I would need to retrain the teacher models. I tried training the DE teacher model using TransformerWordEmbeddings and got the following:

2021-02-13 00:02:14,701 ----------------------------------------------------------------------------------------------------
2021-02-13 00:02:14,701 Corpus: "Corpus: 12705 train + 3068 dev + 3160 test sentences"
2021-02-13 00:02:14,701 ----------------------------------------------------------------------------------------------------
2021-02-13 00:02:14,701 Parameters:
2021-02-13 00:02:14,702  - learning_rate: "0.1"
2021-02-13 00:02:14,702  - mini_batch_size: "2000"
2021-02-13 00:02:14,702  - patience: "10"
2021-02-13 00:02:14,702  - anneal_factor: "0.5"
2021-02-13 00:02:14,702  - max_epochs: "300"
2021-02-13 00:02:14,702  - shuffle: "True"
2021-02-13 00:02:14,702  - train_with_dev: "False"
2021-02-13 00:02:14,702 ----------------------------------------------------------------------------------------------------
2021-02-13 00:02:14,702 Model training base path: "resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0"
2021-02-13 00:02:14,702 ----------------------------------------------------------------------------------------------------
2021-02-13 00:02:14,702 Device: cuda:0
2021-02-13 00:02:14,702 ----------------------------------------------------------------------------------------------------
2021-02-13 00:02:14,702 Embeddings storage mode: cpu
2021-02-13 00:02:15,598 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 396, in train
    loss = self.model.forward_loss(student_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 528, in forward_loss
    features = self.forward(data_points)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 702, in forward
    sentence_tensor = self.embedding2nn(sentence_tensor)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: size mismatch, m1: [1980 x 4396], m2: [5164 x 5164] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:290
> /home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py(410)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) 
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit
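One observation on the shapes in the RuntimeError above, offered as a sketch and not as a diagnosis from the thread: the difference between the width the linear layer expects and the width it receives can be computed directly.

```python
# Shapes copied from the RuntimeError above.
input_features = 4396      # second dimension of m1
expected_features = 5164   # dimensions of the square weight m2

gap = expected_features - input_features
print(gap)
```

This prints 768, which is the hidden size of bert-base models such as bert-base-multilingual-cased. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the transformer embedding is counted in the model's declared input width but is missing from the tensor that is actually built.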

In fact, you do not need to train the teacher model again. Simply keep the teacher config file unchanged. As for this problem, could you show me the config file?

Here is the config file that caused the error when I tried to train DE teacher model with TransformerWordEmbeddings.

ModelDistiller:
  distill_mode: false
  train_with_professor: false
anneal_factor: 2
embeddings:
  TransformerWordEmbeddings:
    model: bert-base-multilingual-cased
    layers: '-1'
  FlairEmbeddings-1:
    model: de-forward
  FlairEmbeddings-2:
    model: de-backward
  WordEmbeddings:
    embeddings: de
interpolation: 0.5
is_teacher_list: true
model:
  SequenceTagger:
    hidden_size: 256
    sentence_loss: true
    use_crf: true
model_name: multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0
ner:
  Corpus: CONLL_03_GERMAN
  professors:
    config/single-de-ner.yaml: CONLL_03_GERMAN
    config/single-en-ner.yaml: CONLL_03
    config/single-es-ner.yaml: CONLL_03_SPANISH
    config/single-nl-ner.yaml: CONLL_03_DUTCH
  tag_dictionary: resources/taggers/ner_tags.pkl
  teachers:
    config/multi_bert_flair_2000batch_1lr_de_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml: CONLL_03_GERMAN
    config/multi_bert_flair_2000batch_1lr_en_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml: CONLL_03
    config/multi_bert_flair_2000batch_1lr_es_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner0.yaml: CONLL_03_SPANISH
    config/multi_bert_flair_2000batch_1lr_nl_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml: CONLL_03_DUTCH
target_dir: resources/taggers/
targets: ner
teacher_annealing: false
train:
  learning_rate: 0.1
  max_epochs: 300
  mini_batch_size: 2000
  monitor_test: false
  patience: 10
  professor_interpolation: 0.5
  save_final_model: false
  train_with_dev: false
upos:
  Corpus: UD_GERMAN:UD_ENGLISH:UD_FRENCH:UD_ITALIAN:UD_DUTCH:UD_SPANISH:UD_PORTUGUESE:UD_CHINESE
  UD_GERMAN:
    train_config: config/
  tag_dictionary: resources/taggers/pos_tags.pkl

This is the log from trying Posterior distillation without M-BERT finetuning

2021-02-13 00:39:16,861 Reading data from /home/michael1441/.flair/datasets/conll_03_dutch
2021-02-13 00:39:16,861 Train: /home/michael1441/.flair/datasets/conll_03_dutch/ned.train
2021-02-13 00:39:16,861 Dev: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testa
2021-02-13 00:39:16,861 Test: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testb
2021-02-13 00:39:16,861 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.train ... using "latin-1" instead.
2021-02-13 00:39:20,282 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testb ... using "latin-1" instead.
2021-02-13 00:39:21,359 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testa ... using "latin-1" instead.
2021-02-13 00:39:21,734 Reading data from /home/michael1441/.flair/datasets/conll_03_spanish
2021-02-13 00:39:21,734 Train: /home/michael1441/.flair/datasets/conll_03_spanish/esp.train
2021-02-13 00:39:21,734 Dev: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testa
2021-02-13 00:39:21,735 Test: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testb
2021-02-13 00:39:21,735 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.train ... using "latin-1" instead.
2021-02-13 00:39:26,111 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testb ... using "latin-1" instead.
2021-02-13 00:39:26,525 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testa ... using "latin-1" instead.
2021-02-13 00:39:26,956 Reading data from /home/michael1441/.flair/datasets/conll_03
2021-02-13 00:39:26,956 Train: /home/michael1441/.flair/datasets/conll_03/eng.train
2021-02-13 00:39:26,956 Dev: /home/michael1441/.flair/datasets/conll_03/eng.testa
2021-02-13 00:39:26,956 Test: /home/michael1441/.flair/datasets/conll_03/eng.testb
2021-02-13 00:39:33,951 Reading data from /home/michael1441/.flair/datasets/conll_03_german
2021-02-13 00:39:33,951 Train: /home/michael1441/.flair/datasets/conll_03_german/deu.train
2021-02-13 00:39:33,951 Dev: /home/michael1441/.flair/datasets/conll_03_german/deu.testa
2021-02-13 00:39:33,951 Test: /home/michael1441/.flair/datasets/conll_03_german/deu.testb
2021-02-13 00:39:33,951 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.train ... using "latin-1" instead.
2021-02-13 00:39:38,848 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.testb ... using "latin-1" instead.
2021-02-13 00:39:42,040 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.testa ... using "latin-1" instead.
2021-02-13 00:39:44,260 {b'<unk>': 0, b'O': 1, b'B-PER': 2, b'E-PER': 3, b'S-LOC': 4, b'B-MISC': 5, b'I-MISC': 6, b'E-MISC': 7, b'S-MISC': 8, b'S-PER': 9, b'B-ORG': 10, b'E-ORG': 11, b'S-ORG': 12, b'I-ORG': 13, b'B-LOC': 14, b'E-LOC': 15, b'I-PER': 16, b'I-LOC': 17, b'<START>': 18, b'<STOP>': 19}
2021-02-13 00:39:44,261 Corpus: 51821 train + 11344 dev + 13556 test sentences
2021-02-13 00:39:54.091395: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-13 00:39:54.091451: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:39:59,295 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M
[2021-02-13 00:40:00,506 INFO] loading vectors from /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:00,962 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:00,962 INFO] loaded /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M
2021-02-13 00:40:01,248 Loading pretraining best model
2021-02-13 00:40:01,248 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:15,667 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M
[2021-02-13 00:40:16,761 INFO] loading vectors from /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:17,212 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:17,212 INFO] loaded /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M
2021-02-13 00:40:17,499 Loading pretraining best model
2021-02-13 00:40:17,499 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_en_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:33,065 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M
[2021-02-13 00:40:34,321 INFO] loading vectors from /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:34,766 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:34,767 INFO] loaded /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M
2021-02-13 00:40:35,060 Loading pretraining best model
2021-02-13 00:40:35,060 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_es_monolingual_crf_sentloss_10patience_baseline_nodev_ner1/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:45,102 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M
[2021-02-13 00:40:52,235 INFO] loading vectors from /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:52,657 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:52,657 INFO] loaded /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M
2021-02-13 00:40:52,941 Loading pretraining best model
2021-02-13 00:40:52,941 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_nl_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
2021-02-13 00:42:18,237 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Model: "FastSequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): TransformerWordEmbeddings(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(119547, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (1): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (2): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (3): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (4): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (5): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (6): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (7): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (8): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (9): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (10): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (11): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
        (pooler): BertPooler(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (activation): Tanh()
        )
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=768, out_features=768, bias=True)
  (rnn): LSTM(768, 600, bidirectional=True)
  (linear): Linear(in_features=1200, out_features=20, bias=True)
)"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Corpus: "Corpus: 51821 train + 11344 dev + 13556 test sentences"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Parameters:
2021-02-13 00:42:18,239  - learning_rate: "0.1"
2021-02-13 00:42:18,239  - mini_batch_size: "2000"
2021-02-13 00:42:18,239  - patience: "10"
2021-02-13 00:42:18,239  - anneal_factor: "0.5"
2021-02-13 00:42:18,239  - max_epochs: "300"
2021-02-13 00:42:18,239  - shuffle: "True"
2021-02-13 00:42:18,239  - train_with_dev: "False"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Model training base path: "resources/taggers/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0"
2021-02-13 00:42:18,240 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Device: cuda:0
2021-02-13 00:42:18,240 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Embeddings storage mode: cpu
2021-02-13 00:42:18,241 Distilling sentences as targets...
[2021-02-13 00:47:19,728 WARNING] Token indices sequence length is longer than the specified maximum sequence length for this model (916 > 512). Running this sequence through the model will result in indexing errors
[2021-02-13 00:47:19,777 WARNING] Token indices sequence length is longer than the specified maximum sequence length for this model (3710 > 512). Running this sequence through the model will result in indexing errors
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[the same assertion is repeated for threads [97,0,0] through [127,0,0] and [0,0,0] through [31,0,0] of block [149,0,0]]
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 294, in train
    train_data=self.assign_pretrained_teacher_targets(coupled_train_data,self.teachers,best_k=best_k)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 750, in assign_pretrained_teacher_targets
    logits=teacher.forward(teacher_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 667, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 177, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 89, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 3390, in _add_embeddings_internal
    all_encoder_layers = self.model(all_input_ids, attention_mask=all_input_masks)[
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 712, in forward
    embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 268, in forward
    embeddings = words_embeddings + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered
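
The two warnings above (916 > 512 and 3710 > 512) suggest that some inputs are longer than the 512 positions the model supports, which would likely explain the `srcIndex < srcSelectDimSize` assertion raised in the embedding lookup. A minimal sketch for locating such inputs before distillation starts, assuming a hypothetical helper `find_overlong` and a whitespace split standing in for the real subword tokenizer (neither is part of the repo):

```python
from typing import Callable, List, Sequence, Tuple

MAX_POSITIONS = 512   # limit reported in the warning above
SPECIAL_TOKENS = 2    # [CLS] and [SEP], which BERT adds around each input


def find_overlong(
    sentences: Sequence[str],
    tokenize: Callable[[str], List[str]],
    max_positions: int = MAX_POSITIONS,
) -> List[Tuple[int, int]]:
    """Return (sentence index, token count) for inputs that would overflow."""
    budget = max_positions - SPECIAL_TOKENS
    overlong = []
    for i, text in enumerate(sentences):
        n = len(tokenize(text))
        if n > budget:
            overlong.append((i, n))
    return overlong


# Toy stand-in tokenizer: whitespace split. With the real model, pass the
# subword tokenizer's tokenize function instead (assumption, not repo code).
toy_corpus = ["short sentence", " ".join(["tok"] * 916), " ".join(["tok"] * 510)]
print(find_overlong(toy_corpus, str.split))
```

Sentences flagged this way could then be split or dropped before they reach the teacher model. Whether that matches how the repo is meant to handle long inputs is an open question.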

You can look at my forked repo here: https://github.com/mlej8/MultilangStructureKD. The config files that I used are:

config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0.yaml
config/multi_bert_flair_2000batch_1lr_de_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml
config/multi_bert_flair_2000batch_1lr_en_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml
config/multi_bert_flair_2000batch_1lr_es_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner0.yaml
config/multi_bert_flair_2000batch_1lr_nl_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml

This is the log from trying Posterior distillation without M-BERT finetuning:

2021-02-13 00:39:16,861 Reading data from /home/michael1441/.flair/datasets/conll_03_dutch
2021-02-13 00:39:16,861 Train: /home/michael1441/.flair/datasets/conll_03_dutch/ned.train
2021-02-13 00:39:16,861 Dev: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testa
2021-02-13 00:39:16,861 Test: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testb
2021-02-13 00:39:16,861 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.train ... using "latin-1" instead.
2021-02-13 00:39:20,282 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testb ... using "latin-1" instead.
2021-02-13 00:39:21,359 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_dutch/ned.testa ... using "latin-1" instead.
2021-02-13 00:39:21,734 Reading data from /home/michael1441/.flair/datasets/conll_03_spanish
2021-02-13 00:39:21,734 Train: /home/michael1441/.flair/datasets/conll_03_spanish/esp.train
2021-02-13 00:39:21,734 Dev: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testa
2021-02-13 00:39:21,735 Test: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testb
2021-02-13 00:39:21,735 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.train ... using "latin-1" instead.
2021-02-13 00:39:26,111 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testb ... using "latin-1" instead.
2021-02-13 00:39:26,525 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_spanish/esp.testa ... using "latin-1" instead.
2021-02-13 00:39:26,956 Reading data from /home/michael1441/.flair/datasets/conll_03
2021-02-13 00:39:26,956 Train: /home/michael1441/.flair/datasets/conll_03/eng.train
2021-02-13 00:39:26,956 Dev: /home/michael1441/.flair/datasets/conll_03/eng.testa
2021-02-13 00:39:26,956 Test: /home/michael1441/.flair/datasets/conll_03/eng.testb
2021-02-13 00:39:33,951 Reading data from /home/michael1441/.flair/datasets/conll_03_german
2021-02-13 00:39:33,951 Train: /home/michael1441/.flair/datasets/conll_03_german/deu.train
2021-02-13 00:39:33,951 Dev: /home/michael1441/.flair/datasets/conll_03_german/deu.testa
2021-02-13 00:39:33,951 Test: /home/michael1441/.flair/datasets/conll_03_german/deu.testb
2021-02-13 00:39:33,951 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.train ... using "latin-1" instead.
2021-02-13 00:39:38,848 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.testb ... using "latin-1" instead.
2021-02-13 00:39:42,040 UTF-8 can't read: /home/michael1441/.flair/datasets/conll_03_german/deu.testa ... using "latin-1" instead.
2021-02-13 00:39:44,260 {b'<unk>': 0, b'O': 1, b'B-PER': 2, b'E-PER': 3, b'S-LOC': 4, b'B-MISC': 5, b'I-MISC': 6, b'E-MISC': 7, b'S-MISC': 8, b'S-PER': 9, b'B-ORG': 10, b'E-ORG': 11, b'S-ORG': 12, b'I-ORG': 13, b'B-LOC': 14, b'E-LOC': 15, b'I-PER': 16, b'I-LOC': 17, b'<START>': 18, b'<STOP>': 19}
2021-02-13 00:39:44,261 Corpus: 51821 train + 11344 dev + 13556 test sentences
2021-02-13 00:39:54.091395: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-02-13 00:39:54.091451: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:39:59,295 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M
[2021-02-13 00:40:00,506 INFO] loading vectors from /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:00,962 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:00,962 INFO] loaded /home/michael1441/.flair/embeddings/de-wiki-fasttext-300d-1M
2021-02-13 00:40:01,248 Loading pretraining best model
2021-02-13 00:40:01,248 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:15,667 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M
[2021-02-13 00:40:16,761 INFO] loading vectors from /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:17,212 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:17,212 INFO] loaded /home/michael1441/.flair/embeddings/en-fasttext-news-300d-1M
2021-02-13 00:40:17,499 Loading pretraining best model
2021-02-13 00:40:17,499 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_en_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:33,065 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M
[2021-02-13 00:40:34,321 INFO] loading vectors from /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:34,766 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:34,767 INFO] loaded /home/michael1441/.flair/embeddings/es-wiki-fasttext-300d-1M
2021-02-13 00:40:35,060 Loading pretraining best model
2021-02-13 00:40:35,060 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_es_monolingual_crf_sentloss_10patience_baseline_nodev_ner1/best-model.pt
/home/michael1441/projects/MultilangStructureKD/flair/utils/params.py:104: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  dict_merge.dict_merge(params_dict, yaml.load(f))
/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/urllib3/connectionpool.py:988: InsecureRequestWarning: Unverified HTTPS request is being made to host '127.0.0.1'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning,
[2021-02-13 00:40:45,102 INFO] loading Word2VecKeyedVectors object from /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M
[2021-02-13 00:40:52,235 INFO] loading vectors from /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M.vectors.npy with mmap=None
[2021-02-13 00:40:52,657 INFO] setting ignored attribute vectors_norm to None
[2021-02-13 00:40:52,657 INFO] loaded /home/michael1441/.flair/embeddings/nl-wiki-fasttext-300d-1M
2021-02-13 00:40:52,941 Loading pretraining best model
2021-02-13 00:40:52,941 loading file resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_nl_monolingual_crf_sentloss_10patience_baseline_nodev_ner0/best-model.pt
2021-02-13 00:42:18,237 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Model: "FastSequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): TransformerWordEmbeddings(
      (model): BertModel(
        (embeddings): BertEmbeddings(
          (word_embeddings): Embedding(119547, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (token_type_embeddings): Embedding(2, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): BertEncoder(
          (layer): ModuleList(
            (0): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (1): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (2): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (3): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (4): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (5): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (6): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (7): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (8): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (9): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (10): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (11): BertLayer(
              (attention): BertAttention(
                (self): BertSelfAttention(
                  (query): Linear(in_features=768, out_features=768, bias=True)
                  (key): Linear(in_features=768, out_features=768, bias=True)
                  (value): Linear(in_features=768, out_features=768, bias=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
                (output): BertSelfOutput(
                  (dense): Linear(in_features=768, out_features=768, bias=True)
                  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                  (dropout): Dropout(p=0.1, inplace=False)
                )
              )
              (intermediate): BertIntermediate(
                (dense): Linear(in_features=768, out_features=3072, bias=True)
              )
              (output): BertOutput(
                (dense): Linear(in_features=3072, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
          )
        )
        (pooler): BertPooler(
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (activation): Tanh()
        )
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=768, out_features=768, bias=True)
  (rnn): LSTM(768, 600, bidirectional=True)
  (linear): Linear(in_features=1200, out_features=20, bias=True)
)"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Corpus: "Corpus: 51821 train + 11344 dev + 13556 test sentences"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,239 Parameters:
2021-02-13 00:42:18,239  - learning_rate: "0.1"
2021-02-13 00:42:18,239  - mini_batch_size: "2000"
2021-02-13 00:42:18,239  - patience: "10"
2021-02-13 00:42:18,239  - anneal_factor: "0.5"
2021-02-13 00:42:18,239  - max_epochs: "300"
2021-02-13 00:42:18,239  - shuffle: "True"
2021-02-13 00:42:18,239  - train_with_dev: "False"
2021-02-13 00:42:18,239 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Model training base path: "resources/taggers/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0"
2021-02-13 00:42:18,240 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Device: cuda:0
2021-02-13 00:42:18,240 ----------------------------------------------------------------------------------------------------
2021-02-13 00:42:18,240 Embeddings storage mode: cpu
2021-02-13 00:42:18,241 Distilling sentences as targets...
[2021-02-13 00:47:19,728 WARNING] Token indices sequence length is longer than the specified maximum sequence length for this model (916 > 512). Running this sequence through the model will result in indexing errors
[2021-02-13 00:47:19,777 WARNING] Token indices sequence length is longer than the specified maximum sequence length for this model (3710 > 512). Running this sequence through the model will result in indexing errors
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [149,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 294, in train
    train_data=self.assign_pretrained_teacher_targets(coupled_train_data,self.teachers,best_k=best_k)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 750, in assign_pretrained_teacher_targets
    logits=teacher.forward(teacher_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 667, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 177, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 89, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 3390, in _add_embeddings_internal
    all_encoder_layers = self.model(all_input_ids, attention_mask=all_input_masks)[
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 712, in forward
    embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 268, in forward
    embeddings = words_embeddings + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered

My forked repo is at https://github.com/mlej8/MultilangStructureKD and the config files that I used are:

* config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0.yaml

* config/multi_bert_flair_2000batch_1lr_de_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml

* config/multi_bert_flair_2000batch_1lr_en_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml

* config/multi_bert_flair_2000batch_1lr_es_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner0.yaml

* config/multi_bert_flair_2000batch_1lr_nl_monolingual_nocrf_sentloss_10patience_baseline_nodev_ner1.yaml

The bug is still caused by the long sequences in the dataset. Can you try loading the pretrained model with TransformerWordEmbeddings? If that fails, you may need to retrain the teacher model with that class. By the way, I tried training with your shared config file and the code ran correctly, so I'm not sure what causes the bug when you train the teacher model.
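One way to check whether over-long inputs are involved is to count subtokens per sentence before the forward pass. The sketch below is only an illustration under stated assumptions: `find_overlong` and `toy_subtokenize` are hypothetical helpers that are not part of this repo, the 512-position limit is the usual value for BERT-base models, and two positions are assumed to be reserved for [CLS] and [SEP]. In practice you would pass the real tokenizer's word-level tokenize function instead of the toy one.

```python
from typing import Callable, List, Sequence, Tuple


def toy_subtokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: 3-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)] or [word]


def find_overlong(
    sentences: Sequence[Sequence[str]],
    subtokenize: Callable[[str], List[str]] = toy_subtokenize,
    max_positions: int = 512,   # assumed position-embedding size
    num_special: int = 2,       # assumed [CLS] and [SEP]
) -> List[Tuple[int, int]]:
    """Return (sentence index, subtoken count) for sentences over the budget."""
    budget = max_positions - num_special
    offenders = []
    for idx, words in enumerate(sentences):
        n_subtokens = sum(len(subtokenize(w)) for w in words)
        if n_subtokens > budget:
            offenders.append((idx, n_subtokens))
    return offenders


if __name__ == "__main__":
    corpus = [["hello", "world"], ["abcdef"] * 300]
    for idx, n in find_overlong(corpus):
        print(f"sentence {idx} has {n} subtokens")
```

If this reports any sentence, an out-of-range position index in the embedding lookup is a plausible source of the `srcIndex < srcSelectDimSize` assertion.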

I tried loading the pretrained model (from Google Drive) with TransformerWordEmbeddings, and the error above (CUDA error: device-side assert triggered) is what I got. We are using the official CoNLL 2003 dataset.

Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 294, in train
    train_data=self.assign_pretrained_teacher_targets(coupled_train_data,self.teachers,best_k=best_k)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 750, in assign_pretrained_teacher_targets
    logits=teacher.forward(teacher_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 178, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 3776, in _add_embeddings_internal
    all_encoder_layers = self.model(all_input_ids, attention_mask=all_input_masks)[
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 712, in forward
    embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/pytorch_transformers/modeling_bert.py", line 268, in forward
    embeddings = words_embeddings + position_embeddings + token_type_embeddings

This seems to be calling the _add_embeddings_internal() function in the BertEmbeddings class although I have specified TransformerWordEmbeddings in the config files.

The config file I shared with you is the one for training the DE Teacher monolingual model with TransformerWordEmbeddings. I was not able to run it due to the following error:

2021-02-13 13:11:32,986 ----------------------------------------------------------------------------------------------------
2021-02-13 13:11:32,986 Corpus: "Corpus: 12705 train + 3068 dev + 3160 test sentences"
2021-02-13 13:11:32,986 ----------------------------------------------------------------------------------------------------
2021-02-13 13:11:32,986 Parameters:
2021-02-13 13:11:32,986  - learning_rate: "0.1"
2021-02-13 13:11:32,986  - mini_batch_size: "2000"
2021-02-13 13:11:32,986  - patience: "10"
2021-02-13 13:11:32,986  - anneal_factor: "0.5"
2021-02-13 13:11:32,986  - max_epochs: "300"
2021-02-13 13:11:32,986  - shuffle: "True"
2021-02-13 13:11:32,986  - train_with_dev: "False"
2021-02-13 13:11:32,986 ----------------------------------------------------------------------------------------------------
2021-02-13 13:11:32,987 Model training base path: "resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0"
2021-02-13 13:11:32,987 ----------------------------------------------------------------------------------------------------
2021-02-13 13:11:32,987 Device: cuda:0
2021-02-13 13:11:32,987 ----------------------------------------------------------------------------------------------------
2021-02-13 13:11:32,987 Embeddings storage mode: cpu
2021-02-13 13:11:33,857 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 396, in train
    loss = self.model.forward_loss(student_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 526, in forward_loss
    features = self.forward(data_points)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 178, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1237, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1354, in _add_embeddings_to_sentences
    truncation=True,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2438, in encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 472, in _encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 385, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
> /home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py(410)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) 
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

For the pretrained model, I'm sorry for the mistake. The code will still load the old class because the model was trained with that class.

As for the bug with your config file, it is quite strange, since the error is raised inside a transformers function. Please make sure that transformers==3.0.0 is installed, as higher versions of transformers have a bug in the tokenizer.encode_plus function (this issue).
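A quick way to verify the installed version from inside the environment is sketched below. The helper names `version_tuple`, `matches_pin` and `check_transformers` are made up for this example; the only library attribute relied on is `transformers.__version__`.

```python
import re
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric components, e.g. '3.0.2' -> (3, 0, 2)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def matches_pin(installed: str, pinned: str = "3.0.0") -> bool:
    """True when the installed version equals the pinned one."""
    return version_tuple(installed) == version_tuple(pinned)


def check_transformers(pinned: str = "3.0.0") -> str:
    """Describe whether the importable transformers package matches the pin."""
    try:
        import transformers
    except ImportError:
        return "transformers is not installed"
    installed = transformers.__version__
    if matches_pin(installed, pinned):
        return f"transformers {installed} matches the pin"
    return f"transformers {installed} found, expected {pinned}"


if __name__ == "__main__":
    print(check_transformers())
```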

Currently, I have two suggestions for running our code:

* Remove the sentences in the training set that have more than 510 subtokens, and chunk the long sentences in the test set (this is what I did for the ACL paper).

* Train the teacher models again with TransformerWordEmbeddings.
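The first suggestion could look roughly like the sketch below. It is not the preprocessing used for the paper: `drop_long_sentences`, `chunk_sentence` and `toy_subtokenize` are hypothetical helpers, sentences are assumed to be lists of words, and the toy tokenizer should be replaced by the real tokenizer's word-level tokenize function.

```python
from typing import Callable, List, Sequence


def toy_subtokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: 3-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)] or [word]


def subtoken_len(words: Sequence[str],
                 subtokenize: Callable[[str], List[str]]) -> int:
    """Total number of subtokens in one sentence."""
    return sum(len(subtokenize(w)) for w in words)


def drop_long_sentences(sentences: Sequence[Sequence[str]],
                        subtokenize: Callable[[str], List[str]] = toy_subtokenize,
                        max_subtokens: int = 510) -> List[Sequence[str]]:
    """Training set: keep only sentences within the subtoken budget."""
    return [s for s in sentences
            if subtoken_len(s, subtokenize) <= max_subtokens]


def chunk_sentence(words: Sequence[str],
                   subtokenize: Callable[[str], List[str]] = toy_subtokenize,
                   max_subtokens: int = 510) -> List[List[str]]:
    """Test set: split a sentence on word boundaries into chunks within budget.

    A single word longer than the budget still ends up in its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for word in words:
        n = len(subtokenize(word))
        if current and used + n > max_subtokens:
            chunks.append(current)
            current, used = [], 0
        current.append(word)
        used += n
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    long_sentence = ["abcdef"] * 300  # 600 toy subtokens
    print(len(drop_long_sentences([["hi"], long_sentence])))
    print([len(c) for c in chunk_sentence(long_sentence)])
```

Splitting on word boundaries keeps every word's subtokens in one chunk, so the per-word predictions of the chunks can be concatenated back in order.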

I am unable to train the teacher models with TransformerWordEmbeddings due to the following error:

2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Corpus: "Corpus: 12705 train + 3068 dev + 3160 test sentences"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Parameters:
2021-02-14 00:23:16,232  - learning_rate: "0.1"
2021-02-14 00:23:16,232  - mini_batch_size: "2000"
2021-02-14 00:23:16,232  - patience: "10"
2021-02-14 00:23:16,232  - anneal_factor: "0.5"
2021-02-14 00:23:16,232  - max_epochs: "300"
2021-02-14 00:23:16,232  - shuffle: "True"
2021-02-14 00:23:16,232  - train_with_dev: "False"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Model training base path: "resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Device: cuda:0
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Embeddings storage mode: cpu
2021-02-14 00:23:17,075 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 396, in train
    loss = self.model.forward_loss(student_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 526, in forward_loss
    features = self.forward(data_points)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 178, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1236, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1353, in _add_embeddings_to_sentences
    truncation=True,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2438, in encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 472, in _encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 385, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
> /home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py(410)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) 
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

I am unable to train the teacher models with TransformerWordEmbeddings due to the following error:

2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Corpus: "Corpus: 12705 train + 3068 dev + 3160 test sentences"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Parameters:
2021-02-14 00:23:16,232  - learning_rate: "0.1"
2021-02-14 00:23:16,232  - mini_batch_size: "2000"
2021-02-14 00:23:16,232  - patience: "10"
2021-02-14 00:23:16,232  - anneal_factor: "0.5"
2021-02-14 00:23:16,232  - max_epochs: "300"
2021-02-14 00:23:16,232  - shuffle: "True"
2021-02-14 00:23:16,232  - train_with_dev: "False"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Model training base path: "resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0"
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Device: cuda:0
2021-02-14 00:23:16,232 ----------------------------------------------------------------------------------------------------
2021-02-14 00:23:16,232 Embeddings storage mode: cpu
2021-02-14 00:23:17,075 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 396, in train
    loss = self.model.forward_loss(student_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 526, in forward_loss
    features = self.forward(data_points)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 178, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1236, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1353, in _add_embeddings_to_sentences
    truncation=True,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2438, in encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 472, in _encode_plus
    **kwargs,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 385, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
> /home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py(410)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) 
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

Please make sure you are using transformers==3.0.0, as higher versions of transformers have a bug in the tokenizer.encode_plus function.
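One way to catch this mismatch before a long training run is a small version check. The snippet below is only a sketch: the helper name `check_transformers_version` is made up for illustration and is not part of this repository. It assumes only that the installed package exposes `transformers.__version__`.

```python
# Hypothetical helper, not part of MultilangStructureKD.
REQUIRED_TRANSFORMERS = "3.0.0"  # the version named in the reply above


def check_transformers_version(installed, required=REQUIRED_TRANSFORMERS):
    """Return None when the versions match, otherwise a hint string."""
    if installed == required:
        return None
    return (
        f"transformers=={installed} found, but transformers=={required} "
        f"is expected; try: pip install transformers=={required}"
    )


if __name__ == "__main__":
    try:
        import transformers
    except ImportError:
        print("transformers is not installed in this environment")
    else:
        print(check_transformers_version(transformers.__version__) or "OK")
```

Running this inside the conda environment used for training prints either "OK" or the pip command needed to pin the version.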

Using transformers==3.0.0 with TransformerWordEmbeddings, I was able to train the Dutch and German teacher models. However, I wasn't able to train the English and Spanish models because of a CUDA out-of-memory error. Do you have any idea how to solve this?

2021-02-14 19:22:37,556 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:37,556 Corpus: "Corpus: 8323 train + 1915 dev + 1517 test sentences"
2021-02-14 19:22:37,556 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:37,556 Parameters:
2021-02-14 19:22:37,556  - learning_rate: "0.1"
2021-02-14 19:22:37,556  - mini_batch_size: "2000"
2021-02-14 19:22:37,556  - patience: "10"
2021-02-14 19:22:37,556  - anneal_factor: "0.5"
2021-02-14 19:22:37,556  - max_epochs: "300"
2021-02-14 19:22:37,557  - shuffle: "True"
2021-02-14 19:22:37,557  - train_with_dev: "False"
2021-02-14 19:22:37,557 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:37,557 Model training base path: "resources/taggers/multi_bert_origflair_300epoch_2000batch_1lr_256hidden_es_monolingual_crf_sentloss_10patience_baseline_nodev_ner1"
2021-02-14 19:22:37,557 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:37,557 Device: cuda:0
2021-02-14 19:22:37,557 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:37,557 Embeddings storage mode: cpu
2021-02-14 19:22:38,215 ----------------------------------------------------------------------------------------------------
2021-02-14 19:22:39,766 epoch 1 - iter 0/134 - loss 174.48507690 - samples/sec: 24.51 - decode_sents/sec: 932.01
2021-02-14 19:23:00,533 epoch 1 - iter 13/134 - loss 45.69557244 - samples/sec: 39.05 - decode_sents/sec: 20251.66
2021-02-14 19:23:19,449 epoch 1 - iter 26/134 - loss 35.56369404 - samples/sec: 38.02 - decode_sents/sec: 18606.88
2021-02-14 19:23:41,606 epoch 1 - iter 39/134 - loss 32.26693780 - samples/sec: 34.39 - decode_sents/sec: 18411.52
2021-02-14 19:24:00,807 epoch 1 - iter 52/134 - loss 28.20729213 - samples/sec: 43.77 - decode_sents/sec: 18713.17
2021-02-14 19:24:19,737 epoch 1 - iter 65/134 - loss 25.30822932 - samples/sec: 35.33 - decode_sents/sec: 17169.03
2021-02-14 19:24:38,643 epoch 1 - iter 78/134 - loss 22.96082053 - samples/sec: 32.13 - decode_sents/sec: 14861.81
2021-02-14 19:25:04,029 epoch 1 - iter 91/134 - loss 26.54964065 - samples/sec: 22.15 - decode_sents/sec: 12904.13
2021-02-14 19:25:23,001 epoch 1 - iter 104/134 - loss 24.48356327 - samples/sec: 36.26 - decode_sents/sec: 18324.86
2021-02-14 19:25:42,105 epoch 1 - iter 117/134 - loss 22.68886979 - samples/sec: 31.69 - decode_sents/sec: 16187.06
Traceback (most recent call last):
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 396, in train
    loss = self.model.forward_loss(student_input)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 526, in forward_loss
    features = self.forward(data_points)
  File "/home/michael1441/projects/MultilangStructureKD/flair/models/sequence_tagger_model.py", line 665, in forward
    self.embeddings.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 178, in embed
    embedding.embed(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 90, in embed
    self._add_embeddings_internal(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1236, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/michael1441/projects/MultilangStructureKD/flair/embeddings.py", line 1395, in _add_embeddings_to_sentences
    hidden_states = self.model(input_ids, attention_mask=mask)[-1]
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/site-packages/transformers/modeling_bert.py", line 239, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: CUDA out of memory. Tried to allocate 246.00 MiB (GPU 0; 31.72 GiB total capacity; 30.18 GiB already allocated; 229.56 MiB free; 255.88 MiB cached)
> /home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py(410)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) 
Traceback (most recent call last):
  File "train_with_teacher.py", line 246, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/projects/MultilangStructureKD/flair/trainers/distillation_trainer.py", line 410, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 88, in trace_dispatch
    return self.dispatch_line(frame)
  File "/home/michael1441/.conda/envs/KD/lib/python3.7/bdb.py", line 113, in dispatch_line
    if self.quitting: raise BdbQuit
bdb.BdbQuit

The error is caused by the GPU running out of memory. You can try a smaller batch size to avoid it.
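If it is unclear how small the batch size needs to be, one generic approach is to halve it until a run fits in memory. The sketch below is not how this repository's trainer works (the trainer in the log above drops into pdb on an exception); `train_with_backoff` and `fake_train` are hypothetical names, and it assumes the training call raises a `RuntimeError` whose message contains "out of memory", as in the traceback above.

```python
# Hypothetical retry wrapper, not part of MultilangStructureKD.
def train_with_backoff(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each OOM error.

    Returns the first batch size for which train_fn completes.
    """
    while batch_size >= min_batch_size:
        try:
            train_fn(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only retry on out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            # With PyTorch, torch.cuda.empty_cache() could be called here
            # before retrying.
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


if __name__ == "__main__":
    # Stand-in for a real training call: pretends anything above 500 is too big.
    def fake_train(batch_size):
        if batch_size > 500:
            raise RuntimeError("CUDA out of memory.")

    print(train_with_backoff(fake_train, 2000))
```

In practice it is usually simpler to lower the `mini_batch_size` value shown in the log (2000 here) in the config by hand and restart the run.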

Hi, I was able to run all the experiments using TransformerWordEmbeddings. Thanks a lot for your help. I was wondering whether you forgot to upload the config file for Pos. + Top-WK?

Thanks for the reminder. I have uploaded the config and updated the guide for Pos. + Top-WK KD.
