I found that the losses of the [PAD] tokens were also computed when training the model. But in fact, we should filter them out, as https://github.com/kyzhouhzau/BERT-NER/blob/master/BERT_NER.py does. Why didn't you mask the losses of the [PAD] tokens?
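For reference, the masking can be done either by passing `ignore_index` to the loss function or by weighting the per-token losses with the attention mask. A minimal PyTorch sketch (the pad label id, shapes, and values here are illustrative assumptions, not taken from this repo):

```python
import torch
import torch.nn as nn

# Assumed setup: a batch of token-level logits and labels, where the
# label index used for [PAD] positions is pad_label_id (illustrative).
pad_label_id = 0
num_labels = 3
logits = torch.tensor([[[2.0, 0.5, 0.1],
                        [0.2, 1.5, 0.3],
                        [0.1, 0.1, 0.1]]])          # (batch=1, seq=3, labels=3)
labels = torch.tensor([[1, 2, pad_label_id]])       # last position is [PAD]

# Option 1: ignore_index makes CrossEntropyLoss skip [PAD] positions
# and average only over the remaining tokens.
masked_loss = nn.CrossEntropyLoss(ignore_index=pad_label_id)(
    logits.view(-1, num_labels), labels.view(-1))

# Option 2: compute per-token losses, zero out [PAD] positions with the
# attention mask, and normalize by the number of real tokens.
attention_mask = torch.tensor([[1.0, 1.0, 0.0]])
per_token = nn.CrossEntropyLoss(reduction="none")(
    logits.view(-1, num_labels), labels.view(-1))
weighted_loss = (per_token * attention_mask.view(-1)).sum() / attention_mask.sum()
```

Both options give the same training signal; `ignore_index` is the more idiomatic choice when [PAD] has a dedicated label id, while explicit mask weighting also works when [PAD] shares a label id with a real class.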