I found that the losses of the [PAD] tokens were also computed when training the model. But in fact, we should filter them out, as https://github.com/kyzhouhzau/BERT-NER/blob/master/BERT_NER.py does. Why didn't you mask the losses of the [PAD] tokens?
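For reference, the masking can be done either by passing `ignore_index` to the loss function or by weighting the per-token losses with the attention mask. A minimal PyTorch sketch (the pad label id, shapes, and values here are illustrative assumptions, not taken from this repo):

```python
import torch
import torch.nn as nn

# Assumed setup: a batch of token-level logits and labels, where the
# label index used for [PAD] positions is pad_label_id (illustrative).
pad_label_id = 0
num_labels = 3
logits = torch.tensor([[[2.0, 0.5, 0.1],
                        [0.2, 1.5, 0.3],
                        [0.1, 0.1, 0.1]]])          # (batch=1, seq=3, labels=3)
labels = torch.tensor([[1, 2, pad_label_id]])       # last position is [PAD]

# Option 1: ignore_index makes CrossEntropyLoss skip [PAD] positions
# and average only over the remaining tokens.
masked_loss = nn.CrossEntropyLoss(ignore_index=pad_label_id)(
    logits.view(-1, num_labels), labels.view(-1))

# Option 2: compute per-token losses, zero out [PAD] positions with the
# attention mask, and normalize by the number of real tokens.
attention_mask = torch.tensor([[1.0, 1.0, 0.0]])
per_token = nn.CrossEntropyLoss(reduction="none")(
    logits.view(-1, num_labels), labels.view(-1))
weighted_loss = (per_token * attention_mask.view(-1)).sum() / attention_mask.sum()
```

Both options give the same training signal; `ignore_index` is the more idiomatic choice when [PAD] has a dedicated label id, while explicit mask weighting also works when [PAD] shares a label id with a real class.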