I am planning to apply Mean Teacher to a token classification problem. Since applying different noise to the teacher and student inputs is central to the approach, I am confused about how to calculate the consistency cost when the number of active logits differs between the two. For example, if I use synonym replacement as noise, it can lengthen the sentence (some tokens may be replaced by a two-token synonym) when applied to the teacher's input, and the same augmentation may generate a different sentence, of a different length, when applied to the student's input.
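To make the length mismatch concrete, here is a minimal sketch in NumPy. Everything in it is an assumption for illustration: the `SYNONYMS` table, the helper names `synonym_noise`, `fake_logits`, and `pool_to_original`, and the random logits that stand in for real model outputs. It also shows one possible workaround, not necessarily the right one: record which original token each augmented token came from, then average the logits back to the original length before taking the mean squared difference.

```python
import numpy as np

# Hypothetical synonym table; a synonym may span more than one token.
SYNONYMS = {"quick": ["very", "fast"], "happy": ["glad"]}

def synonym_noise(tokens, replace_at):
    """Replace tokens at the given positions.

    Returns the new token list and, for each new token, the index of the
    original token it came from (the alignment).
    """
    out, source = [], []
    for i, tok in enumerate(tokens):
        pieces = SYNONYMS.get(tok, [tok]) if i in replace_at else [tok]
        out.extend(pieces)
        source.extend([i] * len(pieces))
    return out, np.array(source)

def fake_logits(tokens, num_labels=3, seed=0):
    """Stand-in for a model forward pass: one logit row per input token."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(tokens), num_labels))

def pool_to_original(logits, source, n_original):
    """Average the logit rows that map to the same original token."""
    pooled = np.zeros((n_original, logits.shape[1]))
    counts = np.zeros(n_original)
    np.add.at(pooled, source, logits)
    np.add.at(counts, source, 1)
    return pooled / counts[:, None]

tokens = ["the", "quick", "dog", "is", "happy"]

# Student and teacher receive different noise, so lengths can differ.
student_tokens, student_src = synonym_noise(tokens, replace_at={1})
teacher_tokens, teacher_src = synonym_noise(tokens, replace_at={4})

student_logits = fake_logits(student_tokens, seed=1)  # 6 rows
teacher_logits = fake_logits(teacher_tokens, seed=2)  # 5 rows

# The shapes differ, so a position-wise cost cannot be computed directly.
same_shape = student_logits.shape == teacher_logits.shape

# Possible workaround (an assumption): pool both back to the original length.
student_pooled = pool_to_original(student_logits, student_src, len(tokens))
teacher_pooled = pool_to_original(teacher_logits, teacher_src, len(tokens))
consistency_cost = np.mean((student_pooled - teacher_pooled) ** 2)
```

This only works if the augmentation keeps track of the alignment, and averaging is just one pooling choice; taking the first token of each replaced span would be another. I am not sure either is what the method intends for token-level tasks.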