Great paper!
The TensorFlow documentation says the EMA variables are created with (trainable=False) and added to the GraphKeys.ALL_VARIABLES collection. Since they are not trainable, no gradient is applied to them; I understand that. However, they depend on the current trainable variables of the graph, and therefore so do the predictions of the teacher network. Does this mean an additional gradient flows to the trainable variables because the EMA depends on them? Is this a correct understanding of the implementation?
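
To make the question concrete, here is a minimal sketch of the two cases I am trying to distinguish. It uses a toy scalar model in plain Python, not the actual implementation: the student predicts `w * x`, the teacher predicts `w_ema * x`, and all names (`w`, `w_ema_prev`, `alpha`, `x`) are hypothetical and chosen only for illustration. The first gradient treats the teacher weight as a constant; the second also differentiates through the EMA step, which is the extra gradient path I am asking about.

```python
# Toy scalar illustration (an assumption for discussion, not the paper's code).

def ema_update(w_ema_prev, w, alpha):
    """One EMA step: w_ema = alpha * w_ema_prev + (1 - alpha) * w."""
    return alpha * w_ema_prev + (1.0 - alpha) * w


def consistency_loss(w, w_ema, x):
    """Squared difference between student and teacher predictions."""
    return (w * x - w_ema * x) ** 2


def grad_teacher_constant(w, w_ema_prev, x, alpha):
    """d(loss)/dw when the teacher weight is treated as a constant."""
    w_ema = ema_update(w_ema_prev, w, alpha)
    return 2.0 * (w - w_ema) * x * x


def grad_through_ema(w, w_ema_prev, x, alpha):
    """d(loss)/dw when the EMA step is differentiated too.

    Here d(w_ema)/dw = 1 - alpha, so the chain rule scales the
    constant-teacher gradient by 1 - (1 - alpha).
    """
    w_ema = ema_update(w_ema_prev, w, alpha)
    return 2.0 * (w - w_ema) * x * x * (1.0 - (1.0 - alpha))


if __name__ == "__main__":
    w, w_ema_prev, x, alpha = 2.0, 1.0, 3.0, 0.5
    print("teacher as constant:", grad_teacher_constant(w, w_ema_prev, x, alpha))
    print("through the EMA:   ", grad_through_ema(w, w_ema_prev, x, alpha))
```

With these toy values the two gradients differ, so my question is which of the two cases the implementation corresponds to.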