Hi @frederick0329, for sequence tagging (e.g. NER) one would need to predict label for each token in the sequence per a test sample. In this case, the loss is averaged across tokens and gradients of the last FFN can still be computed. I have two questions:
Do you think TracIn computations in this case would require any significant change ?
Would approximate nearest neighbor be usable in this case ?
Would fast random approximation (Appendix F) be usable as well ?
Hi @frederick0329, for sequence tagging (e.g. NER) one would need to predict label for each token in the sequence per a test sample. In this case, the loss is averaged across tokens and gradients of the last FFN can still be computed. I have two questions: