Open vanessailana opened 1 year ago
Do you mean how to construct the vocabulary? You can first anonymize the identifiers (e.g., class names and method names), then count how often each token occurs in the training set and apply a frequency threshold (e.g., 5 occurrences). Only the tokens that appear more often than the threshold are kept in the final vocabulary.
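A rough sketch of the frequency filtering step, in case it helps. It assumes the token sequences have already been anonymized, and the names `build_vocab`, `encode`, `min_count`, and the `<pad>`/`<unk>` special tokens are my own placeholders, not something taken from the project's code. Mapping unseen tokens to `<unk>` is a common convention, assumed here rather than confirmed.

```python
from collections import Counter


def build_vocab(token_sequences, min_count=5, specials=("<pad>", "<unk>")):
    """Keep tokens that occur more than `min_count` times in the training set."""
    # Count every token across all training sequences.
    counts = Counter(tok for seq in token_sequences for tok in seq)

    # Special tokens get the first ids (hypothetical convention).
    vocab = {tok: idx for idx, tok in enumerate(specials)}

    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n > min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, unk="<unk>"):
    """Map tokens to ids; anything outside the vocabulary becomes `unk`."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]
```

With this setup, a token seen exactly `min_count` times is dropped, and any token missing from the vocabulary at usage time falls back to the `<unk>` id instead of raising an error.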
I need some clarity on using the word vocabulary that a language model built from its training data. Is it essential to stick to that exact vocabulary when using the model? Your insights would be much appreciated.