
DJjjjhao / FIRA-ICSE

This repository is the replication package of the ICSE22 paper "FIRA: Fine-Grained Graph-Based Code Change Representation for Automated Commit Message Generation"

31 stars 5 forks

word_vocab.json #4

Open vanessailana opened 1 year ago

vanessailana commented 1 year ago

I need some clarity regarding using a language model's word vocabulary from its training data. Is it essential to stick to the exact vocabulary during usage? Your insights would be much appreciated.

DJjjjhao commented 1 year ago

Do you mean how to construct the vocabulary? You can first anonymize the identifiers (e.g., class names and method names), then count how often each token occurs in the training set, and keep in the final vocabulary only the tokens that appear more often than a frequency threshold (e.g., 5 times).
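
The steps described above could be sketched roughly as follows. This is only an illustration and not the code from this repository: the function names, the `ID0`/`ID1` placeholder scheme, and the `<pad>`/`<unk>` special tokens are all assumptions made for the example.

```python
from collections import Counter

# Assumed special tokens; the real word_vocab.json may use different ones.
SPECIAL_TOKENS = ["<pad>", "<unk>"]


def anonymize(tokens, identifiers):
    """Replace each identifier with a placeholder (hypothetical ID0, ID1, ... scheme)."""
    mapping = {}
    result = []
    for tok in tokens:
        if tok in identifiers:
            if tok not in mapping:
                mapping[tok] = f"ID{len(mapping)}"
            result.append(mapping[tok])
        else:
            result.append(tok)
    return result


def build_vocab(samples, min_freq=5):
    """Keep tokens that occur more than min_freq times in the training samples."""
    counts = Counter(tok for sample in samples for tok in sample)
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Sort by descending frequency, then alphabetically, so indices are reproducible.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq > min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Map tokens to indices; tokens outside the vocabulary fall back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]
```

In a setup like this, a trained model's embedding rows are tied to the vocabulary indices, so the same token-to-index mapping generally has to be reused at inference time. Tokens that were filtered out or never seen map to the unknown token.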