facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.
MIT License

Lang embeddings loading #87

Open voiteshonok opened 1 year ago

voiteshonok commented 1 year ago

The preprocessing command is:

!python -m codegen_sources.preprocessing.preprocess data/test_dataset/ --langs cpp java python --mode=monolingual --local=True --fastbpe_vocab_path=/content/CodeGen/data/bpe/cpp-java-python/vocab --fastbpe_code_path=/content/CodeGen/data/bpe/cpp-java-python/codes --bpe_mode=fast --train_splits=1 --percent_test_valid=10

When you resume training TransCoder from a previous checkpoint, the log contains lines like these:

INFO - 03/01/23 08:40:48 - 0:00:09 - ============ Model Reloading
INFO - 03/01/23 08:40:48 - 0:00:09 - Reloading encoder from /content/drive/MyDrive/transcoder/transcoder/l2hpmxrljh/checkpoint.pth ...
WARNING - 03/01/23 08:41:13 - 0:00:33 - No match found for lang cpp_sa cpp in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.
WARNING - 03/01/23 08:41:13 - 0:00:33 - No match found for lang java_sa java in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.
WARNING - 03/01/23 08:41:13 - 0:00:33 - No match found for lang python_sa python in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.
INFO - 03/01/23 08:41:13 - 0:00:33 - Reloading decoders from /content/drive/MyDrive/transcoder/transcoder/l2hpmxrljh/checkpoint.pth ...
WARNING - 03/01/23 08:41:28 - 0:00:49 - No match found for lang cpp_sa cpp in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.
WARNING - 03/01/23 08:41:28 - 0:00:49 - No match found for lang java_sa java in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.
WARNING - 03/01/23 08:41:28 - 0:00:49 - No match found for lang python_sa python in dict_keys(['cpp_sa', 'java_sa', 'python_sa']). Initializing randomly.

I don't think this is the desired behavior: the checkpoint already contains embeddings for cpp_sa, java_sa and python_sa, yet they are initialized randomly. It looks like a consequence of the language mapping at https://github.com/facebookresearch/CodeGen/blob/6e93aca63e7bc77287c9965a5080456326651237/codegen_sources/model/src/model/__init__.py#L414

if lang in lang_mapping:
    lang_ = lang_mapping[lang]
else:
    lang_ = lang

Would simply using lang_ = lang allow the previous embeddings to be reused, or is something else wrong?
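
To make the question concrete, here is a minimal sketch of a lookup that tries the mapped name first and falls back to the original name when the mapped one is not in the checkpoint. It is not the repository's code: the helper name `resolve_checkpoint_lang` is hypothetical, and the contents of `lang_mapping` are only inferred from the warnings above (cpp_sa is looked up as cpp, and so on).

```python
from typing import Dict, Iterable, Optional


def resolve_checkpoint_lang(
    lang: str,
    lang_mapping: Dict[str, str],
    checkpoint_langs: Iterable[str],
) -> Optional[str]:
    """Return the checkpoint key whose embedding could be reused for `lang`.

    Hypothetical helper: prefers the mapped name, falls back to `lang`
    itself, and returns None when neither is present in the checkpoint.
    """
    available = set(checkpoint_langs)
    mapped = lang_mapping.get(lang, lang)
    if mapped in available:
        return mapped
    if lang in available:
        return lang
    return None  # the caller would initialize randomly in this case


# Language keys reported in the warnings above.
checkpoint_langs = ["cpp_sa", "java_sa", "python_sa"]
# Assumed mapping, inferred from "No match found for lang cpp_sa cpp".
lang_mapping = {"cpp_sa": "cpp", "java_sa": "java", "python_sa": "python"}

for lang in checkpoint_langs:
    print(lang, "->", resolve_checkpoint_lang(lang, lang_mapping, checkpoint_langs))
```

With a fallback like this, the mapping would still apply to checkpoints that only contain cpp, java and python, while a checkpoint that already has the _sa languages would reuse them directly. Whether that matches the intent of the mapping is exactly what I am unsure about.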