facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.
MIT License
710 stars, 144 forks

How to get tokenizer for COBOL #48

Closed KAUSTIKR closed 2 years ago

KAUSTIKR commented 2 years ago

How do I create a tokenizer for COBOL programs?

AlexShypula commented 2 years ago

For adding support for a new language in general, this advice seems great: https://github.com/facebookresearch/CodeGen/issues/42#issuecomment-948737032.

As for implementing a tokenizer for COBOL: in theory, any COBOL parser or compiler contains a lexer that tokenizes the code as an intermediate step. You'll probably want to find a good open-source tool to get you started. This could be one: https://pypi.org/project/pygments-ibm-cobol-lexer/

You will then probably have to modify it so that it is compatible with the lang_processor classes, such as this one: https://github.com/facebookresearch/CodeGen/blob/main/codegen_sources/preprocessing/lang_processors/python_processor.py
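
To make the shape of that adaptation more concrete, here is a minimal sketch of a processor-style wrapper. Everything in it is an assumption for illustration: the class name `CobolProcessorSketch` is hypothetical, the method names `tokenize_code` and `detokenize_code` are meant to mirror what the linked processor appears to expose (check the repo for the real signatures), and a small hand-written regex lexer stands in for a real COBOL lexer such as the one linked above. It only handles free-format source with `*>` comments and words that start with a letter, so it is nowhere near a full COBOL grammar.

```python
import re

# Hypothetical token pattern for free-format COBOL; not a complete grammar.
_TOKEN_RE = re.compile(
    r"""
    (?P<comment>\*>[^\n]*)                              # inline comment: *> to end of line
  | (?P<string>"(?:[^"\n]|"")*"|'(?:[^'\n]|'')*')       # quoted literal, doubled quote as escape
  | (?P<number>\d+(?:\.\d+)?)                           # unsigned integer or decimal literal
  | (?P<word>[A-Za-z][A-Za-z0-9]*(?:-+[A-Za-z0-9]+)*)   # COBOL word, hyphens allowed inside
  | (?P<op>\*\*|<=|>=|<>|[-+*/=<>().,:;])               # operators and separators
    """,
    re.VERBOSE,
)


class CobolProcessorSketch:
    """Hypothetical stand-in for a lang_processor-style class (assumed interface)."""

    def tokenize_code(self, code, keep_comments=False):
        """Split COBOL source into a flat list of token strings."""
        tokens = []
        pos = 0
        while pos < len(code):
            if code[pos].isspace():
                pos += 1
                continue
            match = _TOKEN_RE.match(code, pos)
            if match is None:
                raise ValueError(f"Unexpected character {code[pos]!r} at offset {pos}")
            kind, text = match.lastgroup, match.group()
            if kind == "word":
                # COBOL words are case-insensitive, so normalize them.
                text = text.upper()
            if kind != "comment" or keep_comments:
                tokens.append(text)
            pos = match.end()
        return tokens

    def detokenize_code(self, tokens):
        """Rebuild readable source: one sentence per line, periods attached."""
        lines, current = [], []
        for tok in tokens:
            if tok == "." and current:
                current[-1] += "."
            else:
                current.append(tok)
            # A period ends a sentence; a comment must end its line.
            if tok == "." or tok.startswith("*>"):
                lines.append(" ".join(current))
                current = []
        if current:
            lines.append(" ".join(current))
        return "\n".join(lines)


sample = """IDENTIFICATION DIVISION.
PROGRAM-ID. hello.   *> sample program
PROCEDURE DIVISION.
    DISPLAY "Hello, world".
    STOP RUN.
"""

processor = CobolProcessorSketch()
sample_tokens = processor.tokenize_code(sample)
print(sample_tokens)
print(processor.detokenize_code(sample_tokens))
```

In a real port you would likely swap the regex for the tokens produced by an existing lexer and keep only the wrapper logic (comment filtering, normalization, detokenization). A useful sanity check either way is that tokenizing the detokenized output gives back the same token list.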