KAUSTIKR closed this issue 2 years ago
For adding support for a new language, this advice seems great: https://github.com/facebookresearch/CodeGen/issues/42#issuecomment-948737032
As for implementing a tokenizer for COBOL: in theory, any COBOL parser or compiler should contain a lexer that tokenizes the code as an intermediate step. You'll probably want to find a good open-source tool to get you started. This could be one: https://pypi.org/project/pygments-ibm-cobol-lexer/
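To give a feel for what that lexing step produces, here is a minimal standalone sketch using only Python's `re` module. It is not the Pygments lexer linked above and it is nowhere near a full COBOL lexer: it assumes free-format source with `*>` comments, ignores fixed-format column rules, and splits names that begin with a digit (such as `100-MAIN`). The function name `tokenize_cobol` and the token kinds are made up for illustration.

```python
import re

# Order matters: earlier alternatives win at a given position.
TOKEN_SPEC = [
    ("COMMENT", r"\*>[^\n]*"),               # free-format inline comment
    ("STRING", r'"[^"\n]*"' r"|'[^'\n]*'"),  # double- or single-quoted literal
    ("NUMBER", r"\d+(?:\.\d+)?"),            # integer or simple decimal
    ("WORD", r"[A-Za-z][A-Za-z0-9-]*"),      # keyword or user-defined name
    ("PERIOD", r"\."),                       # sentence terminator
    ("OP", r"[=<>+\-*/(),:]"),               # single-character operators
    ("SKIP", r"\s+"),                        # whitespace, discarded
    ("OTHER", r"."),                         # anything else, one char at a time
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize_cobol(source, keep_comments=False):
    """Return a list of (kind, text) pairs for free-format COBOL source."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP" or (kind == "COMMENT" and not keep_comments):
            continue
        text = match.group()
        if kind == "WORD":
            text = text.upper()  # COBOL words are case-insensitive
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for token in tokenize_cobol('MOVE "HI" TO ws-name. *> note\nADD 1 TO X.'):
        print(token)
```

A real lexer (from a compiler or from Pygments) would handle far more cases, but the output shape, a flat stream of typed tokens, is the part that matters for the next step.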
Then you will probably have to make some modifications so that it is compatible with the lang_processor classes here: https://github.com/facebookresearch/CodeGen/blob/main/codegen_sources/preprocessing/lang_processors/python_processor.py
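As a rough idea of what that adaptation might look like, here is a hypothetical `CobolProcessor` sketch. The class name is invented, and the method names `tokenize_code` and `detokenize_code` are my assumption about the interface the repo's processors expose, so check them against the linked `python_processor.py` before relying on them. The real class would presumably subclass the repo's base processor. This one is standalone so it can be run on its own, and its regex is only a crude stand-in for a proper COBOL lexer.

```python
import re


class CobolProcessor:
    """Hypothetical stand-in for a CodeGen-style language processor."""

    # Crude free-format COBOL token pattern (assumption, not a full lexer).
    _TOKEN_RE = re.compile(
        r"""\*>[^\n]*                  # free-format comment
          | "[^"\n]*" | '[^'\n]*'      # string literal
          | [A-Za-z0-9][A-Za-z0-9-]*   # word or integer
          | \S                         # any other single symbol
        """,
        re.VERBOSE,
    )

    def tokenize_code(self, code, keep_comments=False):
        """Split source text into a flat list of token strings."""
        tokens = []
        for match in self._TOKEN_RE.finditer(code):
            tok = match.group()
            if tok.startswith("*>") and not keep_comments:
                continue
            tokens.append(tok)
        return tokens

    def detokenize_code(self, tokens):
        """Rebuild source text, one COBOL sentence per line."""
        lines, current = [], []
        for tok in tokens:
            if tok == "." and current:
                lines.append(" ".join(current) + ".")
                current = []
            else:
                current.append(tok)
        if current:
            lines.append(" ".join(current))
        return "\n".join(lines)


if __name__ == "__main__":
    processor = CobolProcessor()
    toks = processor.tokenize_code('DISPLAY "HELLO WORLD". *> greet\nSTOP RUN.')
    print(toks)
    print(processor.detokenize_code(toks))
```

The useful property to aim for is that tokenizing the detokenized output gives back the same token list, since the pipeline needs to go both ways.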
How to create tokenizer for COBOL program