NaturalCC: An Open-Source Toolkit for Code Intelligence
http://xcodemind.github.io
MIT License



# NaturalCC - Natural Code Comprehension

📖 Vision

NaturalCC is a sequence modeling toolkit designed to bridge the gap between programming and natural languages through advanced machine learning techniques. It allows researchers and developers to train custom models for a variety of software engineering tasks, e.g., code generation, code completion, code summarization, code retrieval, code clone detection, and type inference.

🌟 Key Features:

✨ Latest News

🛠️ Installation Guide

To get started with NaturalCC, ensure your system meets the following requirements:

Follow these steps to set up the environment.

  1. (Optional) Creating conda environment

    conda create -n naturalcc python=3.6
    conda activate naturalcc
  2. Building NaturalCC from source

    git clone https://github.com/CGCL-codes/naturalcc && cd naturalcc
    pip install -r requirements.txt
    cd src
    pip install --editable ./
  3. Installing Additional Dependencies

    conda install conda-forge::libsndfile
    pip install -q -U git+https://github.com/huggingface/transformers.git
    pip install -q -U git+https://github.com/huggingface/accelerate.git
  4. HuggingFace Token for Certain Models

    For models like StarCoder, a HuggingFace token is required. Log in to HuggingFace using:

    huggingface-cli login
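
After the steps above, a quick import check can confirm that the environment is usable before running the examples. The following is a minimal sketch and not part of NaturalCC: the helper name `missing_packages` is hypothetical, and the package list assumes the toolkit is importable as `ncc` (the directory name used later in this README), alongside `torch`, `transformers`, and `accelerate`.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them if your installation differs.
    required = ["ncc", "torch", "transformers", "accelerate"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages: {}".format(", ".join(missing)))
    else:
        print("All required packages are importable.")
```

The check only looks the packages up and does not import them, so it stays fast even when `torch` is installed.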

🚀 Quick Start

Example 1: Code Generation

  1. Download the model checkpoint

    First, download the checkpoint of a specific large code model. For this example, we use Codellama-7B.

  2. Prepare the testing dataset

    Create a JSON file containing your test cases in the following format:

    [
      {"input": "this is a"},
      {"input": "from tqdm import"},
      {"input": "def calculate("},
      {"input": "a = b**2"},
      {"input": "torch.randint"},
      {"input": "x = [1,2"}
    ]
  3. Running the code generation scripts

    1. Initialize the task with the specific model and GPU device:

      print('Initializing GenerationTask')
      task = GenerationTask(task_name="codellama_7b_code", device="cuda:0")
    2. Load the downloaded checkpoint into the task. Replace ckpt_path with the path to your downloaded checkpoint:

      print('Loading model weights [{}]'.format(ckpt_path))
      task.from_pretrained(ckpt_path)
    3. Load your dataset. Replace dataset_path with the path to your dataset file:

      print('Processing dataset [{}]'.format(dataset_path))
      task.load_dataset(dataset_path)
    4. Run the model and output the results. Replace output_path with your desired output file path:

      task.run(output_path=output_path, batch_size=1, max_length=50)
      print('Output file: {}'.format(output_path))
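
The test file described in step 2 can also be generated and checked programmatically. The following is a minimal sketch using only the Python standard library. The helper names `write_test_cases` and `load_test_cases` are hypothetical and not part of NaturalCC; the only assumption about the format is what step 2 shows, namely a JSON array of objects that each carry a string `"input"` field.

```python
import json


def write_test_cases(prompts, path):
    """Write prompts as a JSON array of {"input": ...} objects."""
    cases = [{"input": prompt} for prompt in prompts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cases, f, indent=2)


def load_test_cases(path):
    """Load the file back and check that every entry has a string "input"."""
    with open(path, encoding="utf-8") as f:
        cases = json.load(f)
    if not isinstance(cases, list):
        raise ValueError("expected a JSON array of test cases")
    for i, case in enumerate(cases):
        if not isinstance(case, dict) or not isinstance(case.get("input"), str):
            raise ValueError("entry {} lacks a string 'input' field".format(i))
    return cases
```

For example, `write_test_cases(["from tqdm import", "def calculate("], dataset_path)` produces a file in the format above, and `load_test_cases(dataset_path)` validates it before it is passed to `task.load_dataset`.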

Example 2: Code Summarization

  1. Download and process a dataset from the dataset directory, following the instructions in its README.md file.

    # ref: dataset/python_wan/README.md
    # download dataset
    bash dataset/python_wan/download.sh
    # clean data
    python -m dataset.python_wan.clean
    # cast data attributes into different files
    python -m dataset.python_wan.attributes_cast
    
    # ref: dataset/python_wan/summarization/README.md
    # save code tokens and docstring tokens into MMAP format
    python -m dataset.python_wan.summarization.preprocess
  2. Register your self-defined models

    • If you want to create a new model, please add your model at ncc/models and ncc/modules.

    • If your training policy is more complex than the defaults support, update your criterions and training procedure at ncc/criterions and ncc/trainers, respectively.

      Do not forget to add your self-defined modules to ncc/XX/__init__.py.

  3. Training and inference.

    • Select a task and a model from the task list and follow the instructions in its README.md to start training.
      # ref: run/summarization/transformer/README.md
      # train
      CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
      # inference
      CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt
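
The dataset preparation commands in step 1 are listed in the order they are meant to run. As a sketch, they can be chained from Python so that the sequence stops at the first failing command. The names `PREPARE_STEPS` and `run_steps` are hypothetical and not part of NaturalCC; the commands themselves are copied from step 1 and are assumed to be launched from the same working directory as the shell versions.

```python
import subprocess

# Commands copied from step 1 above, in the order they are listed.
PREPARE_STEPS = [
    ["bash", "dataset/python_wan/download.sh"],
    ["python", "-m", "dataset.python_wan.clean"],
    ["python", "-m", "dataset.python_wan.attributes_cast"],
    ["python", "-m", "dataset.python_wan.summarization.preprocess"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; check=True raises on the first failure."""
    for command in steps:
        print("Running: {}".format(" ".join(command)))
        runner(command, check=True)
```

Calling `run_steps(PREPARE_STEPS)` runs the four commands in sequence. The `runner` parameter exists only so the sequence can be dry-run or tested without executing anything.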

More detailed READMEs are also available as tutorials for getting started with NaturalCC.

📚 Dataset

NaturalCC supports a diverse range of datasets, catering to various aspects of code analysis and processing. These datasets include:

🤝 Contributor

We warmly welcome contributions to NaturalCC! Your involvement is essential for keeping NaturalCC innovative and accessible.

We're grateful to all our amazing contributors who have made this project what it is today!

💡 FAQ

If you have any questions or encounter issues, please feel free to reach out. For quick queries, you can also check our Issues page for common questions and solutions.

😘 License and Acknowledgement

License: NaturalCC is open-sourced under the MIT License. This permissive license applies not only to the toolkit itself but also to the pre-trained models provided within.

Acknowledgements: We extend our heartfelt gratitude to the broader open-source community, particularly drawing inspiration from projects like Fairseq for their advanced sequence-to-sequence models, and AllenNLP for their robust NLP components. Their groundbreaking work has been instrumental in shaping the development of NaturalCC.

📄 Citation

We're thrilled that you're interested in using NaturalCC for your research or applications! Citing our work helps us to grow and continue improving this toolkit. You can find more in-depth details about NaturalCC in our paper.

If you use NaturalCC in your research, please consider citing our paper. The BibTeX entry is:

@inproceedings{wan2022naturalcc,
  title={NaturalCC: An Open-Source Toolkit for Code Intelligence},
  author={Yao Wan and Yang He and Zhangqian Bi and Jianguo Zhang and Yulei Sui and Hongyu Zhang and Kazuma Hashimoto and Hai Jin and Guandong Xu and Caiming Xiong and Philip S. Yu},
  booktitle={Proceedings of 44th International Conference on Software Engineering, Companion Volume},
  publisher={ACM},
  year={2022}
}