bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!
Apache License 2.0

I want to fine-tune StarCoder to generate JSON rule code. #85

Open CharellKing opened 1 year ago

CharellKing commented 1 year ago

How can I customize my dataset? Here is a snippet of the JSON.

{
  "type": "page",
  "body": {
    "type": "collapse-group",
    "activeKey": [
      "1"
    ],
    "body": [
      {
        "type": "collapse",
        "key": "1",
        "header": "title 1",
        "body": "this is content 1"
      },
      {
        "type": "collapse",
        "key": "2",
        "header": "title 2",
        "body": "this is content 2"
      },
      {
        "type": "collapse",
        "key": "3",
        "header": "title 3",
        "body": "this is content 3"
      }
    ]
  }
}
ArmelRandy commented 1 year ago

Hi. Can you elaborate? The code provided in the repo is designed for tasks that can be formatted as text-to-text. The dataset is expected to have 2 columns of interest. The first one (prompt) represents a task, for example a request like "write a function which outputs the maximum of 3 numbers". The second column (completion) is expected to be the answer to the prompt; for the previous example, it would be a function that computes the maximum of 3 arguments. This setting works for text-to-code, code-to-text, code-to-code, etc.
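As a rough illustration of that two-column layout applied to JSON rules, here is a minimal sketch. The column names prompt and completion come from the description above; the helper names make_record and to_jsonl, the wording of the request, and the choice of JSON Lines as the storage format are assumptions made for this example, not something the repo prescribes.

```python
import json

def make_record(prompt, rule):
    """Pair a natural-language request with the JSON rule that answers it."""
    # The completion is stored as text, because the model is trained on strings.
    return {"prompt": prompt, "completion": json.dumps(rule, indent=2)}

# A shortened version of the JSON snippet from the question.
rule = {
    "type": "page",
    "body": {
        "type": "collapse-group",
        "activeKey": ["1"],
        "body": [
            {"type": "collapse", "key": "1",
             "header": "title 1", "body": "this is content 1"},
        ],
    },
}

# Hypothetical request text; in practice you would write one per rule.
records = [
    make_record(
        "Create a page with a collapse group containing one panel titled 'title 1'.",
        rule,
    )
]

def to_jsonl(rows):
    """Serialize the rows as JSON Lines: one record per line."""
    return "\n".join(json.dumps(row) for row in rows)

jsonl_text = to_jsonl(records)
```

Each line of jsonl_text is then one (prompt, completion) example. How many such pairs are needed, and how the requests should be phrased, depends on your data.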

sharoseali commented 1 year ago

@ArmelRandy Hi. First, thanks to the StarCoder community for their amazing work. I want to fine-tune the model for code generation on our custom programming language (it is based on Visual Basic, but enhanced for our custom software requirements). I have some code (some is stored in files, some is on GitHub, and some is documented as well). My confusion is about how to build my dataset and how to structure it. In the last comment you mentioned

The dataset provided is expected to have 2 columns of interest. The first one (prompt) represents a task, think about a request like write a function which outputs the maximum of 3 numbers. The second column (completion) is expected to be the answer to the prompt, in the case of the previous example it would be a function which computes the maximum between 3 arguments. The setting is adapted to text-to-code, code-to-text, code-to-code etc.

So what exact format should we adopt to build the dataset: just the 2 columns, or some other format? I tried to decode the Visual Basic data (pt format) from your repository and read it using pandas. Here is a screenshot after saving it to Excel.


[Screenshot: startCode_Data]

So can you recommend how to design my custom code generator, which should help with fixing code syntax and generating new code? Thanks.

ArmelRandy commented 1 year ago

Hi @sharoseali. I understand that you want to fine-tune the model for code generation on your custom programming language. First, you need to know what we call pre-training. During pre-training, the model is trained to perform next-token prediction. This is typically what you want if you want to teach StarCoder a new programming language, for example.

For pre-training, you need just one column. Look at the StarCoder training dataset: it has a column content that is used during pre-training. Its content is processed and tokenized in order to apply next-token prediction. You'll need to create a dataset with a similar column, containing file contents, issue contents, documentation, etc. If you want to do a sort of pre-training (which is really a fine-tuning, because you will not train StarCoder from scratch) with this code, you'll have to modify it to get rid of the argument output_column_name, and certainly the function prepare_sample_text, etc.
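To make the single-column idea concrete, here is a minimal sketch that gathers files from a local folder into records with one content field. The function names collect_contents and write_jsonl, the file extensions, and the JSON Lines output are all assumptions for illustration; adapt them to wherever your code and documentation actually live.

```python
import json
from pathlib import Path

def collect_contents(root, extensions=(".vb", ".md", ".txt")):
    """Return one {"content": ...} record per matching file under root."""
    records = []
    for path in sorted(Path(root).rglob("*")):
        # The extension list is hypothetical; use the ones your language needs.
        if path.is_file() and path.suffix.lower() in extensions:
            text = path.read_text(encoding="utf-8", errors="replace")
            if text.strip():  # skip empty files
                records.append({"content": text})
    return records

def write_jsonl(records, out_path):
    """Write the records as JSON Lines: one file's content per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

For example, write_jsonl(collect_contents("my_code_folder"), "train.jsonl") would produce a file with a single content column, where my_code_folder is a placeholder path.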

This repository is about instruction fine-tuning (IFT). IFT is basically fine-tuning a model on conversations, typically a set of pairs (question, answer). e.g.

 ('write a function hello_world in python', 'def hello_world(): ...') 

This is why the code (finetune.py) requires you to provide a dataset with 2 columns of interest, the first for the questions and the second for the corresponding answers. If you can frame your dataset in such a format, you'll be able to use this code as it is. Otherwise, you'll have to modify it a bit and work on just one content column. Check out OctoCoder's repository; it should work fine if you don't provide --output_column_name.
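The difference between the two modes can be sketched as one small formatting function. This is not the code from finetune.py: format_sample is a hypothetical name, and the "Question:/Answer:" template is an assumption chosen for illustration.

```python
def format_sample(example, input_column_name="prompt", output_column_name=None):
    """Turn one dataset row into the text the model is trained on."""
    if output_column_name is None:
        # Single-column mode: train on the raw text (e.g. a "content" column).
        return example[input_column_name]
    # Two-column mode: join question and answer with an assumed template.
    question = example[input_column_name]
    answer = example[output_column_name]
    return f"Question: {question}\n\nAnswer: {answer}"

# Two-column (instruction fine-tuning) usage.
ift_text = format_sample(
    {"prompt": "write a function hello_world in python",
     "completion": "def hello_world(): ..."},
    output_column_name="completion",
)

# One-column (continued pre-training) usage.
raw_text = format_sample({"content": "Sub Main()\nEnd Sub"},
                         input_column_name="content")
```

In both cases the resulting string is tokenized and trained with next-token prediction; only the way the text is assembled changes.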

AIAnytime commented 1 year ago

How can I fine-tune the "Starchat-beta" model on my dataset? I cannot find a single detailed explanation of fine-tuning starchat-beta (not starcoder). Can you share a link for Starchat Beta? @ArmelRandy