QwenLM / CodeQwen1.5

CodeQwen1.5 is the code version of Qwen, the large language model series developed by Qwen team, Alibaba Cloud.

Does the CodeQwen1.5 model support cross-file (repository-level) infilling? #25

Closed Lanyu123 closed 2 months ago

Lanyu123 commented 2 months ago
          https://github.com/QwenLM/CodeQwen1.5?tab=readme-ov-file#3-repository-level-code-completion

Originally posted by @huybery in https://github.com/QwenLM/CodeQwen1.5/issues/24#issuecomment-2069027936

Lanyu123 commented 2 months ago

[image]

Lanyu123 commented 2 months ago

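Hello, I may not have described my question clearly. The link you gave covers repository-level (cross-file) code completion, but in that example the file being completed has only its first half and no second half, so it is not infilling. What I am looking for is infilling on the current file within a repository-level prompt, where the model takes into account both the cross-file content and the context before and after the insertion point in the current file, i.e. repository-level + infilling. How should the prompt be constructed for this? The documentation gives no example. I tried building the prompt as follows: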

input_text = """<fim_prefix><reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)

    # Set up a student
    student = Student("Alice", "S1")

    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

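However, this does not seem to work. Does the model support this repository-level + infilling mode, and how should I construct the prompt? I would appreciate a reply. Thank you very much!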

mechigonft commented 2 months ago

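Indeed, I have the same question. Plain completion only sees the preceding context and nothing after it, whereas fill-in-the-middle predicts code from both the preceding and the following context.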

cyente commented 2 months ago

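The model does support cross-file infilling; we will add this case to the examples later.

As for the format, <fim_prefix> marks the preceding context of the file to be infilled, so it belongs at the start of that file rather than at the start of the whole prompt. The format is as follows: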

input_text = """<reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
<fim_prefix>from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)

    # Set up a student
    student = Student("Alice", "S1")

    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

The expected model output is:

Generated text:     book = library.find_book("1234567890")

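
A minimal sketch of the layout described above, for readers who want to assemble such prompts programmatically. The helper name `build_repo_fim_prompt` and its parameters are made up for illustration; only the special tokens and their order are taken from the example in this thread, and the exact whitespace the model expects is an assumption.

```python
def build_repo_fim_prompt(repo_name, context_files, target_path, prefix, suffix):
    """Assemble a repository-level fill-in-the-middle prompt.

    context_files: list of (path, source) pairs for the other files in the repo.
    prefix / suffix: text of the target file before / after the gap to fill.
    """
    parts = [f"<reponame>{repo_name}\n"]
    for path, source in context_files:
        # Each context file is introduced by <file_sep> followed by its path.
        parts.append(f"<file_sep>{path}\n{source}\n")
    # <fim_prefix> opens only the file being infilled, right after its path line.
    parts.append(
        f"<file_sep>{target_path}\n"
        f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    )
    return "".join(parts)


# Hypothetical, shortened version of the library-system example.
prompt = build_repo_fim_prompt(
    "library-system",
    [("library.py", "class Library:\n    pass\n")],
    "main.py",
    prefix="from library import Library\n\nlibrary = Library()\n# Student borrows a book",
    suffix="\nprint(book)\n",
)
print(prompt)
```

Keeping the prefix and suffix as separate arguments makes it harder to repeat the mistake in the original question, where <fim_prefix> was placed before <reponame>.
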
Lanyu123 commented 2 months ago

Understood, thanks for the clarification.

mechigonft commented 2 months ago

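Do you support generating the code segment between two comments? For example:

// comment 1
// comment 2

In that case, can the model generate only the code that follows comment 1, rather than generating all the way to the end of the method?
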
cyente commented 2 months ago

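In theory, anything that conforms to the FIM structure is supported; how well it works in practice is something you will need to try.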

mechigonft commented 2 months ago

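@cyente Hello, I just tested the case I mentioned, "generate the code in between two comments". The results have a good side and a bad side. First, here is my generated output: [image]

The good: the generated query line is very good, high quality, and basically usable as-is: List couponInstanceList = couponInstanceDAO.getCouponByAccountNo(listCouponBySpec.getMerchantId(), distributeSource, CouponStatusEnum.UN_USE.getStatus());

The bad: I expected only the code between the two comments, that is, only the "query" logic. Instead, the model also returned the code for the second comment, the "validation" code. From this I conclude that the model tends to generate code from fim_prefix to the end of the whole method, rather than recognizing that I only want it to generate up to the fim_suffix I specified. As the generated code shows, the model does a lot of "extra work". [image]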

mechigonft commented 2 months ago

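The code the model generated is actually very long and includes a lot of extra work such as validation and conversion. None of that is what I wanted; I only want the model to generate the code up to the "validation" comment.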

mechigonft commented 2 months ago

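The red box is the code I expected; the blue box is the extra work the model did: validation, conversion, other queries, and so on. In the example you gave above, the expected output is: Generated text: book = library.find_book("1234567890"), which is a single line of code. In my tests, however, the model tends to generate a great deal of code, sometimes even exceeding the max token limit and getting truncated. [image]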

cyente commented 2 months ago

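Looking at the example in the screenshot above, shouldn't the following context contain something more besides the comment?

My guess is that if you fill in the content that comes after the suffix, it will solve the problem you describe of the model continuing to generate extra content.

If that still does not work, I suggest some post-processing; for example, keeping only the first line of the output would meet your needs.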
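
A minimal sketch of the post-processing idea above, assuming the unwanted output begins where the model reproduces the first line of the suffix. The function name `truncate_completion`, the optional line cap, and the Java-like sample strings are all made up for illustration and do not come from the CodeQwen1.5 repository.

```python
def truncate_completion(generated, suffix, max_lines=None):
    """Cut a FIM completion at the point where it starts repeating the suffix.

    generated: raw text the model returned for the <fim_middle> slot.
    suffix: the text that was placed after <fim_suffix> in the prompt.
    max_lines: optional hard cap on the number of lines kept.
    """
    # Use the first non-blank line of the suffix as the stop marker.
    marker = next((ln.strip() for ln in suffix.splitlines() if ln.strip()), None)
    kept = []
    for line in generated.splitlines():
        if marker is not None and line.strip() == marker:
            break  # the model has run into the suffix; drop everything after
        if max_lines is not None and len(kept) >= max_lines:
            break  # fall back to a fixed line budget
        kept.append(line)
    return "\n".join(kept)


# Hypothetical model output that runs past the second comment.
generated = "List<Coupon> list = dao.query(id);\n// validate\nif (list.isEmpty()) { return; }"
suffix = "\n    // validate\n"
print(truncate_completion(generated, suffix))
```

Cutting at the suffix marker is somewhat more flexible than keeping a fixed number of lines, but it only helps when the model actually echoes that line; the `max_lines` cap covers the case where it does not.
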

cyente commented 2 months ago

Also, there is the parameter that controls the maximum generation length:

sampling_params = SamplingParams(temperature=xx, top_p=xx, repetition_penalty=xxx, max_tokens=256)

mechigonft commented 2 months ago

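@cyente Hello. In my case there really is only a single comment between fim_suffix and fim_middle, with no other code, and there should not be any, because I am simulating how a real programmer writes code: first write the comments as a skeleton, then fill in the code under each comment, until the whole program is written. So I want the model to generate the code between the two comments, that is, the code for the first comment, stopping at the second comment.

Your suggestion of truncating to the first n lines is workable, but it is an engineering workaround that is rather rigid and inflexible.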

cyente commented 2 months ago

Your second comment is an explicit instruction with no code following it, which may be confusing the model. It is worth experimenting further.

mechigonft commented 2 months ago

Thanks for the clarification 🙏

mechigonft commented 2 months ago

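Does fill-in-the-middle inference support adding an instruction? For example, adding: "Refer to the surrounding code context and generate only the code between the two comments."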

mechigonft commented 2 months ago

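My understanding is that fill-in-the-middle is not a conversational inference mode; rather, a backend script assembles the input into the fill-in-the-middle format, whereas instruct/prompt is conversational inference. Can the two be combined?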

mechigonft commented 2 months ago

If they can be combined, could I use an instruction to tell the model not to generate too much code and to stop at the next comment?

cyente commented 2 months ago

For conversational use, we recommend Qwen/CodeQwen1.5-7B-Chat.

mechigonft commented 2 months ago

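What I mean is: prompt = instruct + fill in the middle. For example, prompt: "Refer to the code context I provide and generate only the code between the two comments; do not generate extra code."

// comment 1
// comment 2

Can the instruction mode and the fill-in-the-middle mode be combined like this?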