CodeQwen1.5模型支持跨文件级别的infilling续写吗？

Lanyu123 commented 2 months ago

          https://github.com/QwenLM/CodeQwen1.5?tab=readme-ov-file#3-repository-level-code-completion

Originally posted by @huybery in https://github.com/QwenLM/CodeQwen1.5/issues/24#issuecomment-2069027936

Lanyu123 commented 2 months ago

你好，可能我的问题没有描述清楚。你链接给的是对跨文件级别（repository level）的代码文件的续写，例子中的当前续写文件只有上半段，没有下半段，不是infilling的续写方式。我寻求的是在repository level级别的文件中，对当前文件做infilling代码续写，既要考虑跨文件内容，也要考虑当前文件的上下文，即在repository level+infilling的代码续写方式，这种要怎么组建prompt呢？文档里没有给出例子，我尝试用以下的prompt构建方式：

input_text = """<fim_prefix><reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)

    # Set up a student
    student = Student("Alice", "S1")

    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

但是似乎模型并不奏效，请问模型支持这种repository level+infilling的续写方式吗？我该怎么构建prompt呢？望请回复，十分感谢！

mechigonft commented 2 months ago

确实，我也想问这个问题，续写是只有上文信息，没有下文信息，而fill-in-the-middle模式，是基于上文和下文预测代码

cyente commented 2 months ago

跨文件级别的

你好，可能我的问题没有描述清楚。你链接给的是对跨文件级别（repository level）的代码文件的续写，例子中的当前续写文件只有上半段，没有下半段，不是infilling的续写方式。我寻求的是在repository level级别的文件中，对当前文件做infilling代码续写，既要考虑跨文件内容，也要考虑当前文件的上下文，即在repository level+infilling的代码续写方式，这种要怎么组建prompt呢？文档里没有给出例子，我尝试用以下的prompt构建方式：

input_text = """<fim_prefix><reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)

    # Set up a student
    student = Student("Alice", "S1")

    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

但是似乎模型并不奏效，请问模型支持这种repository level+infilling的续写方式吗？我该怎么构建prompt呢？望请回复，十分感谢！

跨文件级别的infilling的格式，模型是支持的，我们后续会将这个样例加入example。

具体格式上，<fim_prefix>指示的是需要infilling的文件的上文，因此格式如下：

input_text = """<reponame>library-system
<file_sep>library.py
class Book:
    def __init__(self, title, author, isbn, copies):
        self.title = title
        self.author = author
        self.isbn = isbn
        self.copies = copies

    def __str__(self):
        return f"Title: {self.title}, Author: {self.author}, ISBN: {self.isbn}, Copies: {self.copies}"

class Library:
    def __init__(self):
        self.books = []

    def add_book(self, title, author, isbn, copies):
        book = Book(title, author, isbn, copies)
        self.books.append(book)

    def find_book(self, isbn):
        for book in self.books:
            if book.isbn == isbn:
                return book
        return None

    def list_books(self):
        return self.books

<file_sep>student.py
class Student:
    def __init__(self, name, id):
        self.name = name
        self.id = id
        self.borrowed_books = []

    def borrow_book(self, book, library):
        if book and book.copies > 0:
            self.borrowed_books.append(book)
            book.copies -= 1
            return True
        return False

    def return_book(self, book, library):
        if book in self.borrowed_books:
            self.borrowed_books.remove(book)
            book.copies += 1
            return True
        return False

<file_sep>main.py
<fim_prefix>from library import Library
from student import Student

def main():
    # Set up the library with some books
    library = Library()
    library.add_book("The Great Gatsby", "F. Scott Fitzgerald", "1234567890", 3)
    library.add_book("To Kill a Mockingbird", "Harper Lee", "1234567891", 2)

    # Set up a student
    student = Student("Alice", "S1")

    # Student borrows a book<fim_suffix>
    if student.borrow_book(book, library):
        print(f"{student.name} borrowed {book.title}")
    else:
        print(f"{student.name} could not borrow {book.title}")

    # Student returns a book
    if student.return_book(book, library):
        print(f"{student.name} returned {book.title}")
    else:
        print(f"{student.name} could not return {book.title}")

    # List all books in the library
    print("All books in the library:")
    for book in library.list_books():
        print(book)

if __name__ == "__main__":
    main()<fim_middle>
"""

模型期望生成结果如下:

Generated text:     book = library.find_book("1234567890")

Lanyu123 commented 2 months ago

好的明白，感谢答疑

mechigonft commented 2 months ago

请问，你们是否支持2个注释之间的代码段生成？比如

// 注释1 // 注释2 这样的话，模型是不是能够做到只生成注释1后续的代码段，而不会直接生成到方法的最后？

cyente commented 2 months ago

只要是符合fim结构的的格式理论上都支持，具体实践效果需要尝试

mechigonft commented 2 months ago

@cyente 你好，我刚刚测试了一下我说的“根据2个注释生成中间代码”的case，效果有好的一面有坏的一面，首先看一下我的生成结果：好的一方面：这行查询的代码我觉得生成的非常不错，质量很高，基本可以直接用：List couponInstanceList = couponInstanceDAO.getCouponByAccountNo(listCouponBySpec.getMerchantId(), distributeSource, CouponStatusEnum.UN_USE.getStatus()); 坏的一方面：我期望生成的是两段注释中间的代码，也就是说，我其实只想要“查询”逻辑的代码，没想到，模型，直接还给我返回了第二个注释的代码，也就是“校验”的代码从这个现象中，我发现模型倾向于生成“从fim_prefix到整个方法结束的代码”，而不会准确识别我只是希望它生成到我指定的下文fim_suffix的位置。从生成的代码中可以看出，模型会做很多“多余的工作”

mechigonft commented 2 months ago

其实模型生成的代码很长，有很多多余的工作：校验、转换等，这些并不是我期望的事情，我只希望模型生成直到“校验注释”之前的代码

mechigonft commented 2 months ago

红色框是我期望生成的代码，蓝色框是模型做的多余的工作，校验、转换、其他查询.....，我看你上述给出的代码中，模型期望生成结果如下: Generated text: book = library.find_book("1234567890") 只生成了一行代码，而我这边测试发现模型倾向于生成“非常多”代码，甚至会超过maxtoken而截断

cyente commented 2 months ago

@cyente 你好，我刚刚测试了一下我说的“根据2个注释生成中间代码”的case，效果有好的一面有坏的一面，首先看一下我的生成结果：好的一方面：这行查询的代码我觉得生成的非常不错，质量很高，基本可以直接用：List couponInstanceList = couponInstanceDAO.getCouponByAccountNo(listCouponBySpec.getMerchantId(), distributeSource, CouponStatusEnum.UN_USE.getStatus()); 坏的一方面：我期望生成的是两段注释中间的代码，也就是说，我其实只想要“查询”逻辑的代码，没想到，模型，直接还给我返回了第二个注释的代码，也就是“校验”的代码从这个现象中，我发现模型倾向于生成“从fim_prefix到整个方法结束的代码”，而不会准确识别我只是希望它生成到我指定的下文fim_suffix的位置。从生成的代码中可以看出，模型会做很多“多余的工作”

看上面截图的例子里面，后面，除了注释以外，下文当中应该还包含了一些内容？

我猜测，你将suffix后的内容补上，会解决，你说的，它持续往下生成多余内容的问题；

如果还不行的话，建议采用一些后处理，比如只截断第一行的内容就可以满足需求

cyente commented 2 months ago

以及，控制最大生成长度的参数

sampling_params = SamplingParams(temperature=xx, top_p=xx, repetition_penalty=xxx, max_tokens=256)

mechigonft commented 2 months ago

@cyente 哈喽，你好，是这样的，我的fim_suffix后面直到fim_middle，确实就是只有一个注释了，没有其他代码内容，也不该有其他代码内容，因为我这是在模拟一个真实程序员的写代码的逻辑：先写注释框架，再针对每个注释补全代码段，最终完成整个代码的编写。所以，我希望模型就是根据2段注释给我生成中间代码，也就是第一个注释的代码，到第二个注释为止。

你给的截断前n行的建议是可行的，只不过工程解法，比较生硬不灵活

cyente commented 2 months ago

@cyente 哈喽，你好，是这样的，我的fim_suffix后面直到fim_middle，确实就是只有一个注释了，没有其他代码内容，也不该有其他代码内容，因为我这是在模拟一个真实程序员的写代码的逻辑：先写注释框架，再针对每个注释补全代码段，最终完成整个代码的编写。所以，我希望模型就是根据2段注释给我生成中间代码，也就是第一个注释的代码，到第二个注释为止。

你给的截断前n行的建议是可行的，只不过工程解法，比较生硬不灵活

您第二个注释是一个明确的指令，后面没有接任何代码解法，可能给模型带了一些困惑。可以多去尝试。

mechigonft commented 2 months ago

感谢答疑🙏

mechigonft commented 2 months ago

fill in middle的推理方式，支持加上instruct吗？比如，我加上指令:请参考代码上下文，只生成两段注释中间的代码

mechigonft commented 2 months ago

我理解fill in middle并不是对话式的推理方式，而是偏后台脚本解析成fill in middle的格式，而instruct/prompt这种是对话式的推理，那这两者能够融合使用吗？

mechigonft commented 2 months ago

如果可以融合的话，那我是不是可以通过指令的方式让模型知道不要生成太多的代码，生成到下一个注释为止就好

cyente commented 2 months ago

对话式模型建议使用Qwen/CodeQwen1.5-7B-Chat

mechigonft commented 2 months ago

我的意思是这样的：prompt = instruct + fill in the middle prompt：请参考我提供的代码上下文，只生成两段注释中间的代码，不要生成多余代码。

// 注释1 // 注释2 这种，能不能把指令和fill in the middle两个模式结合使用

QwenLM / CodeQwen1.5

CodeQwen1.5模型支持跨文件级别的infilling续写吗？ #25