deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

Repo level concatenation of data #43

Open Bytes-Explorer opened 11 months ago

Bytes-Explorer commented 11 months ago

Can you share more details on the technique used for the repo-level concatenation part?

guoday commented 11 months ago

We first parse the dependencies between files, e.g., A->B, B->C, B->D. Then we rearrange the file positions based on their dependencies, e.g., A, B, C, D. As for file paths, we add them to each code file as a comment. An example is shown in https://github.com/deepseek-ai/DeepSeek-Coder#4-repository-level-code-completion
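
For illustration, a minimal sketch of one way to recover such file-level dependencies for Python by scanning import statements (a simplified stand-in rather than the exact parser; it only matches flat, top-level module names):

import ast
from pathlib import Path

def python_dependency_graph(repo_root):
    # map module name -> file path for every .py file in the repo
    # (simplified: only flat, top-level module names are matched)
    files = {p.stem: p for p in Path(repo_root).rglob("*.py")}
    graph = {path: [] for path in files.values()}  # edge: dependency -> dependent
    for path in files.values():
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module.split(".")[0]]
            else:
                continue
            for name in names:
                dep = files.get(name)
                if dep is not None and dep != path:
                    graph[dep].append(path)  # `path` imports (depends on) `dep`
    return graph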

Bytes-Explorer commented 11 months ago

Thank you for your response. Is this done for all languages in the data?

guoday commented 11 months ago

Only for Python, Java, C#, C, and C++.

Bytes-Explorer commented 11 months ago

Thank you @guoday

Bytes-Explorer commented 11 months ago

@guoday Do you then do repo level dedup for all programming languages or just the above languages?

guoday commented 11 months ago

Just the above languages. The other languages use file-level dedup.

Bytes-Explorer commented 11 months ago

@guoday Thank you for your prompt responses. I was curious if you did any ablation studies/evaluations to understand if repo level concatenation helped the model performance in a significant way.

guoday commented 11 months ago

Not yet. We will try to evaluate the model on repo-level benchmarks. On function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.

Bytes-Explorer commented 11 months ago

Do you have your own repo-level benchmark, or do you use a standard one?

guoday commented 11 months ago

We will use public datasets like RepoCoder and CrossCodeEval to evaluate.

Bytes-Explorer commented 11 months ago

OK, thanks, I was aware of those. Once again, I appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!

Casi11as commented 11 months ago

[attached image: parsed dependency graph of files A-G referenced below]

Hello, I would like to know the details of the data concatenation. Assuming the parsed dependency structure is as shown in the picture, what is the concatenation result? Is it ACF, ADF, ADG, BCF, BDF, BDG, BE, i.e., 7 pieces?

guoday commented 11 months ago

First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.
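
For illustration, a minimal sketch of this greedy ordering (not the exact production code); the example edges are inferred from the paths listed in the question, i.e. A->C, A->D, B->C, B->D, B->E, C->F, D->F, D->G:

def dependency_order(graph):
    # repeatedly pick a remaining node with the smallest in-degree,
    # breaking ties arbitrarily; recompute in-degrees after each removal
    remaining = set(graph) | {v for deps in graph.values() for v in deps}
    order = []
    while remaining:
        in_degree = {n: 0 for n in remaining}
        for u in remaining:
            for v in graph.get(u, ()):
                if v in remaining:
                    in_degree[v] += 1
        chosen = min(remaining, key=lambda n: in_degree[n])
        order.append(chosen)
        remaining.remove(chosen)
    return order

# edges point from a file to the files that depend on it
example = {"A": ["C", "D"], "B": ["C", "D", "E"], "C": ["F"], "D": ["F", "G"]}
print(dependency_order(example))
# one possible order (ties are broken arbitrarily): ['B', 'A', 'C', 'D', 'F', 'G', 'E']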

Casi11as commented 11 months ago

> First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.

In other words, will all the files of the same language in a repo be concatenated into a single sample?

guoday commented 11 months ago

Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph treated as a separate sample.

Casi11as commented 11 months ago

Thanks! So what are the rules for dividing into subgraphs? Taking the picture I posted above as an example, what subgraphs would it be divided into?

slamandar commented 11 months ago

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?

If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating different files end to end seems somewhat peculiar.

guoday commented 11 months ago

The term "independent subgraph" refers to a weakly connected subgraph. First, convert the directed graph into an undirected graph, and then divide the graph into multiple connected subgraphs. That is, in each subgraph, any two vertices should be connected by edges within the subgraph. In your example, it is a connected subgraph, with only one subgraph, which is itself. The following is the code to divide the graph into subgraphs.

from collections import defaultdict

# convert the directed graph into an undirected graph
def to_undirected(graph):
    undirected_graph = defaultdict(set)
    for node in graph:
        undirected_graph[node]  # ensure isolated nodes appear in the result
        for neighbor in graph[node]:
            undirected_graph[node].add(neighbor)
            undirected_graph[neighbor].add(node)
    return undirected_graph

# Use DFS to find all connected subgraphs.
def dfs(graph, node, visited, subgraph):
    visited[node] = True
    subgraph.add(node)
    for neighbor in graph[node]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, subgraph)

# obtain all subgraphs
def get_subgraphs(graph):
    undirected_graph = to_undirected(graph)
    visited = {node: False for node in undirected_graph}
    subgraphs = []
    for node in undirected_graph:
        if not visited[node]:
            subgraph = set()
            dfs(undirected_graph, node, visited, subgraph)
            subgraphs.append(subgraph)
    return subgraphs
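
For example, applying get_subgraphs to the graph above (same inferred edges) yields a single weakly connected component, and therefore one sample:

print(get_subgraphs({"A": ["C", "D"], "B": ["C", "D", "E"], "C": ["F"], "D": ["F", "G"]}))
# -> [{'A', 'B', 'C', 'D', 'E', 'F', 'G'}], i.e. one subgraph and therefore one concatenated sample
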
guoday commented 11 months ago

> Regarding repo-level concatenation, I have a related question.
>
> In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?
>
> If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating different files end to end seems somewhat peculiar.

If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

guoday commented 11 months ago

> Regarding repo-level concatenation, I have a related question. In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a? If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating different files end to end seems somewhat peculiar.
>
> If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in https://github.com/deepseek-ai/DeepSeek-Coder/issues/43#issuecomment-1831433765, the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.
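
To make this concrete, here is a small illustrative sketch (PyTorch is used here only for illustration, not as a description of the actual training code): the concatenated sample is trained with the ordinary lower-triangular causal mask, as opposed to a per-file block-diagonal mask that would block cross-file attention.

import torch

def causal_mask(seq_len):
    # plain causal mask: every token may attend to every earlier token,
    # regardless of which file in the repository it came from
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def per_file_causal_mask(file_lengths):
    # the alternative NOT used here: causal attention restricted to
    # tokens from the same file (block-diagonal structure)
    seq_len = sum(file_lengths)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    start = 0
    for n in file_lengths:
        mask[start:start + n, start:start + n] = causal_mask(n)
        start += n
    return mask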

slamandar commented 11 months ago

Thank you for your prompt and detailed response!

One last question.

In one sample, is there a need for a special token between the concatenated files, so that the model can distinguish that there are multiple files and, in some downstream scenarios, avoid generating code like "import package" after the main content?

guoday commented 11 months ago

In fact, some separator is required. However, we incorporate comments such as #utils.py and #model.py before each file to indicate to the model that the code completion is at the repository level.
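
For illustration, a minimal sketch of that formatting convention (the helper name and the file contents are hypothetical; only the leading path comment per file reflects the convention described above):

def concat_repo_sample(ordered_files):
    # ordered_files: (relative_path, source_text) pairs, already in dependency order
    parts = []
    for path, text in ordered_files:
        parts.append(f"#{path}\n{text.rstrip()}\n")
    return "\n".join(parts)

sample = concat_repo_sample([
    ("utils.py", "def load_data():\n    ..."),
    ("model.py", "from utils import load_data\n\nclass Model:\n    ..."),
])
print(sample)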

slamandar commented 11 months ago

Completely understand. Thanks again for your quick response!

Bytes-Explorer commented 11 months ago

@guoday I was also wondering what you do with the other files, like build files or metadata files? Thanks

vaisaxena commented 11 months ago

@guoday Thanks for the details above. It was quite helpful. One follow-up question.

Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

dongs0104 commented 10 months ago

> Regarding repo-level concatenation, I have a related question. In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a? If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating different files end to end seems somewhat peculiar.
>
> If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.
>
> In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Do you run get_subgraphs and then re-order at the repo level again?

guoday commented 10 months ago

> Regarding repo-level concatenation, I have a related question. In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a? If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating different files end to end seems somewhat peculiar.
>
> If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.
>
> In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.
>
> @guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Do you run get_subgraphs and then re-order at the repo level again?

Nodes A and B are connected in the undirected graph, which means they belong to the same input sequence. If A and B have similar contents, B can leverage the content of A as additional context to enhance the completion process (assuming that B follows A in the sequence). We do not re-order these nodes.

reignianor commented 10 months ago

Truly remarkable work! I am curious about the advantages of repo-level concatenation in your training process. Do you first pre-train on file-level code (with a 4K window) and then continue training on repo-level code (with a 16K window)? What about pre-training on repo-level code with a 4K window first?

zte-tcb commented 10 months ago

> @guoday Thanks for the details above. It was quite helpful. One follow-up question.
>
> Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

I have the same doubts.

juncaofish commented 9 months ago

> @guoday Thanks for the details above. It was quite helpful. One follow-up question. Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.
>
> I have the same doubts.

Actually, couldn't the file dependencies within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, would cause a circular reference problem, so it seems impossible.

guoday commented 9 months ago

> @guoday Thanks for the details above. It was quite helpful. One follow-up question. Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.
>
> I have the same doubts.
>
> Actually, couldn't the file dependencies within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, would cause a circular reference problem, so it seems impossible.

The algorithm employs a modified topological sort. Unlike the standard approach, which selects nodes with zero in-degree, this algorithm selects nodes with minimal in-degree, which allows it to handle cycles within the graph.
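
For illustration, the same greedy minimal-in-degree rule also resolves this cyclic case; a compact, self-contained sketch (the helper name is hypothetical):

def order_with_cycles(graph):
    # pick the remaining node with the smallest in-degree at each step;
    # a cycle never blocks progress because some node always has the minimum
    remaining = set(graph) | {v for deps in graph.values() for v in deps}
    order = []
    while remaining:
        in_degree = {n: sum(n in graph.get(u, ()) for u in remaining) for n in remaining}
        node = min(remaining, key=lambda n: in_degree[n])
        order.append(node)
        remaining.remove(node)
    return order

print(order_with_cycles({"A": ["B"], "B": ["C"], "C": ["A"]}))
# a strict topological sort would fail on this cycle; here one valid
# output is ['A', 'B', 'C'] (ties broken arbitrarily)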

slamandar commented 9 months ago

In the process of reproducing repository-level data concatenation, I have a question.

Is the file-level data or the unparsed-language data (i.e., excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

kail8 commented 8 months ago

Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

guoday commented 8 months ago

> In the process of reproducing repository-level data concatenation, I have a question.
>
> Is the file-level data or the unparsed-language data (i.e., excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?

For unparsed-language data, or for repository-level code that exceeds 32KB, we split it into file-level data for use in the continued pre-training.
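
A minimal sketch of that fallback rule (the 32KB threshold comes from the answer above; the helper name and byte-based length check are assumptions):

def to_training_samples(ordered_files, max_bytes=32 * 1024):
    # ordered_files: (relative_path, source_text) pairs for one dependency subgraph
    joined = "\n".join(f"#{path}\n{text}" for path, text in ordered_files)
    if len(joined.encode("utf-8")) <= max_bytes:
        return [joined]  # keep the whole subgraph as one repository-level sample
    # otherwise fall back to one file-level sample per file
    return [f"#{path}\n{text}" for path, text in ordered_files]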

guoday commented 8 months ago

> Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday

For unparsed-language data, or for repository-level code that exceeds 32KB, we split it into file-level data for use in the continued pre-training.

kail8 commented 8 months ago

> Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as torch or transformers), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday
>
> For unparsed-language data, or for repository-level code that exceeds 32KB, we split it into file-level data for use in the continued pre-training.

Got it. Thanks for your quick response!

Calvinnncy97 commented 7 months ago

May I know how the dependencies are parsed?

virtualzx-nad commented 7 months ago

Hi @guoday! When I use the model, how do I structure my repo in the prompt to take advantage of DeepSeek's understanding of repo structures? How should I separate different files in the same repo, and how do I denote filenames? My repo also contains different languages, so just adding # filename.py doesn't seem to be good enough.