Bytes-Explorer opened 11 months ago
We first parse the dependencies between files, e.g. A->B, B->C, B->D, and then rearrange the file positions based on those dependencies, e.g. A, B, C, D. For file paths, we add them to each code file as a comment. An example is shown in https://github.com/deepseek-ai/DeepSeek-Coder#4-repository-level-code-completion
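For illustration (with hypothetical file names; the authoritative format is the one shown in the linked README), a repo-level sample for two files where `model.py` imports from `utils.py` could look like this after reordering and adding path comments:

```python
# utils.py
def tokenize(text):
    return text.split()

# model.py
from utils import tokenize

def predict(text):
    # utils.py precedes model.py because model.py depends on it
    return len(tokenize(text))
```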
Thank you for your response. Is this done for all languages in the data?
Only for Python, Java, C#, C, and C++.
Thank you @guoday
@guoday Do you then do repo-level dedup for all programming languages or just the above languages?
Just the above languages. Other languages use file-level dedup.
@guoday Thank you for your prompt responses. I was curious if you did any ablation studies/evaluations to understand whether repo-level concatenation helped model performance in a significant way.
Not yet. We will try to evaluate the model on repo-level benchmarks. On function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.
Do you have your own repo-level benchmark or do you use a standard one?
We will use public datasets like RepoCoder and CrossCodeEval to evaluate.
Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!
Hello, I would like to know the details of the data concatenation. Assuming the structure of the parsed dependencies is as in the picture, what are the concatenation results? Is it ACF, ADF, ADG, BCF, BDF, BDG, BE (7 pieces)?
First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example, there are many possibilities, one of which could be BACDFGE.
In other words, will all the files of the same language in a repo be concatenated into just one sample?
Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph regarded as a sample.
Thanks! So what are the rules for dividing into subgraphs? Taking the picture I posted above as an example, what subgraphs will it be divided into?
Regarding repo-level concatenation, I have a related question.
In a batch, one sample may contain multiple docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?
If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.
The term "independent subgraph" refers to a weakly connected subgraph. First, convert the directed graph into an undirected graph, and then divide the graph into multiple connected subgraphs. That is, in each subgraph, any two vertices should be connected by edges within the subgraph. In your example, it is a connected subgraph, with only one subgraph, which is itself. The following is the code to divide the graph into subgraphs.
```python
from collections import defaultdict

# Convert the directed graph into an undirected graph.
def to_undirected(graph):
    undirected_graph = defaultdict(set)
    for node in graph:
        undirected_graph[node]  # ensure isolated nodes are also kept
        for neighbor in graph[node]:
            undirected_graph[node].add(neighbor)
            undirected_graph[neighbor].add(node)
    return undirected_graph

# Use DFS to collect all nodes of one connected subgraph.
def dfs(graph, node, visited, subgraph):
    visited[node] = True
    subgraph.add(node)
    for neighbor in graph[node]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, subgraph)

# Obtain all connected subgraphs.
def get_subgraphs(graph):
    undirected_graph = to_undirected(graph)
    visited = {node: False for node in undirected_graph}
    subgraphs = []
    for node in undirected_graph:
        if not visited[node]:
            subgraph = set()
            dfs(undirected_graph, node, visited, subgraph)
            subgraphs.append(subgraph)
    return subgraphs
```
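As a quick check, running `get_subgraphs` on the example graph yields a single weakly connected subgraph. The edge set below is my reconstruction from the ACF/ADF/... enumeration in the question, since the picture itself isn't reproduced here:

```python
# Hypothetical reconstruction of the pictured graph:
# A->C, A->D, B->C, B->D, B->E, C->F, D->F, D->G
graph = {"A": {"C", "D"}, "B": {"C", "D", "E"}, "C": {"F"}, "D": {"F", "G"}}

print(get_subgraphs(graph))
# [{'A', 'B', 'C', 'D', 'E', 'F', 'G'}]  -- one subgraph, i.e. one sample
```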
> Regarding repo-level concatenation, I have a related question. In a batch, one sample may contain multiple docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a? If there is an attention mask, how can it serve the purpose of capturing the repository context? If there isn't, training by simply concatenating the beginning and end of different files seems somewhat peculiar.
If `file_b` depends on `file_a`, why is there a need for an attention mask to prevent `file_b` from attending to `file_a`? Conversely, if `file_b` doesn't depend on `file_a`, we wouldn't concatenate these files into a single sample.
In specific scenarios, such as the one described in https://github.com/deepseek-ai/DeepSeek-Coder/issues/43#issuecomment-1831433765, the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.
Thank you for your prompt and detailed response!
One last question.
In one sample, is there a need for a special token between the concatenated files, so that the model can distinguish that there are multiple files and avoid generating code like "import package" after the main content in some downstream scenarios?
In fact, no special token is required. Instead, we incorporate comments such as `#utils.py` and `#model.py` before each file to indicate to the model that the code completion is at the repository level.
Completely understand. Thanks again for your quick response!
@guoday I was also wondering what you do with the other files, like build files or metadata files? Thanks
@guoday Thanks for the details above. It was quite helpful. One follow up question.
Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.
@guoday But nodes `A` and `B` are not connected in the directed graph, so I have a question about this: what if A and B contain similar contents? Did you run `get_subgraphs` and then re-order at the repo level again?
Nodes `A` and `B` are connected in the undirected graph, indicating that they belong to the same input sequence. If A and B have similar contents, B can leverage the content of A as additional context to enhance the completion process (assuming that in the sequence, B follows A). We do not re-order these nodes.
Truly remarkable work! I am curious about the advantages of repo concatenation in your training process. Do you first pre-train on file-level code (with a 4K window) and then continue training on repo-level code (with a 16K window)? What about pre-training on repo-level code with a 4K window first?
> @guoday Thanks for the details above. It was quite helpful. One follow up question. Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

I have the same doubts.
Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you show, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.
The algorithm employs a modified topological sort. Unlike the standard approach that selects nodes with zero in-degrees, this algorithm selects nodes with minimal in-degrees, which allows it to handle cycles within the graph.
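A minimal sketch of such a modified topological sort, as I understand it from this description (my own reconstruction, not the authors' code; edge direction follows the convention that an edge (u, v) means v depends on u):

```python
import random

def dependency_order(edges, nodes):
    # In-degree of v = number of files v depends on (edge (u, v): v depends on u).
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        indeg[v] += 1
    remaining = set(nodes)
    order = []
    while remaining:
        # A standard topological sort requires a node with in-degree 0 and gets
        # stuck on cycles; taking the *minimal* in-degree always makes progress.
        low = min(indeg[n] for n in remaining)
        node = random.choice([n for n in remaining if indeg[n] == low])
        remaining.remove(node)
        order.append(node)
        for u, v in edges:  # emitting `node` satisfies its dependents
            if u == node and v in remaining:
                indeg[v] -= 1
    return order

# The cyclic case A->B, B->C, C->A: every node has in-degree 1, so one is
# picked at random, which breaks the cycle for the remaining nodes.
print(dependency_order([("A", "B"), ("B", "C"), ("C", "A")], ["A", "B", "C"]))
```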
In the process of reproducing repository-level data concatenation, I have a question.
Is the file-level data or the unparsed-language data (excluding Python/Java/C/C++/C#) included in the long-context continued pre-training dataset?
Hi, I'm really impressed by your advanced work. I have an extra question: in the repo-level concatenation, if one file depends on some huge files or libraries (such as `torch` or `transformers`), the concatenated sample will inevitably exceed the window size / context length. How do you deal with this problem? @guoday
For unparsed-language data or repository-level code that surpasses 32KB, we split it into file-level data for use in the continued pre-training.
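A minimal sketch of that fallback, under my own assumptions (the comment above states only the 32KB threshold, not the exact check; the helper name is hypothetical):

```python
def to_samples(ordered_files, max_bytes=32 * 1024):
    """Hypothetical helper: emit one repo-level sample if the concatenation
    fits the 32KB budget, otherwise fall back to one sample per file."""
    concatenated = "\n".join(ordered_files)
    if len(concatenated.encode("utf-8")) <= max_bytes:
        return [concatenated]
    return list(ordered_files)
```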
Got it. Thanks for your quick response!
May I know how the dependencies are parsed?
Hi @guoday! When I use the model, how do I structure my repo in the prompt to take advantage of DeepSeek's understanding of repo structures? How should I separate different files in the same repo, and how do I denote filenames? My repo also contains different languages, so just adding `# filename.py` doesn't seem to be good enough.
Can you share more details on the repo-level concatenation technique?