Hi @VecherVhatuX, thank you for your interest in our work.
- For question 1 (which tests to mark with the `is_bug` tag), my understanding of the benchmarks that we used (Defects4J and BugsInPy) is that the key criterion is whether a test reproduces the bug, in the sense that it fails on the buggy version and passes on the fixed version. The fact that a test was modified in the patch is not enough. To be clear, the actual bug-revealing tests that we used are presented in the `failing_tests` files; for tests, the `is_bug` tag does not do anything (it is meant as a marker for incorrect production code methods).
- For question 2, `ignore_pyobj` does not seem to be modified in the diff, and the diff for `setup` seems to be cut off? I think the other two functions are right, but since I am not aware of the full context, I have difficulty confidently assessing the situation. For nested functions, my policy was to use the "innermost" functions as units. That is, if method `b` is defined within method `a` and only `b` is modified, I would mark `b` as the culprit (a small illustrative sketch of this policy follows after this reply).
- For question 3 (are all functions included), in both Defects4J and BugsInPy we included methods from the covered files, not all methods. In this research, we were worried that providing all information would overwhelm the LLM.
I hope this helps!
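To make the "innermost function" policy above concrete, here is a minimal, purely illustrative Python sketch (not taken from the paper): given a file's source and the line numbers touched by the fix patch, it reports the innermost function or method enclosing each changed line. The helper name `innermost_culprits` and the example inputs are hypothetical.

```python
import ast

def innermost_culprits(source: str, changed_lines: set[int]) -> set[str]:
    """Return the innermost function/method enclosing each changed line.

    Illustrates the "innermost functions as units" policy: if only `b`
    (defined inside `a`) was modified, `b` is reported, not `a`.
    """
    spans = []  # (start_line, end_line, dotted_name) for every function/method

    def collect(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}" if prefix else child.name
                if not isinstance(child, ast.ClassDef):
                    spans.append((child.lineno, child.end_lineno, name))
                collect(child, name)
            else:
                collect(child, prefix)

    collect(ast.parse(source), "")

    culprits = set()
    for line in changed_lines:
        enclosing = [s for s in spans if s[0] <= line <= s[1]]
        if enclosing:
            # The innermost enclosing function is the one with the smallest span.
            culprits.add(min(enclosing, key=lambda s: s[1] - s[0])[2])
    return culprits

# Hypothetical usage: the fix patch touched lines 120 and 245 of some_module.py.
# print(innermost_culprits(open("some_module.py").read(), {120, 245}))
```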
I hope you're doing well. I have a question about your bug localization dataset.
Could you clarify what you mean by "covered files"? Specifically, does this term refer to using functions from the modified files (after applying the fix patch) in your dataset? If so, this might simplify the task for the model. Identifying the correct file in the project is a separate challenge, and providing this information could give unintended hints or make localizing the bug easier. In real bug fixes, we often don't know where the bug is hiding; it could be anywhere, and we don't have the actual fix.
I appreciate your insights on this matter!
Sure, good question. In our work, we take code from the files that were executed by the bug-revealing test, as defined in my answer to question 1. This can be measured via coverage tools: for example, Python has coverage.py, while for Java, IIRC, we used GZoltar results. So this information can be automatically derived, provided that we have the bug-revealing test. Hope this helps.
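As a concrete illustration of that step, here is a minimal sketch (mine, not from the paper) that runs a single bug-revealing test under coverage.py, along the lines of the `coverage run -m pytest ...` command mentioned later in this thread, and prints the files it executed. The report file name is an assumption; the test ID is the one quoted below.

```python
import json
import subprocess

def covered_files(test_id: str) -> list[str]:
    """Run one bug-revealing test under coverage.py and return the list of
    source files it executed (the "covered files" referred to above)."""
    # Equivalent to: coverage run -m pytest <test_id>
    # check=False because the bug-revealing test is expected to FAIL here.
    subprocess.run(["coverage", "run", "-m", "pytest", test_id, "-q"], check=False)
    # Write a JSON report; its "files" keys are exactly the executed files.
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    with open("coverage.json") as fh:
        return sorted(json.load(fh)["files"])

if __name__ == "__main__":
    # Test ID taken from the example later in this thread.
    test = "test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE"
    for path in covered_files(test):
        print(path)
```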
Thank you for your previous answer.
We've analyzed around 100 open-source projects and pulled together your so-called `coverage`, averaging about 6,000 functions per project. Are you seriously including all of these, or is there some heuristic you're using? This could totally mess with the LLM, since too many packages lead to it saying, "Unfortunately, there are too many packages to fit in the context. Please try a different function."
When @smkang96 said we "include" those files, he probably did not mean that we include them in the LLM prompt. A major contribution of AutoFL is how we circumvent that using functions. So we maintain the list of what has been covered, and using that information, we implement the functions made available to the language model. The functions are designed in a way that allows the LLM to drill down into the codebase. For example, the covered-class function will only give you the list of the classes, NOT the entire source code of all covered classes. Similarly, once you specify a class, the covered-method function will give you the list of covered methods of THAT class. Finally, only when you know the full signature of the method (i.e., both the class name and the method signature) can you use the code-snippet function to retrieve the code. This way, the LLM context is NOT overwhelmed.
Hope this helps... unless I am missing something entirely.
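To make this drill-down concrete, here is a minimal sketch of what such function (tool) definitions could look like in an OpenAI-style function-calling setup. The function names, fields, and descriptions below are illustrative assumptions, not the exact AutoFL interface.

```python
# Illustrative function-calling schema for the drill-down described above.
# Names and descriptions are assumptions for this sketch, not AutoFL's exact API.
TOOLS = [
    {
        "name": "get_covered_classes",
        "description": "List the classes covered by the failing test (names only, no source).",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "get_covered_methods",
        "description": "List the covered method signatures of ONE given class.",
        "parameters": {
            "type": "object",
            "properties": {"class_name": {"type": "string"}},
            "required": ["class_name"],
        },
    },
    {
        "name": "get_code_snippet",
        "description": "Return the source of one method, given its full signature.",
        "parameters": {
            "type": "object",
            "properties": {"signature": {"type": "string"}},
            "required": ["signature"],
        },
    },
]

# The model narrows down step by step (classes -> methods of one class -> one
# method body), so the full covered source never has to fit in the prompt.
```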
Guys, I get that you don't send the source code to the LLM, but we're talking about dataset collection (or how you're preprocessing the function list).
Suppose we run `coverage run -m pytest test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE`; the list of files we get is massive. And those files are packed with functions, averaging around 6,000 functions IN COVERED FILES per project!
1) Did you actually take all the functions from coverage, or are you filtering some out? 2) I see you mostly have one test per project, but in our open-source project, we’ve got way more than one test. Have you ever faced this? If so, how the hell do you filter or choose functions from different tests?
Thank you for your answer
Okay, looks like I was also confused, jumping into the middle of the thread. However, I guess the number of covered functions differs project by project. Are those all the functions in your target project, and are they actually covered by the single failing test? :)
1) Yes, all 6,000 functions are from the covered files. Even if we only use the covered functions, there will still be around 2,000 functions. Those functions are covered by the several failing tests :D 2) So, if there are several failing tests, how do you aggregate the covered functions? Do you use intersection, union, or another algorithm?
We do not perform any filtering, and I believe we only consider a single failing test with each run (so there is no aggregation). I guess the core technical issue here is: "what if the target system is so big that even the list of covered program elements is too big for the context window size?".
If it is too big, I guess it is too big... :) If there is any structural hierarchy above the class level, perhaps we can use that to further divide and conquer? For example, a list of covered packages, from which you narrow down to classes, then to methods, etc. But how effective this would be lies beyond the scope of our FSE paper.
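As a minimal sketch of that extra package level (my assumption, not part of the paper), one could group the covered classes by package, so the model first picks a package, then a class, then a method. The helper name and the fully qualified class names below are hypothetical.

```python
from collections import defaultdict

def covered_packages(covered_classes: list[str]) -> dict[str, list[str]]:
    """Group covered classes by their package, adding one more level to the
    drill-down (packages -> classes -> methods) for very large systems."""
    packages = defaultdict(list)
    for qualified_name in covered_classes:
        package, _, cls = qualified_name.rpartition(".")
        packages[package or "<default>"].append(cls)
    return dict(packages)

# Hypothetical input: fully qualified names of covered classes.
print(covered_packages([
    "org.apache.commons.lang3.StringUtils",
    "org.apache.commons.lang3.math.NumberUtils",
]))
```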
Thank you! Now I understand.
Thanks for your work! I'm interested in your research because I want to compare our method with yours. I'm planning to create my own dataset, and I have a few questions I hope you can help me with.
Suppose I have the following Pull Request:
https://github.com/sphinx-doc/sphinx/pull/7760
Let's assume that during the fix commit, the `test_build` test was modified (some lines were added/deleted), and a new test, `test_show_missing_items_quiet`, was added (it didn't exist before the fix commit). So, we now have 2 unit tests, plus 2 regression tests which were not directly modified but are also responsible for this functionality.
Questions:
1) Should all 4 tests (`test_build`, `test_show_missing_items_quiet`, and the 2 regression tests) be marked as `is_bug == true`, or should only the 2 unit tests that were modified be marked as such?
The actual patch that fixes the bug is as follows:
2) Should all four functions (`write_c_coverage`, `setup`, `ignore_pyobj`, and `write_py_coverage`) be marked as `is_bug == true`? Given that there are nested functions within these, how did you determine which functions are affected?
In order to make a thorough comparison of our methods, it's important that we use the same logic for the dataset. Looking forward to your answers to help guide my dataset creation process. Thanks!