Hi @VecherVhatuX, thank you for your interest in our work.
- For question 1 (which tests to mark with the `is_bug` tag), my understanding of the benchmarks that we used (Defects4J and BugsInPy) is that the key criterion is whether a test reproduces the bug, in the sense that it fails on the buggy version and passes on the fixed version. The fact that a test was modified in the patch is not enough. To be clear, the actual bug-revealing tests that we used are presented in the `failing_tests` files; for tests, the `is_bug` tag does not do anything (it is meant as a marker for incorrect production code methods).
- For question 2, `ignore_pyobj` does not seem to be modified in the diff, and the diff for `setup` seems to be cut off? I think the other two functions are right, but since I am not aware of the full context, I have difficulty confidently assessing the situation. For nested functions, my policy was to use the "innermost" functions as units. That is, if method `b` is defined within method `a` and only `b` is modified, I would mark `b` as the culprit (a small illustrative sketch of this policy follows after this reply).
- For question 3 (are all functions included), in both Defects4J and BugsInPy we included methods from the covered files, not all methods. In this research, we were worried that providing all information would overwhelm the LLM.
I hope this helps!
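To make the "innermost function" policy above concrete, here is a minimal, purely illustrative Python sketch (not taken from the paper): given a file's source and the line numbers touched by the fix patch, it reports the innermost function or method enclosing each changed line. The helper name `innermost_culprits` and the example inputs are hypothetical.

```python
import ast

def innermost_culprits(source: str, changed_lines: set[int]) -> set[str]:
    """Return the innermost function/method enclosing each changed line.

    Illustrates the "innermost functions as units" policy: if only `b`
    (defined inside `a`) was modified, `b` is reported, not `a`.
    """
    spans = []  # (start_line, end_line, dotted_name) for every function/method

    def collect(node, prefix):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                name = f"{prefix}.{child.name}" if prefix else child.name
                if not isinstance(child, ast.ClassDef):
                    spans.append((child.lineno, child.end_lineno, name))
                collect(child, name)
            else:
                collect(child, prefix)

    collect(ast.parse(source), "")

    culprits = set()
    for line in changed_lines:
        enclosing = [s for s in spans if s[0] <= line <= s[1]]
        if enclosing:
            # The innermost enclosing function is the one with the smallest span.
            culprits.add(min(enclosing, key=lambda s: s[1] - s[0])[2])
    return culprits

# Hypothetical usage: the fix patch touched lines 120 and 245 of some_module.py.
# print(innermost_culprits(open("some_module.py").read(), {120, 245}))
```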
I hope you're doing well. I have a question about your bug localization dataset.
Could you clarify what you mean by "covered files"? Specifically, does this term refer to using functions from the modified files (after applying the fix patch) in your dataset? If so, this might simplify the task for the model. Identifying the correct file in the project is a separate challenge, and providing this information could give unintended hints or make localizing the bug easier. In real bug fixes, we often don't know where the bug is hiding; it could be anywhere, and we don't have the actual fix.
I appreciate your insights on this matter!
Sure, good question. In our work, we take code from the files that were executed by the bug-revealing test, as defined in my answer to question 1. This can be measured via coverage tools: for example, Python has coverage.py, while for Java, IIRC, we used GZoltar results. So this information can be automatically derived, provided that we have the bug-revealing test. Hope this helps.
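As a concrete illustration of that step, here is a minimal sketch (mine, not from the paper) that runs a single bug-revealing test under coverage.py, along the lines of the `coverage run -m pytest ...` command mentioned later in this thread, and prints the files it executed. The report file name is an assumption; the test ID is the one quoted below.

```python
import json
import subprocess

def covered_files(test_id: str) -> list[str]:
    """Run one bug-revealing test under coverage.py and return the list of
    source files it executed (the "covered files" referred to above)."""
    # Equivalent to: coverage run -m pytest <test_id>
    # check=False because the bug-revealing test is expected to FAIL here.
    subprocess.run(["coverage", "run", "-m", "pytest", test_id, "-q"], check=False)
    # Write a JSON report; its "files" keys are exactly the executed files.
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    with open("coverage.json") as fh:
        return sorted(json.load(fh)["files"])

if __name__ == "__main__":
    # Test ID taken from the example later in this thread.
    test = "test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE"
    for path in covered_files(test):
        print(path)
```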
Thank you for your previous answer.
We've analyzed around 100 open-source projects and pulled together your so-called `coverage`, averaging about 6,000 functions per project. Are you seriously including all of these, or is there some heuristic you're using? This could totally mess with the LLM, since too many packages lead to it saying, "Unfortunately, there are too many packages to fit in the context. Please try a different function."
When @smkang96 said we "include" those files, he probably did not mean that we include them in the LLM prompt. A major contribution of AutoFL is how we circumvent that using functions. So we maintain the list of what has been covered, and using that information, we implement the functions made available to the language model. The functions are designed in a way that allows the LLM to drill down into the codebase. For example, the covered-class function will only give you the list of the classes, NOT the entire source code of all covered classes. Similarly, once you specify a class, the covered-method function will give you the list of covered methods of THAT class. Finally, only when you know the full signature of the method (i.e., both the class name and the method signature) can you use the code-snippet function to retrieve the code. This way, the LLM context is NOT overwhelmed.
Hope this helps... unless I am missing something entirely.
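To make this drill-down concrete, here is a minimal sketch of what such function (tool) definitions could look like in an OpenAI-style function-calling setup. The function names, fields, and descriptions below are illustrative assumptions, not the exact AutoFL interface.

```python
# Illustrative function-calling schema for the drill-down described above.
# Names and descriptions are assumptions for this sketch, not AutoFL's exact API.
TOOLS = [
    {
        "name": "get_covered_classes",
        "description": "List the classes covered by the failing test (names only, no source).",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "get_covered_methods",
        "description": "List the covered method signatures of ONE given class.",
        "parameters": {
            "type": "object",
            "properties": {"class_name": {"type": "string"}},
            "required": ["class_name"],
        },
    },
    {
        "name": "get_code_snippet",
        "description": "Return the source of one method, given its full signature.",
        "parameters": {
            "type": "object",
            "properties": {"signature": {"type": "string"}},
            "required": ["signature"],
        },
    },
]

# The model narrows down step by step (classes -> methods of one class -> one
# method body), so the full covered source never has to fit in the prompt.
```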
Guys, I get that you don't send the source code to the LLM, but we're talking about dataset collection (or how you're preprocessing the function list).
Suppose we run `coverage run -m pytest test_requests.py::RequestsTestCase::test_DIGEST_AUTH_RETURNS_COOKIE`; the list of files we get is massive. And those files are packed with functions, averaging around 6,000 functions IN COVERED FILES per project!
1) Did you actually take all the functions from coverage, or are you filtering some out? 2) I see you mostly have one test per project, but in our open-source project, we’ve got way more than one test. Have you ever faced this? If so, how the hell do you filter or choose functions from different tests?
Thank you for your answer
Okay, looks like I was also confused, jumping into the middle of the thread. However, I guess the number of covered functions differs project by project. Are those all the functions in your target project, and are they actually covered by the single failing test? :)
1) Yes, all 6,000 functions are from the covered files. Even if we only use the covered functions, there will still be around 2,000 functions. Those functions are covered by the several failing tests :D 2) So, if there are several failing tests, how do you aggregate the covered functions? Do you use intersection, union, or another algorithm?
We do not perform any filtering, and I believe we only consider a single failing test with each run (so there is no aggregation). I guess the core technical issue here is: "what if the target system is so big that even the list of covered program elements is too big for the context window size?".
If it is too big, I guess it is too big... :) If there is any structural hierarchy above the class level, perhaps we can use that to further divide and conquer? For example, a list of covered packages, from which you narrow down to classes, then to methods, etc. But how effective this would be lies beyond the scope of our FSE paper.
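As a minimal sketch of that extra package level (my assumption, not part of the paper), one could group the covered classes by package, so the model first picks a package, then a class, then a method. The helper name and the fully qualified class names below are hypothetical.

```python
from collections import defaultdict

def covered_packages(covered_classes: list[str]) -> dict[str, list[str]]:
    """Group covered classes by their package, adding one more level to the
    drill-down (packages -> classes -> methods) for very large systems."""
    packages = defaultdict(list)
    for qualified_name in covered_classes:
        package, _, cls = qualified_name.rpartition(".")
        packages[package or "<default>"].append(cls)
    return dict(packages)

# Hypothetical input: fully qualified names of covered classes.
print(covered_packages([
    "org.apache.commons.lang3.StringUtils",
    "org.apache.commons.lang3.math.NumberUtils",
]))
```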
Thank you! Now I understand.
Thanks for your work! I'm interested in your research because I want to compare our method with yours. I'm planning to create my own dataset, and I have a few questions I hope you can help me with.
Suppose I have the following Pull Request:
https://github.com/sphinx-doc/sphinx/pull/7760
Let's assume that during the fix commit, the `test_build` test was modified (some lines were added/deleted), and a new test, `test_show_missing_items_quiet`, was added (it didn't exist before the fix commit). So, we now have 2 unit tests, plus 2 regression tests which were not directly modified but are also responsible for this functionality.
Questions:
1) Should all 4 tests (`test_build`, `test_show_missing_items_quiet`, and the 2 regression tests) be marked as `is_bug == true`, or should only the 2 unit tests that were modified be marked as such?
The actual patch that fixes the bug is as follows:
2) Should all four functions (`write_c_coverage`, `setup`, `ignore_pyobj`, and `write_py_coverage`) be marked as `is_bug == true`? Given that there are nested functions within these, how did you determine which functions are affected?
In order to make a thorough comparison of our methods, it's important that we use the same logic for the dataset. Looking forward to your answers to help guide my dataset creation process. Thanks!