DeveloperLiberationFront / iTrustInterviews


Create List of Changes #413

Closed jssmith1 closed 6 years ago

jssmith1 commented 7 years ago

A space to edit the response letter

jssmith1 commented 7 years ago

We thank the editor and reviewers for their detailed comments. We respond to all the reviewers' feedback inline below. Here we summarize the changes we made to address the two major critiques.

We followed the editor's suggestions in response to the first critique about the validity of our analysis. We elaborated on what participants were asked to do in the tasks by adding more details, including code excerpts, to Section 2.2.3. In Section 2.2.3, we also now specify how our tasks relate to tasks in the wild. In addition to these changes, we expanded on our threats to validity section (Section 6) to now discuss all the threats brought to our attention by the reviewers. Finally, in response to this critique, the second author validated the strategy tree analysis; details about this process can be found in the new Section 2.4.3.

The second critique suggests that our results should go further in understanding security vulnerabilities and how they relate to general software development tasks. We addressed this critique by bolstering our discussion of related work. Throughout Section 3 we now compare and contrast our findings with those reported in previous studies. We also expanded Section 5.2 to describe previous studies in more detail.

Editor Comments

Editor Comments to the Author: All three reviewers reported finding the work interesting, important, and well within scope for TSE. All had ideas for ways to improve the paper, which is to be expected for any submission. Where the reviewers diverged was whether these proposed improvements were minor, major, or fundamentally new work, requiring a new submission.

Both R1 and R2 believed that the paper needed a more nuanced discussion of prior work. There is a summary in the introduction and discussion, but the paper really needs to go beyond summary, describing how the discoveries in this paper add to, refine, or reject discoveries in prior work. I think most reviewers viewed this as a critical but minor revision.

Both R2 and R3 felt the paper needed an expanded discussion of limitations. R2 in particular mentioned wanting more detail on the biases introduced by task selection, sample size, and the think-aloud protocol.

R2 and R3 also felt the work needed more replicability along multiple dimensions. For example, R2 was concerned that the data cited in citation [33] was hosted on a non-archival student website. I recommend, if feasible, incorporating the questions in an appendix in the draft, or if the length of the content is prohibitive, finding a way to provide at least a sampling of the questions as an appendix.

R3 also found the attack tree analysis inadequately specified, and sought a more detailed explanation of how the analysis was conducted so that others might verify or replicate the same analysis in future work.

In my view, all of the above changes are minor, and would be suitable for an accept with minor revisions. This leaves us with what I view as R3's two major critiques.

The first critique is whether RQ2, RQ3, and the analysis done to answer them are both sufficiently valid and sufficiently deep to warrant publication. The concern about validity is whether the tasks that participants were asked to complete really focused enough on the actual resolution of security defects, as opposed to just focusing on information seeking and hypothetical fixes. I think there is a legitimate, but ambiguous, concern here about what constitutes an ecologically valid task. In R3's view, this was not valid, and thus not a contribution, and therefore an inadequate extension beyond the conference paper to justify publication. My view is that the task has limitations (potentially large ones), but that the work nevertheless advances our understanding of this understudied phenomenon. My recommendation would be to 1) more deeply elaborate on precisely what participants were asked to do in the task, 2) explain how this might differ from a more authentic task outside of the lab, and 3) in the discussion, detail the threats to generalizability that these differences might impose on the results. These changes might be clarifications, new definitions, or more explicit arguments about the tasks' generalizability. I leave it to the authors to decide.

R3's second major critique is the lack of depth in the analysis on RQ2 and RQ3. R3 found them shallow and only loosely related to understanding security vulnerabilities. I agree with R3 and believe the analysis, results, and discussion could go deeper in trying to distinguish between the discoveries that are specific to security vulnerabilities and the discoveries that are more general to software development. This is tied to the shallow discussion of prior work, which, if deeper, would likely reveal some of the discoveries for RQ2 and RQ3 to be replications of prior work, and not new. This nuance, however, would likely also be balanced by a deeper understanding of what makes security vulnerabilities different from other kinds of defects.

Because of the clear set of minor revisions required, and multiple viable paths for the two major revisions, I am recommending the paper be revised and resubmitted for further review. I hope the authors will not only revise the writing to accomplish these major revisions, but also channel their expertise into thinking more deeply about the boundaries between vulnerability diagnosis and general debugging. I'm optimistic that this will make for an even stronger paper than the submission already is, and help better advance the field's knowledge in this important but understudied space.


Reviewers' Comments

Please note that some reviewers may have included additional comments in a separate file. If a review contains the note "see the attached file" under Section III A - Public Comments, you will need to log on to ScholarOne Manuscripts to view the file. After logging in, select the Author Center, click on the "Manuscripts with Decisions" queue, and then click on the "view decision letter" link for this manuscript. You must scroll down to the very bottom of the letter to see the file(s), if any. This will open the file that the reviewer(s) or the Associate Editor included for you along with their review.

Reviewer: 1

Public Comments (these will be made available to the author) This paper is a longer version of a previously published work that appeared at the FSE conference. I feel that there is sufficient new material in this version so that it warrants publication. The authors did a good job of explaining what is new.

The topic is questions that developers try to answer while they are performing security-driven tasks with the FSB static analysis tool. They used 10 programmers in a lab study, and identified 559 questions sorted into 17 categories. This paper further investigates the successful and unsuccessful strategies that the participants used to try to answer the questions.

I found the lists of questions and strategies interesting and thought-provoking, and the discussion of implications at the end of the paper was helpful.

My main complaint is a lack of connection with the other papers that list questions, especially [27][28][42]. Those papers looked at different tasks than the current paper, which raises the issue of to what extent each of a programmer's tasks would result in a separate list of questions. What percent of your questions are unique to security issues? It seems like lots of your questions (like all the ones about control flow and data flow, e.g., 3.3.4) would be identical to those previously reported. Are there differences in these areas? If so, why? Are the strategies to solve them the same or different?

Several reviewers suggested we discuss the related work, specifically the other papers that list questions, more deeply. To that end, we expanded our discussion in two ways. First, we added a paragraph to each category in the results section. In each of those paragraphs, we relate our categories to findings from previous work. Where possible, we distinguish the security implications from the implications pertinent in general programming contexts. Second, we expanded Section 5.2 to highlight the methodological and domain differences between our study and previous information needs studies.

Also, are your questions specifically "hard-to-answer" [28] questions, or a mix of easy and hard questions?

LaToza and Myers surveyed developers and asked them to self-report which questions they thought were "hard to answer." We did not ask our participants to retroactively assess which of their questions were hard to answer.

Small writing issues:

p. 2, col. 1, line 45 - when you introduce Figure 1, say here that it is Eclipse for Java.

Fixed.

p. 2. col 2, 2.2.2 - you say you have novice and professional developers, but then don't say anything about the differences. Did you observe any differences based on experience? You don't bring this up until section 6, but still don't say anything about what you found out.

Identifying differences between students and professionals was not the aim of our study. Therefore, our analysis didn't differentiate between the two. Further, the sample of 10 participants is probably too small to meaningfully compare the two groups. We recruited from both populations to diversify the sample, and only report statistics about experience and job title to describe that diversity. We added the following line to the paper in hopes of clarifying these motivations: "We recruited both students and professionals to diversify the sample; our analysis does not otherwise discriminate between these two groups."

Section 3: For the results, are these in any particular order? Especially 3.3.* seem to reference each other, and you jump around from low-level to high-level and back to low level.

We tried to optimize the result ordering for readability by, for example, reducing the number of substantive forward and cross references. As a result, End-User Interaction appears last in its section and categories appear in the order in which they might arise during a task (e.g., Understanding an attack -> Understanding fixes -> Assessing the fix). Nonetheless, some references remain and there is still some flexibility in deciding how to order the results. We have experimented with ordering the results by the number of questions in each category or the number of participants who asked questions in a category, but found these orderings overemphasized the quantitative aspects of our results. We are open to suggestions from the reviewers about how to specifically improve the ordering.

3.3.6 sounds like reachability questions. Is there any difference from [27]?

As we now discuss, the questions in 3.3.6 could be rephrased as reachability questions.

p. 10, col. 2 lines 8-12 - the discussion of CSRF is repetitive.

Fixed.

p. 12, col. 1, line 36 - typo "lead" -"led"

Fixed.

section 4.3 - this reminds me of the following system, that might be added along with the others you already mention:

Jeffrey Stylos and Brad A. Myers. “Mica: A Programming Web-Search Aid,” IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC'06, Brighton, UK, Sept 4-8, 2006. pp. 195-202.

p. 14, col. 1, line 46: "Comparing the four approaches" -- seems like 3 are listed

Fixed.

p. 14, col. 2, line 3: "[22], [27], [28], [42]. These three studies" -- seems like 4

Fixed.

section 5.2 - need more than just a summary of these studies - you need to specifically point out which questions and strategies confirm what was already known, and which are different. And if you have any that contradict previous studies, that would be especially interesting.

As mentioned in our previous comment about related work changes, we added more details to the Results sections and Section 5.2.

Reviewer: 2

Public Comments (these will be made available to the author) The paper consists of 17 pages. The new part of the paper is roughly a third of the total contribution. In total the authors added 512 new lines and two new figures (compared to 914 exactly copied lines, two figures, and three tables). The authors added two new research questions to the one that was already present in the previously published conference paper. On the downside, some answers to the new research questions were already included in the already published conference paper (3.3.2 Control Flow and Call Information, 3.3.3 Data Storage and Flow, 3.3.5 Application Context and Usage, 3.3.6 End-User Interaction, 3.4.2 Understanding Concepts) and there is just a small contribution for the third research question (about the assumptions developers make). Somewhere in the paper the authors state that they identified nearly 50 assumptions, but these cannot be found in the paper or in the online resource.

Nevertheless, the topic of the paper is relevant, since understanding software (and therefore understanding program comprehension) is an important topic during development and maintenance. Insights in this area can help to improve the tools used by developers, and these improved tools can then lead to better (e.g., more secure) software. Furthermore, such studies can help the research community to identify and focus on important research topics that help developers in their daily routine.

The authors conduct a think-aloud study to derive questions developers pose while understanding and judging security bugs that were presented to them. Additionally, the authors identify participants' strategies to gather the information they need to complete these tasks and the assumptions they made. This is a valid technique to get insight into such situations. Conducting studies always poses the risk of bias from the study design. In my opinion the authors have mentioned some of the key threats to validity, but the list is not exhaustive. I would ask the authors to be more detailed in this section to clearly state the limitations. Aside from this point the paper is technically sound.

The paper is easy to follow, but the title "How developers diagnose potential security vulnerabilities with static analysis" is somewhat misleading. This title suggests that the article is about developers that employ different static analysis tools to identify vulnerabilities. What they actually do is understand the results of a particular static security analysis tool and justify a fix for the reported vulnerability.

We interpret this to be a question about whether we were studying multiple tools. To clarify this point, we changed the title to: "How Developers Diagnose Potential Security Vulnerabilities with a Static Analysis Tool."

Methodology:

  • The results (the concrete questions and strategies) are published online on one of the authors' institutional websites and are linked from the article. Since the mentioned author is a Ph.D. student, I am afraid that the page will vanish in a year or two (similar to the results of the original paper, which are already offline). Is there a way to make the results more long-lasting? Since the results of the conference paper are not reachable anymore, it is not possible to identify whether the presented strategies were already in the data from the conference paper. This fact makes it harder to judge whether the paper is novel.

We apologize that the material from the original submission was not available and agree that the material should be in a longer term archival location. Based on R2's suggestion, we added the questions to the appendix. The strategy analysis results span 42 pages, so we opted to exclude them from the appendix. Additionally, we moved all of the study materials, including the strategy results, to an archival website https://figshare.com/projects/How_Developers_Diagnose_Potential_Security_Vulnerabilities_with_Static_Analysis/24439.

  • Furthermore, it would be nice to extend these results with the assumptions you mention within the article to make them available to the research community.

We have made the full list of assumptions available in the online archive: https://figshare.com/account/projects/24439/articles/5449768

  • p4 l43: Do you really think that data flow questions are more relevant in the security context than in other contexts? From a tool vendor perspective a null pointer exception may be as relevant as a security issue and also leads to the question: Where does the null value come from?

We apologize for the ambiguity in our writing. It wasn't our intention to argue that data flow questions are more or less relevant in a security context than in other contexts. Rather, we wanted to point out that questions that don't appear to be uniquely about security can 1) have important security implications and 2) require special considerations in the context of security. We have revised the text in that section and hope it now conveys the point more clearly.
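To illustrate the point, here is a minimal, hypothetical Java sketch (the class, method, and table names are ours for illustration; they are not taken from iTrust or the study tasks). The generic data-flow question "where does this value come from?" takes on a security-specific answer once the value reaches a SQL statement:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical example; names are illustrative only.
public class PatientLookup {

    // Generic data-flow question: where does `name` come from?
    // If it originates from an HTTP request parameter, concatenating it
    // into the query makes this an injection point -- the same question
    // now has a security answer, not just a correctness answer.
    public ResultSet findByNameUnsafe(Connection conn, String name) throws SQLException {
        Statement stmt = conn.createStatement();
        return stmt.executeQuery("SELECT * FROM patients WHERE name = '" + name + "'");
    }

    // The special consideration in a security context: bind the value as a
    // parameter so its origin can no longer change the query's structure.
    public ResultSet findByNameSafe(Connection conn, String name) throws SQLException {
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM patients WHERE name = ?");
        stmt.setString(1, name);
        return stmt.executeQuery();
    }
}
```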

  • To answer RQ2 (and I guess RQ3 as well) only one of the authors analyzed the transcripts for strategies (in contrast to RQ1 from the conference paper). Is there a valid reason? Are the extracted strategies and assumptions objective and complete?

The reviewers raise a sound concern over the validity of our strategy extraction process. The initial process was rather time-consuming; we estimate ~50 hours of effort. To assess the validity and completeness of our strategy trees without duplicating all of this effort, the second author reviewed all the strategy trees while watching the corresponding videos. We detail this process in the paper (Section 2.4.3).

  • Can you please clarify what an assumption is in your study? Is it simply related to facts that are not checked? For instance, if a subject sees a method call sanitizeData(x) and states "I assume the data is sanitized within the method," it would be an explicit assumption caught by your methodology. If he states "The data is sanitized here," would it be an implicit assumption and not be found by your methodology? In particular, do you differentiate between correct and wrong assumptions and correct and wrong conclusions? Currently, it is hard to tell what an assumption is and what you can derive from these assumptions.

We clarified our definition of assumptions and how they relate to strategies (Section 2.5). To summarize, yes, they are related to facts that are not checked. Methodologically speaking, assumptions were recorded when participants explicitly stated, using a keyword, that they were making an assumption. We do distinguish between correct and wrong assumptions, but do not try to infer what participants' conclusions were.

  • The assumption mentioned on p5 l5, "For example, one participant assumed an external library was implemented securely," cannot be found in the remainder of the article. I would expect it to be mentioned in the results section. (Another question is whether this erroneous assumption had any impact on the performance of the subject.)

We discuss assumptions about the usage and security of external libraries in the Code Background and Functionality results section. To elucidate the connection between the example we use on page 5 and its discussion later in the paper we added a reference to the results section. We also added a more detailed discussion of the assumption from page 5 (P4's assumption) to the results section.

Results:

  • Are there similar strategies for different question categories? If there are, it would be interesting to discuss the strategies you identified independently from the questions programmers ask (of course there should be a link to the questions). This way it would be easier to identify the more relevant strategies that answer questions from different question categories.

The reviewer inquires about strategies that cut across question categories. We discuss some of these strategies in Section 4.1 and Section 4.3. We aren't clear whether the reviewer is suggesting here that we restructure the results so that all the strategies are presented together. We experimented with this format in an earlier draft and ultimately decided to present strategies alongside the questions. While the alternative version makes it easier to identify cross-category strategies, the current version makes it much easier to trace between participants' questions and their strategies for answering them.

  • On p5 you state that you observed 73 assumptions in total. The remainder of the section mentions assumptions very rarely. Can you separate the strategies and assumptions (since they belong to different research questions) to make them easier to identify?

We chose to discuss the assumptions alongside the strategies because the assumption results are sparsely distributed throughout the results sections (see the added discussion in Limitations) and because the assumption results are closely related to the strategy results (see the added discussion in Assumption Analysis). To make the assumptions easier to find, we added them to the web archive and the appendix.

  • On p11 (ch. 3.5.2) you state that a sanity check within the DAOs would be an incorrect fix. Is it incorrect or just a "dirty" solution? Would it be possible to fix the vulnerability by adding a sanity check in the DAO?

It would be possible to add sanitization to the DAO and fix the particular issue locally. However, this creates unnecessary code duplication and violates iTrust's architectural convention of organizing all validator classes in the /validate folder. Further, every other place where the faulty validator method is used would similarly have to be updated. We added this clarification to Section 3.5.2.
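To make the trade-off concrete, here is a minimal, hypothetical Java sketch (the class names and the validation rule are ours for illustration; they are not iTrust's actual code):

```java
// Local "fix": a sanity check embedded in one DAO. It silences the warning
// for this call site, but duplicates the validation rule and leaves every
// other caller of the faulty validator unprotected.
class PatientDAO {
    void updateName(String name) {
        if (!name.matches("[A-Za-z '\\-]{1,30}")) { // duplicated, ad hoc rule
            throw new IllegalArgumentException("invalid name");
        }
        // ... write to the database ...
    }
}

// Convention-following fix: correct the rule once in the shared validator
// (iTrust groups these under /validate), so every DAO and form that relies
// on it benefits from the corrected check.
class NameValidator {
    private static final String NAME_PATTERN = "[A-Za-z '\\-]{1,30}";

    static void validate(String name) {
        if (!name.matches(NAME_PATTERN)) {
            throw new IllegalArgumentException("invalid name");
        }
    }
}
```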

  • On p12 you mention that the interviewer directed the subject to a part of the documentation he overlooked. As long as it wasn’t the last task for the subject: Did the interviewer change the behaviour of subject P3? What are the implications for the data point P3 creates? Can it be used for the evaluation?

It is possible that asking P3 to reexamine the notification text after completing T2 could have influenced that participant's subsequent tasks. In particular, by pointing out the reference links, the participant might have been more likely to follow those links in the future. However, the notification text for T3 and T4 did not include any reference links. We added this note as an aside to the relevant results section.

Discussion:

  • You state in your Related Work section that there are information needs studies with similar results. It would be helpful if you discussed the differences between the questions developers ask during "normal" bug fixing or program understanding and those asked while "fixing" security bugs. This discussion may help to answer the question of whether there are security-bug-specific questions/strategies/assumptions.

Reviewer 1 also suggested we discuss the differences between "normal" questions and the questions we identified. We added this discussion to the paper and distributed it throughout the results sections to facilitate comparison.

  • How did you bias the results by your task selection? You chose three tasks that are somehow related to missing input validation. Therefore, it is not very surprising that questions regarding control and data flow are posed, since it is a dataflow problem. Would you expect other questions if you had chosen other vulnerabilities, such as missing authorization?

R2 and R3 bring a valid threat to our attention; task selection may bias our results. We have revised the manuscript to acknowledge this threat in Section 6. Specifically, we added the paragraph starting with: "Another reason we cannot claim our categorization is comprehensive is because the questions and strategies may have been dependent on the four FSB notifications we chose and the order they were presented in. "

  • On p13 the authors mention that the participants didn't follow the links within the FSB documentation. Is this an indicator of an incomplete introduction to FSB? Do you think the results would change if you had mentioned the links explicitly in the tool introduction?

All participants had experience with FindBugs before the study and were given a review of its features during the briefing. We believe it is reasonable to expect that participants were aware of the links before commencing the study. R2 is correct to imply that explicitly stressing the importance of the FSB links could have changed participants' behavior. In designing the briefing session, we were concerned that overemphasizing any of the environment's features, including the FSB links, would artificially influence behavior.

Threats to Validity:

  • Does the sample size influence your results?
  • Does the fact that you are in a think-aloud situation pose a threat to the validity of the results? Does the objective self-awareness change the behaviour of the participants? Are the self-reflection questions results of the self-awareness?
  • iTrust is a tool developed at North Carolina State University, implying that all developers from the sample have a similar educational background. Does it affect the results of your study?

The reviewers raised concerns about the homogeneity of our participant sample and the fact that security experts were not well represented. We recognize these threats and now discuss them in Section 6. Specifically, we added the paragraph containing the following sentence: "The participants we studied limit generalization and may not represent the range of developers who would use security tools." We also added a paragraph to Section 6 that discusses the potential confounds introduced by our think-aloud methodology.

Reviewer: 3

Public Comments (these will be made available to the author) SUMMARY: This paper reports on an exploratory study about the defect resolution process of software developers when using security tools. The authors observe the interactions of 10 developers (5 of them students) while using a static analysis tool called Find Security Bugs to resolve security vulnerabilities. Static analysis tools aim to help developers remove security defects during the development cycle, before code executes. In their published conference paper the authors focus on which questions ‘developers’ ask while resolving security defects (i.e., RQ1: What information do developers need while using static analysis tools?). This journal submission uses the same dataset and focuses on how developers answer those questions: RQ2: What strategies do developers use to acquire the required information? and RQ3: What assumptions do developers make while executing the strategies? The authors describe strategic successes and failures of the observed developers that can be used by tools to encourage better strategies.

MAIN STRENGTH:

  • Interesting topic for the further development of static analysis tools for vulnerabilities.
  • Overall, the paper is well written and easy to follow.

MAIN WEAKNESSES: 1- The increment between the conference paper and journal submission is minimal and does not justify a new publication in my opinion: This is the list of extensions (the rest is identical):

  • Section 2.1: RQ2 and RQ3, which are dependent on RQ1 (which was answered in the FSE paper).
  • Section 2.4: which does not explain how the authors came up with the strategy tree, nor how the authors came up with the failure/success criteria. There is also no mention of a stopping criterion to decide when a strategy was finished. This can also hardly be answered, since the data were collected for a different reason.

We separated Section 2.4 into two sections and expanded on each of the two sections, "Strategy Definitions" and "Strategy Extraction." As we mentioned in response to R2, we also added a section describing the additional strategy tree validation we conducted. In the "Strategy Definitions" section we clarify our definitions of strategies and strategy trees. We now describe how strategy trees relate to some other representations of strategies and articulate stopping criteria that were used to determine what to exclude from the trees. Finally, we added a paragraph outlining our success/failure criteria to the "Strategy Extraction" section.

  • The sections 3.2-3.5 which report the new findings (i.e., strategies and assumptions) are not well developed. They only report the strategies: some are very short (e.g., 3.2.4). Some are very general and do not apply necessarily to the context of vulnerability (e.g., 3.3.1).

We reported the assumptions where they could be associated with information needs. We now report them more completely in the appendix. We summarized our strategies for brevity, especially where we observed relatively few strategies. The full 40 pages can be found in the online archive. We agree with the reviewer that some strategies are general, but we don't view that as a problem. The generality suggests that improving these strategies would also benefit developers in diagnosing security vulnerabilities.

  • In the discussion, Section 4.3 is also very general and not novel. It is unclear what the implications are.

For implications, we sketch three different tools that could help developers satisfy their information needs and execute strategies more efficiently. While the tools we outline draw inspiration from existing tools, we believe them to be novel.

2- If the authors are interested in understanding and learning from developers how security vulnerabilities are tackled in development tasks, then I would expect to see a new study with a different sample, goals, and perhaps even method (e.g., interviews). The original study focused on the information needs. Studying students or security novices is okay and might even be meaningful for understanding issues. Reusing the data to study ‘how developers diagnose potential vulnerabilities’ is rather inappropriate. Even if the full sessions (including the questions and how developers tried to answer them) were recorded, the observers, the data collection procedure, and the sample were focused on the information needs and not on resolution strategies.

The reviewer is correct that analyzing the questions alone would not provide sufficient data for identifying resolution strategies. As the reviewer implies, to identify the strategies we set aside the analyzed questions and started with the full session data (transcribed audio/video recordings). Because we asked developers to propose solutions, it seems appropriate to us to focus on resolution strategies.

3- Strategies and assumptions are highly dependent on the selected security defects and developers: what is the prior knowledge of the developers regarding the security defects, and how frequently are the security defects discussed on the web? The selection of tasks should have been broader to strengthen the results.

As noted in our response to R2's similar concern, we added this threat to Section 6.

Reading the procedure raises the assumption that none of the developers is a security expert (e.g. randomly browsing StackOverflow posts or clicking tool-hints).

4- I would like to see an additional experiment where developers use a mocked tool which takes into account the proposed strategies, to have a comparison to the tool used in the explorative study regarding success/failure rate and time consumption.

We appreciate R3's suggestion to conduct an additional study with a mocked tool. In fact, it is a line of research we are actively pursuing. We have created one prototype tool, which we call Flower, based on the ideas we outlined in 4.1. Our preliminary study evaluating this tool will be published at VL/HCC soon (preprint: http://www4.ncsu.edu/~jssmit11/Publications/VLHCC17_Flower.pdf). Admittedly, this tool only addresses some of the strategic failures. We have also created mockups for the tool described in Section 4.2, though we have yet to evaluate these mockups. We have updated the discussion sections to include links to relevant materials pertaining to both of these tools.

4- The strategies the authors finally provide are too general and could, e.g., be used to optimize IDEs rather than security tools.

OTHER COMMENTS:

  • The title (and for the rest of the paper) implies that developers actually care about vulnerabilities. Is this really the case?

In Section 1 we cite Christakis2016, which provides evidence that developers at least self-report that they care more about security issues than any other type of code issue.

  • Table 3 could be better organized, for example, by referencing which kind of vulnerability (Task 1-4) the category applies to.

Regarding Table III, the list of categories, we implemented R3's suggestion by adding a "Tasks" column to the table. This column reports which categories of questions were observed during each task. The column shows that nearly every question category was observed during all four tasks.

  • Possibly the sequence of the tasks can mask some of the strategies and assumptions. For example, while browsing Stack Overflow to understand the attack in Task 1 (e.g., Section 3.2.1), one could find information regarding the attack in Task 2. Therefore, when tackling Task 2 the strategy does not manifest.

Though none of the tasks were dependent on each other in the way the reviewer describes, we do now mention the threat that the fixed task sequencing imposes in Section 6.

  • In general, the authors should elaborate a bit more about the tasks. Are they somehow connected? What would be the impact?

We elaborated on what participants were asked to do in the tasks and described in more detail how our tasks relate to tasks in the wild. We already include the verbatim instructions given to participants in the appendix, but have expanded on the task descriptions in the Methodology section. In particular, we added code excerpts for each task and more details about what was involved in each task. To help clarify the difference between our tasks and "more authentic" tasks outside the lab, we added the paragraph starting with: "The tasks we chose encompass a subset of the vulnerability mediation activities in the wild." The tasks were not connected beyond the fact that they were all derived from the same code base. We added more details about each of the four tasks to the Methodology section under Tasks.

  • Section 3.2.2: the distinction between alternative and fix is unnecessary. A fix (e.g., a secure way to achieve a behavior) is always an alternative (to an insecure way of achieving the same behavior).

We agree with the reviewer that the terms 'Alternative' and 'Fix' are confusingly similar. We renamed the category to "Understanding Approaches and Fixes."

  • One interesting aspect would be to see how the reported behavior compares to that with a traditional (i.e., not focused on security) static analyzer.
  • There are two strong assumptions that make the study resemble a less real-world scenario: i) it seems that performance might not be an issue (i.e., Section 3.2.3); ii) it cannot be observed whether the selected strategy in turn introduces a bug.

As we now discuss in the Methodology section, our tasks do not encompass all the types of activities involved in defect resolution in the wild, like testing whether a proposed fix introduces a bug. We do, however, believe we capture an important set of activities involved in defect resolution, namely vulnerability diagnosis.

  • One suggestion is to present the categories as a graph, rather than a list, since there seems to be dependencies among categories.

R3 correctly notes the dependencies among categories. However, we are hesitant to restructure the table as a graph, because it remains unclear to us when to include/exclude an edge. Our methodology does not systematically explore the relationships between categories. In the results sections we do discuss connections between categories as they arise, primarily to articulate the differences between similar seeming categories --- these differences were points of confusion for reviewers in previous iterations of the paper.

  • The discussion is weak. There are some implications for industry (i.e., tool vendors) but no precise way for researchers to build on the results. It would be good if, for each finding, the authors stated one (or more) actionable hypotheses that explain (given their knowledge of the topic) why the strategy they observe takes place. This would allow others to verify such hypotheses.

Identifying the strategies participants executed is much easier than identifying participants' motivations, especially when participants don't explicitly articulate those motivations. We've taken the reviewer's suggestion and added a few hypotheses to the results sections. If there remain specific categories where the reviewer wonders why one strategy takes place over another, we can do our best to speculate.

  • My main takeaway is that there is a misalignment between the tool developers’ knowledge and the users’ knowledge (this can simply happen because the tool was developed to be used by experts, but the subjects of the study are not). However, the paper does not answer the question: How can this gap be filled?

Three ways tools can be better aligned with users' needs are outlined in Section 4. If the question is more broadly how toolsmiths empathize with users, then that seems beyond the scope of this work.

jssmith1 commented 7 years ago

Summarize change for editor, put bulk of comments in the reviewer's issues.

CaptainEmerson commented 7 years ago

Good work!

Still need to:

jssmith1 commented 7 years ago

We thank the editor and reviewers for their detailed comments. We respond to all the reviewers' feedback inline below. Here we summarize the changes we made to address the two major critiques.

We followed the editor's suggestions in response to the first critique about the validity of our analysis. We elaborated on what participants were asked to do in the tasks by adding more details, including code excerpts, to Section 2.2.3. In Section 2.2.3, we also now specify how our tasks relate to tasks in the wild. In addition to these changes, we expanded on our threats to validity section (Section 6) to now discuss all the threats brought to our attention by the reviewers. Finally, in response to this critique, the second author validated the strategy tree analysis; details about this process can be found in the new Section 2.4.3.

The second critique suggests that our results should go further in understanding security vulnerabilities and how they relate to general software development tasks. We addressed this critique by bolstering our discussion of related work. Throughout Section 3 we now compare and contrast our findings with those reported in previous studies. We also expanded Section 5.2 to describe previous studies in more detail.

jssmith1 commented 7 years ago

@CaptainEmerson:

CaptainEmerson commented 7 years ago

LGTM