Fix issues with parsing and SARIF generation for SpaceROS

Ronoman commented 2 years ago

Before, the rules field in the SARIF output was polluted with many duplicate rules, as the full error message was included. Now, only the rule id that clang-tidy outputs between brackets (i.e. [google-explicit-constructor]) is taken and stored in rules.
Before, artifact paths sometimes had part or all of the error message in them. No more!
Before, the startLine and startColumn fields were strings. These are now integers.
Before, the result message had a lot more information than was needed. Now, it is just the error message (no location data or rule id).

This was tested on rclcpp and rcutils. ament_clang_tidy ran successfully on both, and outputted valid SARIF. However, I did have to change some of the RegEx's that were parsing clang_tidy output. I don't have a great way to check if I'm truly capturing all of the valid output of clang_tidy, so this may be unintentionally hiding some violations.

Ronoman commented 2 years ago

It seems like I made a poor assumption with some of the output, but I'm not sure how to resolve it. Most clang_tidy errors look like this:

/home/spaceros-user/src/spaceros/src/rcutils/src/char_array.c:153:14: warning: Call to function 'vsnprintf' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'vsnprintf_s' in case of C11 [clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
  int size = vsnprintf(char_array->buffer, char_array->buffer_capacity, format, args_clone);
             ^

These are matched and groups are extracted properly with the regex as it stands. However, some output looks like this:

/home/spaceros-user/src/spaceros/src/rcutils/src/char_array.c:123:5: note: Call to function 'memcpy' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'memcpy_s' in case of C11

Here, the rule ID is missing from the result. In most cases, these are preceded by the same exact error, like this block:

/home/spaceros-user/src/spaceros/src/rcutils/src/array_list.c:175:5: warning: Call to function 'memcpy' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'memcpy_s' in case of C11 [clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling]
    memcpy(dst_ptr, src_ptr, array_list->impl->data_size * copy_count);
    ^
/home/spaceros-user/src/spaceros/src/rcutils/src/array_list.c:175:5: note: Call to function 'memcpy' is insecure as it does not provide security checks introduced in the C11 standard. Replace with analogous functions that support length arguments or provides boundary checks such as 'memcpy_s' in case of C11

In these cases, it's safe to throw away the second result since it is duplicating one already listed (note that both results are pointing to array_list.c:175:5). However, this isn't always the case. It seems like clang_tidy does not add the rule ID when it is reporting note: results, as opposed to warning: or error: results. How can we capture this different clang_tidy output properly in SARIF?

nuclearsandwich commented 2 years ago

In these cases, it's safe to throw away the second result since it is duplicating one already listed (note that both results are pointing to array_list.c:175:5). However, this isn't always the case. It seems like clang_tidy does not add the rule ID when it is reporting note: results, as opposed to warning: or error: results. How can we capture this different clang_tidy output properly in SARIF?

It doesn't look like clang-tidy itself supports formatting its output nor does its list of checks state what level a check is is (note, warning, or error). The ruleID is optional right? Can we just omit it for notes?

Ronoman commented 2 years ago

It doesn't look like clang-tidy itself supports formatting its output nor does its list of checks state what level a check is is (note, warning, or error). The ruleID is optional right? Can we just omit it for notes?

According to The Spec, it looks like we can omit result.ruleId in some cases. The relevant sentence (I think) is this:

If theDescriptor does not exist (that is, if theTool does not contain a reportingDescriptor object (§3.49) that describes the rule that was violated), then rule SHALL NOT be present.

theTool probably does have the relevant rule ID, but since the result for notes doesn't give it to us, I think we're okay to omit it. In most of the cases I've seen with clang_tidy, the result description field is usually sufficient to understand what the violation is, and what kind of action needs to be taken to resolve it.

Ronoman commented 2 years ago

Resolved all outstanding comments in 5d5e2eda. However, a new slight issue arises...

I've updated the regex rules to capture all clang_tidy results, even one that don't end in a [rule description]. However, this means that process_sarif is no longer going to de-duplicate results in clang_tidy properly. On line 106 of sarif_helpers, we check if the tuple (threeple?) (ruleId, artifact, region) has been seen before. With these changes, there are now results in clang_tidy that are exact duplicates, but some are missing the ruleId, so this check is no longer sufficient.

Is it okay to reduce the tuple to (artifact, region)? This would mean that any results that are pointing to the same file, with the exact same region (line number, sometimes column number) will be considered identical. I haven't yet observed two unique results pointing to the exact same artifact and region, so this seems like a reasonable assumption, but not always guaranteed to be true.

mjeronimo commented 2 years ago

Perhaps it would be more accurate to keep the (ruleId, artifact, region) and only make the reduction to (artifact, region) if necessary. In other words, ruleId would be optional in the comparison function, but used if present in both items to be compared.

ivanperez-keera commented 1 year ago

I'm following up on this (we have a PR in Space ROS that depends on this).

Does it still make sense to do this? If so, what's missing?

EzraBrooks commented 1 year ago

clang-tidy can be configured to output a YAML-formatted "fixes file" which is probably the way I'd recommend to ingest tidy output programmatically.

ament / ament_lint

Fix issues with parsing and SARIF generation for SpaceROS #419