GitGuardian / ggshield

Find and fix 400+ types of hardcoded secrets and 70+ types of infrastructure-as-code misconfigurations.
https://gitguardian.com
MIT License
1.59k stars 139 forks source link

feat(sca): implement incidents output for autoignore feature #844

Closed carla-gitguardian closed 5 months ago

carla-gitguardian commented 5 months ago

The autoignore feature modified SCA scan classes returned by py-gitguardian. The added attributes should not be kept in the output schemas of ggshield.

codecov-commenter commented 5 months ago

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (70053de) 91.89% compared to head (c717e25) 91.81%.

Files Patch % Lines
ggshield/verticals/sca/output/json_handler.py 23.52% 13 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #844 +/- ## ========================================== - Coverage 91.89% 91.81% -0.08% ========================================== Files 168 167 -1 Lines 6981 6975 -6 ========================================== - Hits 6415 6404 -11 - Misses 566 571 +5 ``` | [Flag](https://app.codecov.io/gh/GitGuardian/ggshield/pull/844/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GitGuardian) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/GitGuardian/ggshield/pull/844/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GitGuardian) | `91.81% <23.52%> (-0.08%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=GitGuardian#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

carla-gitguardian commented 5 months ago

Hi @agateau-gg, In this MR I wanted to exclude some attributes from the dataclasses returned by py-gitguardian. I did not find how to do it however, so I ended up redefining the classes in ggshield without the unwanted attributes to handle the output schemas. Would you think of a better way to do it?

agateau-gg commented 5 months ago

Hi @agateau-gg, In this MR I wanted to exclude some attributes from the dataclasses returned by py-gitguardian. I did not find how to do it however, so I ended up redefining the classes in ggshield without the unwanted attributes to handle the output schemas. Would you think of a better way to do it?

Just read the existing code and I have questions :)

In SCAJsonOutputHandler._process_scan_*_impl() methods: why do we deserialize the dict and then dump the object as JSON? It seems to me it would be simpler to run json.dumps() on the dict.

If there is an interest in serializing then deserializing, can you add a comment explaining it?

Assuming there is an interest in doing it, I think we can avoid basically duplicating all models by changing the code to use schema.to_dict() instead of schema.dumps() so that we have dictionaries and then remove the keys we don't want from said dictionaries.

agateau-gg commented 5 months ago

Another point: something similar has already be done on the IAC side, have a look at ggshield/verticals/iac/output/schemas.py.

carla-gitguardian commented 5 months ago

I'm not sure actually why it is serialized and then deserialized here, maybe @amascia-gg would have more clue?

As for IAC side, it had been done differently because schemas are exported from py-gitguardian for IaC whereas there are dataclasses for SCA. Maybe the change should directly occur in py-gitguardian to maintain better consistency? I don't know how much code would be impacted though.

agateau-gg commented 5 months ago

As for IAC side, it had been done differently because schemas are exported from py-gitguardian for IaC whereas there are dataclasses for SCA. Maybe the change should directly occur in py-gitguardian to maintain better consistency? I don't know how much code would be impacted though.

I am not sure I understand what difference you are referring to. Is it because in py-gitguardian IaC schemas are defined like this:

IaCVulnerabilitySchema = cast(
    Type[BaseSchema], marshmallow_dataclass.class_schema(IaCVulnerability, BaseSchema)
)

IaCVulnerability.SCHEMA = IaCVulnerabilitySchema()

Whereas SCA schemas are defined like this?

SCAIgnoredVulnerability.SCHEMA = cast(
    BaseSchema,
    marshmallow_dataclass.class_schema(
        SCAIgnoredVulnerability, base_schema=BaseSchema
    )(),
)

ie, if all that's missing is the SCAIgnoredVulnerabilitySchema class, then yes, let's define them in py-gitguardian.

carla-gitguardian commented 5 months ago

I guess it is yes! I'll define them in py-gitguardian and use them the same way we did in IaC. I'll also look into this potentially unnecessary serialization. Thanks for your help!

Paul-GitGuardian commented 5 months ago

Hi, some news:

There is no such issue for IAC because of the absence of datetime fields in the IAC schemas.

After spending some time trying to make it work, I decided to do as you suggested: discard the additional schemas and work with the dictionaries.

Paul-GitGuardian commented 5 months ago

One minor nitpick. Not a blocker.

One thing to consider in the future is using JSON schemas to document and validate sca JSON outputs. I started doing so for secret outputs. It works like this:

* Schemas are defined in doc/schemas

* They are loaded as fixtures here: https://github.com/GitGuardian/ggshield/blob/main/tests/conftest.py#L312-L337

* They are checked by tests, for example here: https://github.com/GitGuardian/ggshield/blob/main/tests/functional/secret/test_scan_path.py#L89

Ok that's great, I've made schemas for SCA, can you take a look?

I couldn't manage to import schema elements from other files, so I had to duplicate the locations_vulnerabilities block in both schemas. It looks like additional setup is required for that: https://python-jsonschema.readthedocs.io/en/latest/referencing/