crocs-muni / sec-certs

Tool for analysis of security certificates and their security targets (Common Criteria, NIST FIPS 140-2, ...).
https://sec-certs.org
MIT License

Refactor FIPS & unify CLI actions #275

Closed: adamjanovsky closed this pull request 1 year ago

adamjanovsky commented 1 year ago

High-level plan

Tests

FIPS

Processing steps progress:

Random notes to do there:

Done outside of the processing steps:

It is unclear what things like clean_cert_ids or algorithms in the various certificate subobjects (PdfData and WebData) actually are: are they processed or raw, what sort of cleaning was done, and what method needs to be called to do it? Specifically, this cleaning of FIPS certs is split between the dataset class and the certificate class in a nasty way. It should be unified and moved, maybe even to a separate class, similar to how CC certificate ID canonicalization is separate; see the sketch below.
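
As a hedged illustration of that last point, a minimal sketch of what such a separate class could look like; all names here (FIPSCertIdCleaner, known_algorithm_ids) are hypothetical, not the project's actual API:

```python
from __future__ import annotations


class FIPSCertIdCleaner:
    """Hypothetical single home for FIPS cert-id cleaning, analogous to
    keeping CC certificate-ID canonicalization in its own module."""

    def __init__(self, known_algorithm_ids: set[str]) -> None:
        self.known_algorithm_ids = known_algorithm_ids

    def clean(self, raw_cert_ids: set[str]) -> set[str]:
        # Drop detected ids that actually refer to algorithm certificates,
        # leaving only plausible references to other modules.
        return {i for i in raw_cert_ids if i not in self.known_algorithm_ids}
```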

Incomplete: see parse_html_main; the logic there is just not good.

Misc

Sanity check

Still part of refactoring, but to-be-addressed by separate PRs

J08nY commented 1 year ago

I applaud this effort! When I get more time hopefully I can help as well.

In the meantime, let me add a few pointers/issues I hit when working with the FIPS code:

adamjanovsky commented 1 year ago
> * Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.

Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?

J08nY commented 1 year ago
> > * Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
>
> Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?

Did this in #276.

adamjanovsky commented 1 year ago

I'll re-implement the folder structure of FIPS Dataset. Outline:

```
dset
├── FIPS Dataset.json
├── auxillary_datasets
│   └── algorithms
│       └── fips_algorithms.json
├── certs
│   ├── modules
│   └── policies
│       ├── pdf
│       └── txt
└── web
    ├── fips_modules_active.html
    ├── fips_modules_historical.html
    └── fips_modules_revoked.html
```
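
Purely as illustration, such a layout might be exposed as path properties on the dataset class; the class and property names below are hypothetical, only the directory names come from the outline above:

```python
from pathlib import Path


class FIPSDatasetPaths:
    """Hypothetical helper mirroring the outlined directory layout."""

    def __init__(self, root_dir: Path) -> None:
        self.root_dir = root_dir

    @property
    def algorithms_json(self) -> Path:
        return self.root_dir / "auxillary_datasets" / "algorithms" / "fips_algorithms.json"

    @property
    def modules_dir(self) -> Path:
        return self.root_dir / "certs" / "modules"

    @property
    def policies_pdf_dir(self) -> Path:
        return self.root_dir / "certs" / "policies" / "pdf"

    @property
    def policies_txt_dir(self) -> Path:
        return self.root_dir / "certs" / "policies" / "txt"

    @property
    def web_dir(self) -> Path:
        return self.root_dir / "web"
```
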
adamjanovsky commented 1 year ago

@J08nY looking at the code here and there, I planned the following:

This comes with some advantages:

Some disadvantages that cross my mind:

Now's the time to speak up if you have comments/tips/whatever :).

J08nY commented 1 year ago

Interesting approach but I am worried about a few things:

I guess I am really not sure what problem you are trying to solve by passing the callables. I will have to look at the code.

J08nY commented 1 year ago

Looking at the code a bit. I think now I would just match the FIPS Dataset API to the CC Dataset API manually, even at the cost of some code duplication. I don't feel like it is a good time to rework the handling like this during this FIPS unification.

J08nY commented 1 year ago

Ah yes, let's do this: https://refactoring.guru/design-patterns/template-method and not pass callables around. Just call the methods directly; even if there might be some duplication, I think it makes sense for future flexibility. A minimal sketch follows.
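
For reference, a minimal template-method sketch in this context; the step names are hypothetical, not the real sec-certs methods:

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Fixes the processing skeleton; subclasses fill in the steps."""

    def process(self) -> None:
        # The invariant pipeline, identical for CC and FIPS.
        self.download_artifacts()
        self.convert_pdfs()
        self.analyze()

    @abstractmethod
    def download_artifacts(self) -> None: ...

    @abstractmethod
    def convert_pdfs(self) -> None: ...

    @abstractmethod
    def analyze(self) -> None: ...


class FIPSDataset(Dataset):
    def download_artifacts(self) -> None:
        print("download module pages and security policies")

    def convert_pdfs(self) -> None:
        print("convert policy pdfs to txt")

    def analyze(self) -> None:
        print("extract keywords from policies")


FIPSDataset().process()
```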

adamjanovsky commented 1 year ago

I was thinking about the logic of the download testing. At the moment, the situation is as follows:

adamjanovsky commented 1 year ago

@J08nY Some resolution of the problem that you've pointed to is in https://github.com/crocs-muni/sec-certs/pull/275/commits/90521384dc416b8dfb22f12b27e22e2f14ad50d6

adamjanovsky commented 1 year ago

@J08nY Do you have any idea how the cert_id normalization in FIPS works? I basically finished the refactoring, apart from:

IMO, the ID of the certificate itself is fully determined by the contents of the module html page. What needs to get cleaned are the various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?

So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in FIPSCertificate. Does this make sense? Formally, there's no cert_id normalization or anything like that; you just delete detected foreign cert_ids that you consider weak in some sense.

Also, I'll probably try to convince @GeorgeFI to take a look at building the graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do it here as well.

What do you think?

J08nY commented 1 year ago

> IMO, the ID of the certificate itself is fully determined by the contents of the module html page. What needs to get cleaned are the various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?

Yes, this is what was done in the original code in a way. The list of "certlike matches" was cleaned by removing the algorithm numbers the certificate references, either from the webpage or from the policy in some table (that has "algo" in the header :))

> So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in FIPSCertificate. Does this make sense? Formally, there's no cert_id normalization or anything like that; you just delete detected foreign cert_ids that you consider weak in some sense.

Yes, this is what was done, and I think it is important to keep it that way; otherwise the reference graph has way too many false positives.

> Also, I'll probably try to convince @GeorgeFI to take a look at building the graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do it here as well.
>
> What do you think?

I don't think this is necessary; you can basically use ReferenceFinder the same way as it is (and as it is in CC). I actually sort of unified this part already; you just need to provide it with a source of cert_id references for a given graph. Originally there were two sources (and I think it makes sense to keep them), one from the page and one from the security policy. W.r.t. the transitive vulns computation, that should also be doable just with some renames to point to the correct clean cert_ids.
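
To make the two-source idea concrete, a hedged sketch; FIPSCert here is a stand-in dataclass, not the real FIPSCertificate, and the real ReferenceFinder API may differ:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class FIPSCert:
    cert_id: str
    web_refs: set[str] = field(default_factory=set)     # ids found on the module page
    policy_refs: set[str] = field(default_factory=set)  # ids found in the security policy


def build_reference_graph(
    certs: Iterable[FIPSCert],
    source: Callable[[FIPSCert], set[str]],
) -> dict[str, set[str]]:
    # One graph per source of cert_id references, keeping the two sources separate.
    return {cert.cert_id: source(cert) for cert in certs}


certs = [FIPSCert("1001", policy_refs={"2002"}), FIPSCert("2002", web_refs={"1001"})]
web_graph = build_reference_graph(certs, lambda c: c.web_refs)
policy_graph = build_reference_graph(certs, lambda c: c.policy_refs)
```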

adamjanovsky commented 1 year ago

@J08nY Ok, this is ready for review I suppose. Once we merge this, I intend to do #287, #231 and then release.

I checked that both CC and FIPS pipeline pass, tests pass, and notebooks are all executable and are producing "not-completely-off" plots. :D

Enjoy the review and don't be too harsh 🤞

adamjanovsky commented 1 year ago

> The pyupgrade was done partially; or rather, some files are missing the `from __future__ import annotations` import, which enables the Python 3.9-style annotations on Python >= 3.7, which we need as we target 3.8. See the pyupgrade readme and PEP 585.

I see, this is because pyupgrade only applies these rewrites to files that have `from __future__ import annotations` at the top. I've added a flake8 plugin that checks for its absence so that this doesn't repeat in the future.

I've added these import statements and upgraded the rest of the codebase with https://github.com/crocs-muni/sec-certs/pull/275/commits/6ce7007ab5b5f721abf6d8659ef4bd54e4f61590
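
For illustration, a small example of what the upgrade enables: with the future import (PEP 563), built-in generics can be used in annotations even on Python 3.8, where PEP 585 generics are otherwise unavailable:

```python
from __future__ import annotations  # PEP 563: annotations are not evaluated at runtime


def group_ids(lines: list[str]) -> dict[str, set[str]]:
    # Without the future import, Python 3.8 would require
    # typing.List/Dict/Set here (list[str] etc. are 3.9+ only).
    groups: dict[str, set[str]] = {}
    for line in lines:
        key, _, rest = line.partition(":")
        groups.setdefault(key, set()).update(rest.split(","))
    return groups
```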

adamjanovsky commented 1 year ago

> It seems the whole PP dataset was included in the test data, was this intended? It shows up at almost 77k lines. I believe just downloading it from our site is a reasonable alternative. Although, as it is already in git, removing it now is pointless (without a rebase and history editing, the repository will have it forever).

Oh this hurts. It was added by accident. I'll probably rewrite the history...

adamjanovsky commented 1 year ago

> Connected with the root_dir handling and serialization is the DUMMY_PATH issue. I see the current solution as an acceptable one, although I think there is room for improvement; we just need to identify the core issue in the API design and assumptions to see what needs to change. I think the current situation arises because our assumptions in different places in the code are not consistent or clear, specifically assumptions about whether a dataset object is backed by its JSON (directory structure) at different points in time as it is constructed from different sources.

You're perfectly right about the cause.

> One possible way to resolve this is to say that the dataset object is always backed by its directory structure/JSON and explicitly require the path in all constructors, even the web ones. The other possible way is the opposite: to not assume the dataset is always backed, and then disable the serialization features unless a path is set.

I was actually opting for the second, which is kind of the current state. The serialization will fail unless a path is set. See

https://github.com/crocs-muni/sec-certs/blob/f14dfe3900baa0a87a51f42188646f409b9beeb9/sec_certs/serialization/json.py#L50-L51

It's just implemented with that `DUMMY_NONEXISTING_PATH` variable...
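
Paraphrasing the linked guard (a simplified sketch, not the verbatim code; names other than `DUMMY_NONEXISTING_PATH` are illustrative):

```python
from pathlib import Path

# Sentinel signalling "no backing path was ever set".
DUMMY_NONEXISTING_PATH = Path("/dummy/nonexisting/path")


class Serializable:
    json_path: Path = DUMMY_NONEXISTING_PATH

    def to_json(self) -> None:
        # Serialization is refused until a real path is assigned.
        if self.json_path == DUMMY_NONEXISTING_PATH:
            raise ValueError("json_path attribute for dataset not set")
        self.json_path.write_text("{}")  # stands in for the real dump
```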

adamjanovsky commented 1 year ago

> I dislike that the root_dir property setter on dataset classes actually serializes the dataset. I think this is highly unintuitive and unexpected. I propose a more explicit solution in a comment in the review. The TL;DR is to create a different method for moving/copying the dataset and make it the only way to change the root_dir.

I agree this can be counterintuitive. I'll take a look at that, thanks for spotting this.
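
A sketch of the explicit alternative proposed above: a dedicated copy method becomes the only way to change root_dir, so nothing serializes as a side effect of an attribute assignment. The method name is illustrative, not the final API:

```python
import shutil
from pathlib import Path


class Dataset:
    def __init__(self, root_dir: Path) -> None:
        self._root_dir = root_dir

    @property
    def root_dir(self) -> Path:
        # Read-only property: no hidden serialization on assignment.
        return self._root_dir

    def copy_dataset(self, new_root_dir: Path) -> None:
        # Explicitly copy the backing directory, then repoint the dataset.
        shutil.copytree(self._root_dir, new_root_dir, dirs_exist_ok=True)
        self._root_dir = new_root_dir
```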