adamjanovsky closed this 1 year ago
I applaud this effort! When I get more time hopefully I can help as well.
In the meantime, let me add a few pointers/issues I hit when working with the FIPS code:
* It is unclear whether fields such as `clean_cert_ids` or `algorithms` in the various certificate subobjects (PdfData and WebData) are processed or raw, what sort of cleaning was done, and what method needs to be called to do it. Specifically, this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way; this should be unified and moved, maybe even to a separate class, similar to how CC certificate ID canonicalization is separate.
* `parse_html_main`, the logic there just is not good.
* Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
> Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
Thanks for the input. This I'll probably leave for future work, this PR is going to be giant anyway. Could you please create a separate issue to address this?
Did this in #276.
I'll re-implement the folder structure of FIPS Dataset. Outline:

```
dset
├── FIPS Dataset.json
├── auxillary_datasets
│   └── algorithms
│       └── fips_algorithms.json
├── certs
│   ├── modules
│   └── policies
│       ├── pdf
│       └── txt
└── web
    ├── fips_modules_active.html
    ├── fips_modules_historical.html
    └── fips_modules_revoked.html
```
@J08nY looking at the code here and there, I planned the following: introduce `artifact_download_methods`, see https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L57-L58

This comes with some advantages: currently, there are `_download_policies()` and `_download_modules()` (https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L94-L108, https://github.com/crocs-muni/sec-certs/blob/9433658b55c6025567fe4101dae017afc9b4f8a0/sec_certs/dataset/fips.py#L78-L92) that are very similar, with just one variable changing in multiple places plus some logging; the same problem applies to the methods executed on the individual samples.

Some disadvantages that cross my mind: first, it relies on `getattr("lame string of some variable")`. Second, if we break the pattern, we would need to introduce a new design... Now's the time to speak up if you have comments/tips/whatever :).
Interesting approach but I am worried about a few things:
I guess I am really not sure what problem you are trying to solve by passing the callables. I will have to look at the code.
Looking at the code a bit. I think now I would just match the FIPS Dataset API to the CC Dataset API manually, even at the cost of some code duplication. I don't feel like it is a good time to rework the handling like this during this FIPS unification.
Ah yes, let's do this: https://refactoring.guru/design-patterns/template-method and not pass callables around. Just call the methods directly; even if there might be some duplication, I think it makes sense for future flexibility.
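The template-method shape for the two near-identical download methods could look roughly like this. This is only a sketch; the class and method names below (other than the pattern itself) are illustrative, not the actual sec-certs API:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class Dataset(ABC):
    """Base class fixing the download skeleton (the 'template method')."""

    @abstractmethod
    def _download_artifact(self, cert: str) -> str:
        """Subclass hook: fetch a single artifact for one certificate."""

    def download_all_artifacts(self, certs: list[str]) -> list[str]:
        # The invariant part (looping, and in real code logging/retries)
        # lives here once, instead of being copy-pasted into
        # _download_policies() / _download_modules().
        return [self._download_artifact(cert) for cert in certs]


class PolicyDataset(Dataset):
    def _download_artifact(self, cert: str) -> str:
        return f"policy for {cert}"  # real code would hit the FIPS site


print(PolicyDataset().download_all_artifacts(["cert-1", "cert-2"]))
# -> ['policy for cert-1', 'policy for cert-2']
```

This avoids both the `getattr("...")` string dispatch and passing callables around: each subclass overrides one well-named hook, and the shared skeleton stays in one place.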
I was thinking about the logic of the download testing. At the moment, the situation is as follows: the download test calls `pytest.xfail` if the download did not succeed, so a broken download only shows up as an `xfail`. The problem with the tested functionality is discovered only on repeated failures.
@J08nY Some resolution of the problem that you've pointed to is in https://github.com/crocs-muni/sec-certs/pull/275/commits/90521384dc416b8dfb22f12b27e22e2f14ad50d6
@J08nY Do you have any idea how the cert_id normalization in FIPS works? I basically finished the refactoring, apart from:
IMO, the ID of the certificate itself is fully determined by the contents from the module html page. What needs to get cleaned are various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?
So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in `FIPSCertificate`. Does this make sense? Formally, there's no cert_id normalization or anything like that, you just delete detected foreign cert_ids that you consider weak in a sense.
Also, I'll probably try to convince @GeorgeFI to take a look at building a graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do this also here.
What do you think?
> IMO, the ID of the certificate itself is fully determined by the contents from the module html page. What needs to get cleaned are various cert_ids detected as keywords in various sources: caveat text, pdf of the policy, etc. These point to other certificates, but are noisy and may point to Algorithms, or may just represent arbitrary numbers, right?
Yes, this is what was done in the original code in a way. The list of "certlike matches" was cleaned by removing the algorithm numbers the certificate references, either from the webpage or from the policy in some table (that has "algo" in the header :))
> So the goal, prior to constructing the graph of references, would be to prune the foreign cert_ids that are detected in `FIPSCertificate`. Does this make sense? Formally, there's no cert_id normalization or anything like that, you just delete detected foreign cert_ids that you consider weak in a sense.
Yes this is what was done and is I think important to keep it that way, otherwise the reference graph has way too many false positives.
> Also, I'll probably try to convince @GeorgeFI to take a look at building a graph of references, and also the transitive vulnerabilities. He did this for CC, so it makes sense to let him do this also here.
> What do you think?
I don't think this is necessary, you can basically use `ReferenceFinder` the same way as it is (and as it is in CC), I actually sort of unified this part already, you just need to provide it with a source of cert_id references for a given graph. Originally there were two sources (and I think it makes sense to keep them), one from the page and one from the security policy. W.r.t. the transitive vulns computation, that should also be doable just with some renames to point to the correct clean cert_ids.
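The pruning step discussed above might be sketched as follows. The helper below is hypothetical (it is not the actual `ReferenceFinder` API); it only illustrates dropping cert-id-like matches that collide with algorithm numbers, mirroring what the original code did with the "algo" tables:

```python
def prune_foreign_cert_ids(detected_ids: set, algorithm_ids: set) -> set:
    # Keep only cert-id-like matches that do not collide with algorithm
    # numbers referenced by the certificate (from the module web page or
    # from tables in the security policy). This reduces false edges in
    # the reference graph.
    return {cert_id for cert_id in detected_ids if cert_id not in algorithm_ids}


detected = {"3928", "1234", "2057"}   # cert-id-like keyword matches
algorithms = {"1234"}                 # e.g. an AES algorithm certificate number
print(sorted(prune_foreign_cert_ids(detected, algorithms)))
# -> ['2057', '3928']
```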
@J08nY Ok, this is ready for review I suppose. Once we merge this, I intend to do #287, #231 and then release.
I checked that both CC and FIPS pipeline pass, tests pass, and notebooks are all executable and are producing "not-completely-off" plots. :D
Enjoy the review and don't be too harsh 🤞
The pyupgrade was done partially, or rather, some files are missing the `from __future__ import annotations` which enables the Python 3.9-style annotations on Python >= 3.7, which we need as we target 3.8. See the pyupgrade readme and PEP 585.
I see, this is because `pyupgrade` only works on files that have `from __future__ import annotations` at the top. I've added a flake8 plugin that checks for its absence so that this doesn't repeat in the future.
I've added these import statements and upgraded the rest of the codebase with https://github.com/crocs-muni/sec-certs/pull/275/commits/6ce7007ab5b5f721abf6d8659ef4bd54e4f61590
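For illustration, the `__future__` import makes the PEP 585 built-in generics usable in annotations even on Python 3.7/3.8, because PEP 563 keeps annotations as unevaluated strings (the function below is just a toy example):

```python
from __future__ import annotations


def count_algorithms(ids: list[str]) -> dict[str, int]:
    # Without the future import, list[str] / dict[str, int] in the
    # signature would raise TypeError at import time on Python 3.8;
    # with it, the annotations are stored as strings and never evaluated.
    counts: dict[str, int] = {}
    for algorithm_id in ids:
        counts[algorithm_id] = counts.get(algorithm_id, 0) + 1
    return counts


print(count_algorithms(["AES", "SHA", "AES"]))
# -> {'AES': 2, 'SHA': 1}
```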
It seems the whole PP dataset was included in the test data, was this intended? It shows up at almost 77k lines. I believe just downloading it from our site is a reasonable alternative. Although as it is already in git, removing it now is pointless (without a rebase and history editing the repository will have it forever).
Oh this hurts. It was added by accident. I'll probably rewrite the history...
Connected with the root_dir handling and serialization is the DUMMY_PATH issue. I see the current solution as an acceptable one, although I think there is room for improvement, we just need to identify the core issue in the API design and assumptions to see what needs to change. I think the current situation arises because our assumptions in different places in the code are not consistent or clear, specifically assumptions about whether a dataset object is backed by its JSON (directory structure) at different points in time as it is constructed from different sources.
You're perfectly right about the cause.
One possible way to resolve this is to say that the dataset object is always backed by its directory structure/JSON and explicitly require the path in all constructors, even the web ones. Another possible way is the opposite: not to assume the dataset is always backed, and then disable the serialization features unless a path is set.
I was actually opting for the second, which is kind of the current state: the serialization will fail unless a path is set. It's just implemented with that `DUMMY_NONEXISTING_PATH` variable...
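A sketch of what the explicit guard could look like without the placeholder path. The names are illustrative, not the actual sec-certs implementation:

```python
from __future__ import annotations

import json
from pathlib import Path


class Dataset:
    """Variant two: a dataset may be in-memory only, serialization is gated."""

    def __init__(self, root_dir: Path | None = None) -> None:
        # None means "not backed by a directory", replacing the
        # DUMMY_NONEXISTING_PATH sentinel with an explicit state.
        self._root_dir = root_dir

    def to_json(self) -> None:
        if self._root_dir is None:
            raise ValueError("Dataset is not backed by a directory; set a path first.")
        (self._root_dir / "dataset.json").write_text(json.dumps({}))
```

The advantage over a dummy path is that the failure mode is an explicit, documented exception rather than an attempt to write to a path that does not exist.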
I dislike that the root_dir property setter on dataset classes actually serializes the dataset. I think this is highly unintuitive and unexpected. I propose a more explicit solution in a comment in the review. The TL;DR; is to create a different method for moving/copying the dataset and make it the only method to change the root_dir.
I agree this can be counterintuitive. I'll take a look at that, thanks for spotting this.
High-level plan
* `pytest`: separate download tests from others, refactor test structure (split into cc/fips/others) (#76, #79)
Tests

* `CPEDataset` and `CVEDataset`
* Or forbid it and demand explicit paths (that will not get serialized omg). Either way, act consistently.
* Test download on resources that we trust (e.g. our server)
* Forbid download on non-download tests by an auto-use fixture that raises an exception on `requests.get`/`requests.post`
FIPS
Processing steps progress:
* `get_certs_from_web()`
* `process_auxillary_datasets()`
* `download_all_artifacts()`
* `convert_all_pdfs()`
* `analyze_certificates()`
* `_compute_references()`
* `_compute_transitive_vulnerabilities()`
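Put together, the steps above run in sequence; the driver below only illustrates the ordering (a dummy recorder stands in for the real `FIPSDataset`, and it is assumed, per the method names, that `analyze_certificates()` internally invokes `_compute_references()` and `_compute_transitive_vulnerabilities()`):

```python
def run_pipeline(dset) -> None:
    # Sequencing of the processing steps listed above.
    dset.get_certs_from_web()
    dset.process_auxillary_datasets()
    dset.download_all_artifacts()
    dset.convert_all_pdfs()
    dset.analyze_certificates()


class RecorderDataset:
    """Dummy stand-in that records the order in which steps were called."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        return lambda: self.calls.append(name)


dset = RecorderDataset()
run_pipeline(dset)
print(dset.calls)
# -> ['get_certs_from_web', 'process_auxillary_datasets',
#     'download_all_artifacts', 'convert_all_pdfs', 'analyze_certificates']
```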
Random notes to do there:

* Rename `redo` -> `fresh`, implement re-attempting computations on failed samples
* `FIPSCertificate` and to `FIPSDataset`
Done outside of the processing steps:

- [ ] Add examples of plot graph into FIPS reference notebook
- [ ] Verify if FIPS webpages are loading on scroll (historical modules). If yes, figure out how to force full load with BeautifulSoup
- [ ] `FIPSDataset` contents
- [ ] `FIPSAlgorithmDataset` should reside in own json in auxillary datasets

Misc
- [ ] `fresh` on places where it does not make sense
- [ ] Sanity check
Still part of refactoring, but to-be-addressed by separate PRs:

- [ ] `from __future__ import annotations`